d70-t opened this issue 3 years ago
I set the "big" label, as there is no simple solution for that, according to my understanding.
Synthesis from tandem session with @rico-hengst, call with @d70-t and discussion within the database workgroup
[the most crucial points of the comment should be captured in the comment above]
I don't see how FTP will help (or even be able) to fulfill the requirement to request only subsets of the data. Can you elaborate on that?
Here's my reasoning:
Let's say we have a dataset from one instrument of one campaign which is 100 TB in size. As a user, I might want to analyze only a certain part of the captured data, as a time series over the whole campaign period. A typical workflow would look conceptually like the following:
```python
ds = get_dataset("campaign/platform/instrument/Level2.data")
result = ds.brightness_temperature.select(<box_around_center_of_fov>).running_mean("1h")
plot(result)  # this command is the first to actually require data
```
Thus, as a user, I somehow have to access the whole dataset, but will most likely not have more than some 100 GB or so free on my laptop, and probably not even the full 100 TB on my institute's $WORK directory. It is also unlikely that the dataset creator has shaped the dataset exactly to my needs beforehand, because then my analysis would not be particularly original. However, the selection of data which I actually require will usually be far smaller than the full dataset.
I am not aware of any smart FTP-based protocols which would integrate with this use case in a way that only the required amounts of data are retrieved, i.e. where obtaining the data is delayed until it is absolutely required, and the transferred data volume is thus reduced to what is actually needed.
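For illustration, here is a minimal sketch of how the conceptual workflow above can run over HTTP with off-the-shelf tooling. It assumes the dataset is published as a zarr store behind a plain HTTP(S) server and that xarray, zarr, dask and fsspec/aiohttp are installed; the URL, variable names, coordinate names and the rolling-window size are hypothetical.

```python
# Minimal sketch, assuming the 100 TB dataset is published as a zarr store
# behind a plain HTTP(S) server. URL, variable and coordinate names are
# hypothetical; requires xarray, zarr, dask and fsspec/aiohttp.
import xarray as xr

# Opening the store only fetches the (small) metadata, not the chunk data.
ds = xr.open_zarr("https://data.example.org/campaign/platform/instrument/Level2.zarr")

# Select a small box around the center of the field of view and build the
# computation lazily; still no chunk data has been transferred.
box = ds["brightness_temperature"].sel(x=slice(-5, 5), y=slice(-5, 5))
result = box.mean(["x", "y"]).rolling(time=60, center=True).mean()  # ~1 h running mean at 1 min sampling

# Only now, when values are actually needed for plotting, are the required
# chunks downloaded, typically a tiny fraction of the full dataset.
result.plot()
```

With typical chunk sizes, the selection above translates into HTTP requests for exactly those chunks that intersect the requested box and time range, rather than a transfer of the whole store.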
Apart from those practical issues, I also have doubts that FTP will actually perform better than HTTP. According to the website of Daniel Stenberg (the creator of curl, maybe the most widely used library for retrieving things over either HTTP or FTP), FTP potentially has an advantage when a single small static file is requested from a server. For a single larger static file it should be about a tie, and for many files HTTP should be the faster one. On top of that, HTTP/3, which brings several significant performance improvements, is about to become widely used. Thus, I'd expect the advantage of HTTP to grow even further.
Any solution I can think of which would enable a client to retrieve only a client-defined part of a dataset will either require requesting a dynamically generated (i.e. not static) resource (like OPeNDAP) or making many requests to static resources (like zarr or Cloud Optimized GeoTIFF). In both cases, the comparison favors HTTP over FTP.
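To make the "many requests to static resources" case concrete, here is a hedged sketch of what such access looks like at the HTTP level; the URLs and the chunk key are hypothetical. Each zarr chunk is an ordinary static file, and HTTP Range requests additionally allow fetching only part of a single large static file, which is what Cloud Optimized GeoTIFF relies on.

```python
# Sketch of the underlying HTTP traffic for "many requests to static resources".
# URLs and chunk key are hypothetical; any plain web server or object store works.
import requests

# One zarr chunk is just a static file, fetched with an ordinary GET.
r = requests.get("https://data.example.org/Level2.zarr/brightness_temperature/0.0.0")
chunk_bytes = r.content

# For a single large static file (e.g. a Cloud Optimized GeoTIFF), an HTTP Range
# request retrieves only the needed byte range instead of the whole file.
r = requests.get("https://data.example.org/Level2.tif",
                 headers={"Range": "bytes=0-1023"})
partial = r.content  # status 206 Partial Content if the server supports ranges
```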
This use case covers two sub-cases:
As an operator of imaging sensors, I want to provide large datasets (i.e. 100 TB and above) to other users. I expect that users do not want to download the datasets entirely, but only access subsets of the data.