d70-t opened this issue 3 years ago
I set the "big" label, as there is no simple solution for that, according to my understanding.
Synthesis from tandem session with @rico-hengst, call with @d70-t and discussion within the database workgroup
[the most crucial points of the comment should be captured in the comment above]
I don't see how FTP will help (or even be able) to fulfill the requirement to request only subsets of the data. Can you elaborate on that?
Here's my reasoning:
Let's say we have a dataset from one instrument of one campaign which is 100 TB in size. As a user, I might want to analyze only a certain part of the captured data, as a time series over the whole campaign period. A typical workflow would look conceptually like the following:
```python
ds = get_dataset("campaign/platform/instrument/Level2.data")
result = ds.brightness_temperature.select(<box_around_center_of_fov>).running_mean("1h")
plot(result)  # this command is the first to actually require data
```
Thus, as a user, I somehow have to access the whole dataset, but will most likely not have more than some 100 GB or so free on my laptop, and probably not even the full 100 TB on my institute's $WORK directory. It is also unlikely that the dataset creator has shaped the dataset exactly to my needs beforehand, because then my analysis would not be particularly original. However, the selection of data which I actually require will usually be far smaller than the full dataset.
I am not aware of any smart FTP-based protocols which would integrate with this use case in a way that only the required amounts of data are retrieved, i.e. where obtaining the data is delayed until it is absolutely required, and the transferred data volume is thus reduced to what is actually needed.
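For illustration, here is a minimal sketch of how the conceptual workflow above can run over HTTP with off-the-shelf tooling. It assumes the dataset is published as a zarr store behind a plain HTTP(S) server and that xarray, zarr, dask and fsspec/aiohttp are installed; the URL, variable names, coordinate names and the rolling-window size are hypothetical.

```python
# Minimal sketch, assuming the 100 TB dataset is published as a zarr store
# behind a plain HTTP(S) server. URL, variable and coordinate names are
# hypothetical; requires xarray, zarr, dask and fsspec/aiohttp.
import xarray as xr

# Opening the store only fetches the (small) metadata, not the chunk data.
ds = xr.open_zarr("https://data.example.org/campaign/platform/instrument/Level2.zarr")

# Select a small box around the center of the field of view and build the
# computation lazily; still no chunk data has been transferred.
box = ds["brightness_temperature"].sel(x=slice(-5, 5), y=slice(-5, 5))
result = box.mean(["x", "y"]).rolling(time=60, center=True).mean()  # ~1 h running mean at 1 min sampling

# Only now, when values are actually needed for plotting, are the required
# chunks downloaded, typically a tiny fraction of the full dataset.
result.plot()
```

With typical chunk sizes, the selection above translates into HTTP requests for exactly those chunks that intersect the requested box and time range, rather than a transfer of the whole store.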
Apart from those practical issues, I also have doubts that FTP will actually perform better than HTTP. According to the website of Daniel Stenberg (the creator of curl, maybe the most widely used library for retrieving things over either HTTP or FTP), FTP potentially has an advantage when a single small static file is requested from a server. For a single larger static file it should be about a tie, and for many files HTTP should be the faster one. On top of that, HTTP/3, which brings several significant performance improvements, is about to become widely used. Thus, I'd expect the advantage of HTTP to grow even further.
Any solution I can think of which would enable a client to retrieve only a client-defined part of a dataset will either require requesting a dynamically generated (i.e. not static) resource (like OPeNDAP) or making many requests to static resources (like zarr or Cloud Optimized GeoTIFF). In both cases, the comparison favors HTTP over FTP.
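To make the "many requests to static resources" case concrete, here is a hedged sketch of what such access looks like at the HTTP level; the URLs and the chunk key are hypothetical. Each zarr chunk is an ordinary static file, and HTTP Range requests additionally allow fetching only part of a single large static file, which is what Cloud Optimized GeoTIFF relies on.

```python
# Sketch of the underlying HTTP traffic for "many requests to static resources".
# URLs and chunk key are hypothetical; any plain web server or object store works.
import requests

# One zarr chunk is just a static file, fetched with an ordinary GET.
r = requests.get("https://data.example.org/Level2.zarr/brightness_temperature/0.0.0")
chunk_bytes = r.content

# For a single large static file (e.g. a Cloud Optimized GeoTIFF), an HTTP Range
# request retrieves only the needed byte range instead of the whole file.
r = requests.get("https://data.example.org/Level2.tif",
                 headers={"Range": "bytes=0-1023"})
partial = r.content  # status 206 Partial Content if the server supports ranges
```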
This use case covers two sub-cases:
As an operator of imaging sensors, I want to provide large datasets (i.e. 100 TB and above) to other users. I expect that users do not want to download the datasets entirely, but only access subsets of the data.