oxinabox / DataDeps.jl

reproducible data setup for reproducible science

RFC: DataSets.jl integration #144

Open johnnychen94 opened 2 years ago

johnnychen94 commented 2 years ago

DataDeps is quite reliable for downloading content, but it can sometimes be troublesome to manage a dataset registry when dataset information is hardcoded as Julia source code. I think the DataSets.jl TOML specification is quite promising as a way to eventually provide a general registry for datasets, so maybe we can integrate DataDeps into DataSets as a downloading driver.

This is just a proof of concept, I want to know your thoughts on this before I start polishing the details.

julia> using DataSets, DataDeps

julia> ds = dataset("Pi")
name = "Pi"
uuid = "76e82d27-3914-4472-abb6-38d476023211"

[storage]
driver = "DataDeps"
remote_path = ["https://www.angio.net/pi/digits/10000.txt"]
sha256 = "1f08958df70c30bbf3f7205ad5f9b2b9430e3378b4e4e6300063bd9fbd83d6e3"
localdir = "~/.julia/datadeps/Pi"

julia> ds_blobtree = open(ds)
┌ Info: Downloading
│   source = "https://www.angio.net/pi/digits/10000.txt"
│   dest = "/Users/jc/.julia/datadeps/Pi/10000.txt"
│   progress = NaN
│   time_taken = "0.0 s"
│   time_remaining = "NaN s"
│   average_speed = "1.908 MiB/s"
│   downloaded = "9.767 KiB"
│   remaining = "∞ B"
└   total = "∞ B"
📂 Tree . @ /Users/jc/.julia/datadeps/Pi
 📄 10000.txt

julia> read(ds_blobtree["10000.txt"], String)
"3.14159265358979
...

cc: @c42f

oxinabox commented 2 years ago

Cool. I have not yet had a chance to look into DataSets.jl. Two questions:

  1. What value in particular does DataDeps add here?
  2. Should this live here, in DataSets, or in a 3rd package? (Right now DataDeps is very stable and very low-maintenance, which is about all I have time for with everything else I do.)
johnnychen94 commented 2 years ago

What value in particular does DataDeps add here?

A good question. I'm not sure I fully understand the design goal of DataSets.jl; I originally thought it was a way to separate data-source configuration from the data-fetching implementation. But then I checked https://github.com/JuliaComputing/DataSets.jl/issues/6 and it seems DataSets also wants to solve the how-to-open issue, and DataDeps can't help with that part at all.

c42f commented 2 years ago

Very nice, thanks for this proof of concept!

I originally thought it is a way to separate data source configuration and data source fetching implementation

Yes, this is one design goal. The benefit of having DataSets integration is that you can use the DataSets API to process data which happens to be stored as a DataDep, but you can equally well use that same API to process a dataset that is stored in various other ways: inside an Artifact, on local disk, fetched on demand from S3, etc.
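To make that backend-independence point concrete, here's a plain-Julia sketch (no DataSets dependency; the names are mine, not DataSets API): code written against a generic `IO` interface doesn't care which storage backend supplied the bytes.

```julia
# Code written against a generic IO interface works the same regardless of
# which storage backend supplied the bytes.
count_digits(io::IO) = count(isdigit, read(io, String))

# backend 1: a local file
path = joinpath(mktempdir(), "pi.txt")
write(path, "3.14159")
n_file = open(count_digits, path)

# backend 2: an in-memory buffer (stand-in for, say, bytes fetched from S3)
n_mem = count_digits(IOBuffer("3.14159"))

@assert n_file == n_mem == 6   # "3.14159" contains six digits
```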

it seems that DataSets also wants to solve the how-to-open issue, for that part, DataDeps can't help at all.

This is true, but it's not a problem. IIUC a DataDep always presents data as a directory. That would be reflected into Julia as the DataSets.BlobTree type, and the program should expect that tree-like type.

Should this live here, or in DataSets, or in a 3rd package.

I think either here or in a 3rd package — DataSets itself can't take dependencies on all storage backends, as there's essentially an unlimited number of those. We could bless a few storage backends, but for now I've limited that to the filesystem and data embedded along with the metadata.

oxinabox commented 2 years ago

Maybe DataSets just plain replaces DataDeps? Maybe it is the new hotness that people should be using going forward. I thought Artifacts might be that, but Artifacts are not suitable unless your data sits on a public server as a gzipped tarball served over plain HTTPS.

e.g. an Artifact from BinaryBuilder is good for Artifacts; a zip file stored on S3, accessed via an AWSS3.S3Path as the URL, is not good for Artifacts but is good for DataDeps. I suspect such a thing could be made good for DataSets too? S3Path supports Base.download, and IIRC it plugs into FilePathsBase so that a path beginning with s3:// gets turned into an S3Path. It would be nice if that were enough to make it "just work"; if not, we could talk about adding whatever support is needed to AWSS3.jl.

johnnychen94 commented 2 years ago

I'm thinking of adding Downloads to DataSets.jl, and then providing a FileIO-like downloader registry (also in DataSets.jl) to support lazy loading of the driver packages. Does this sound good to you?

If this sounds good, then I believe we need a new optional entry "downloader" that can coexist with "driver". By default, when "downloader" is not provided, it uses whatever downloader is registered in DataSets.jl to ensure the data exists, and then uses the specified "driver" to open the data. This way we don't need to hardcode the open output as BlobTree.

For downloading support, we may also need to consider some streaming cases via HTTP API.
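A rough sketch of what such a scheme-keyed registry might look like; every name below is invented for illustration and is not actual DataSets.jl API:

```julia
# Hypothetical FileIO-like downloader registry: map a URL scheme to a
# function that fetches `url` into `dest`.
const DOWNLOADERS = Dict{String,Function}()

register_downloader!(scheme::AbstractString, f) = (DOWNLOADERS[scheme] = f)

function fetch_data(url::AbstractString, dest::AbstractString)
    scheme = first(split(url, "://"))
    haskey(DOWNLOADERS, scheme) || error("no downloader registered for scheme $scheme")
    return DOWNLOADERS[scheme](url, dest)
end

# an "https" entry could delegate to the stdlib Downloads.download;
# here a toy "file" downloader just copies a local path for demonstration.
register_downloader!("file", (url, dest) -> cp(split(url, "://")[2], dest))
```

Driver packages could then register themselves lazily on load, the same way FileIO maps formats to loader packages.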

c42f commented 2 years ago

A zip file stored on S3, accessed via using a AWSS3.S3Path as the url is not good for artifacts, but is good for DataDeps. I suspect such a thing might be able to be made good for for DataSets?

Yes, this is exactly the kind of thing DataSets is good for (also mapping in S3 prefixes as BlobTrees works).

FileIO-like downloader registry (also in DataSets.jl) to support lazy package loading for the driver packages

DataSets already has something like this for drivers :-D See https://github.com/JuliaComputing/DataSets.jl/pull/20

Note that it's very easy to run into world age issues with lazy loading.

If this sounds good, then I now believe we need a new optional entry "downloader" that could co-exists with "driver". By default, when "downloader" is not provided, it uses whatever available downloader registered in DataSets.jl to ensure the data exists, and then use the specified "driver" to open the data.

From my point of view, the "downloader" is the "driver"; they're not separate. If that seems strange, consider that DataSets doesn't require data to be immutable or local: its model is not "download everything one time and use the local cache after that". It could make sense to have a driver type which has "downloader" as a sub-key and makes some strong assumptions about the API of that downloader. But I suspect we'll need custom shim code for each downloader module, so making this purely declarative might not work (then we're back to "just write a different driver for each downloader"). There's probably some common cache-management code we should share if we're going to support multiple remote backends.
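For instance, that sub-key shape might look something like this hypothetical Data.toml fragment (the "RemoteCache" driver name and the key layout are invented for discussion, not existing DataSets.jl configuration):

```toml
[storage]
# the driver owns opening the data and managing the local cache...
driver = "RemoteCache"
localdir = "~/.julia/datadeps/Pi"

# ...and delegates fetching to a declared downloader
[storage.downloader]
type = "DataDeps"
remote_path = ["https://www.angio.net/pi/digits/10000.txt"]
sha256 = "1f08958df70c30bbf3f7205ad5f9b2b9430e3378b4e4e6300063bd9fbd83d6e3"
```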

BTW we've been discussing similar use cases over at https://github.com/JuliaComputing/DataSets.jl/issues/26#issuecomment-895218684. I think it would be neat to unify the use cases of RemoteFiles.jl and DataDeps.jl within a common remote downloading interface in DataSets.jl. DataSets.jl can certainly take a dependency on Downloads.jl to make this all work!