tecosaur / DataToolkit.jl

Reproducible, flexible, and convenient data management
https://tecosaur.github.io/DataToolkit.jl

Usecases question #48

Open aplavin opened 1 week ago

aplavin commented 1 week ago

Nice to see a modern take on dataset handling in Julia! I've been looking at DataToolkit, trying to understand how to apply it and what specific advantages it would bring. I have three different use cases in mind, and cannot really see how to plug DataToolkit into any of them. I'm briefly outlining them below; any suggestions are welcome!

  1. A small, sporadically updated table of about 500 rows. Currently, I just put a CSV file into a data folder of the Julia package and provide a function that reads it into a Julia table with some minor cleanup. What can DataToolkit improve here?

  2. An online collection of publicly available tables. For a specific example, astronomical catalogs at https://vizier.cds.unistra.fr. Currently (in VirtualObservatory.jl) I provide a function that's basically download_and_read_table(catalog_id::String) with some conveniences. There are obvious issues with that: the dataset is downloaded anew every time, and one cannot access a dataset without internet or when the archive is down, even if it was downloaded previously. Some transparent caching would be nice.

  3. A large, well-structured collection of files (tables, images, ...), think hundreds of GBs. Currently, I manually ensure that the collection is available on the machine I need to work on, and have an interface like MyDataset("path-to-the-directory"). Would be nice to have a per-machine config file so that the path is defined there, and then MyDataset() automatically finds it. Also, maybe some basic presence/sanity checks...
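
To make use case 2 concrete, here is a minimal sketch of the kind of transparent caching described there, written in plain Julia with only the Downloads standard library. It does not use DataToolkit's API and is not taken from VirtualObservatory.jl. The names cached_download and CACHE_DIR, the cache location, and the fetch keyword are all hypothetical and exist only for illustration.

```julia
using Downloads  # standard library

# Hypothetical cache location (an assumption of this sketch, not a convention
# of any package mentioned above).
const CACHE_DIR = joinpath(homedir(), ".cache", "hypothetical_catalogs")

# Return a local path for `url`, downloading only if no cached copy exists.
# `key` identifies the dataset (for example a catalog id).
# `fetch` is any function (url, destination) -> anything; it defaults to
# Downloads.download and is a keyword only so the sketch can be tried offline.
function cached_download(url::AbstractString, key::AbstractString;
                         cache_dir::AbstractString = CACHE_DIR,
                         fetch = Downloads.download)
    mkpath(cache_dir)
    # Catalog ids may contain slashes, so flatten them into a file name.
    path = joinpath(cache_dir, replace(key, '/' => '_'))
    if !isfile(path)
        tmp = path * ".part"
        fetch(url, tmp)               # download under a temporary name first
        mv(tmp, path; force = true)   # so a failed download never looks cached
    end
    return path
end
```

With something like this, a second call with the same key returns the existing file without touching the network, so a previously downloaded table stays readable offline or while the archive is down. The sketch deliberately ignores cache invalidation and checksums, which are exactly the parts where a dedicated tool would be expected to help.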