tecosaur / DataToolkit.jl

Reproducible, flexible, and convenient data management
https://tecosaur.github.io/DataToolkit.jl

Multi-file loader #39

Open jfb-h opened 4 months ago

jfb-h commented 4 months ago

As recently discussed on Zulip, it would be nice to have a loader that can load multiple files sharing the same schema, something already supported by e.g. CSV.jl and Arrow.jl. So I thought I'd make an issue to track this :)

tecosaur commented 1 month ago

Thanks for the issue. It will probably take a while for me to get to this properly, but for the record, this is rolling around in the back of my mind.

I want to support this, but I also want to do it properly (using a cached Merkle-tree hash for starters, though more thought is needed).

tecosaur commented 1 month ago

I've been thinking more about this, specifically the case of a directory of files. I'm wondering if introducing a DirPath as a counterpart to FilePath could be a good way of handling this.

jfb-h commented 1 month ago

That sounds sensible. Would you then chain a directory loader and a specific file loader? Or would you just pass the directory to a loading function which is then free to process its contents in any way?

tecosaur commented 2 weeks ago

We now have DirPath! 🥳

This is a big step, and it's been done properly: merkle tree hashing for integrity, with caching to avoid long waits for repeated work on each access/check.
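
For readers unfamiliar with the idea, here is a rough, language-agnostic sketch of Merkle-tree hashing over a directory with a digest cache. It is written in Python purely for illustration and is not DataToolkit's implementation. The names `file_digest` and `merkle_digest`, the choice of SHA-256, and the cache key of (path, size, mtime) are all assumptions, and it only considers regular files and directories.

```python
import hashlib
import os


def file_digest(path, cache):
    """Hash one file, reusing a cached digest when size and mtime are unchanged."""
    st = os.stat(path)
    # Hypothetical cache key: any change in size or mtime forces a re-hash.
    key = (os.path.abspath(path), st.st_size, st.st_mtime_ns)
    if key not in cache:
        h = hashlib.sha256(b"file\0")  # tag so files and directories never collide
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        cache[key] = h.digest()
    return cache[key]


def merkle_digest(path, cache):
    """Digest of a file, or of a directory built from its children's digests."""
    if os.path.isfile(path):
        return file_digest(path, cache)
    h = hashlib.sha256(b"dir\0")
    # Sort entries so the result does not depend on directory listing order.
    for name in sorted(os.listdir(path)):
        child = merkle_digest(os.path.join(path, name), cache)
        encoded = name.encode("utf-8")
        # Length-prefix the name so adjacent (name, digest) pairs stay unambiguous.
        h.update(len(encoded).to_bytes(8, "big") + encoded + child)
    return h.digest()
```

Calling `merkle_digest(some_dir, cache)` twice with the same `cache` dictionary only re-reads files whose size or modification time changed in between, which is what makes repeated integrity checks cheap in this sketch. A real implementation would also need to persist the cache and decide how to treat symlinks and other special files.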

Now that we have an easy way to arrive at a collection of items, we can start thinking about the next step: how to handle them in bulk...