subugoe / metar

Documentation and suggested best practices for data analysis at WAG
https://subugoe.github.io/metaR
MIT License

streamline data situation #53

Open · maxheld83 opened 4 years ago

maxheld83 commented 4 years ago

We seem to be running into a similar problem in several projects, including http://github.com/subugoe/hoad/, http://github.com/subugoe/openairegraph/, and the Crossref dump situation (http://github.com/njahn82/cr_dump/):

There's big-ish (>1 MB) serialised data, usually JSON, CSV, or the same compressed, which is either/or

(I'm not talking about databases here; that's a separate concern.)

These files cause several problems and run into several limitations:

Some straightforward solutions might be:

I think we need something else that neatly abstracts all of this away. There's probably a good solution out there already.

One avenue to pursue would be git lfs.
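For illustration, a minimal sketch of what that could look like in .gitattributes, assuming we track the compressed dumps by extension (the exact glob patterns are assumptions, not what any of our repos currently use):

```
# store big serialised dumps as LFS pointers, not as git blobs
*.json.gz filter=lfs diff=lfs merge=lfs -text
*.csv.gz  filter=lfs diff=lfs merge=lfs -text
```

One caveat: git lfs only deduplicates storage and transfer; it knows nothing about rows or serialisation formats, so it wouldn't help with diffing.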

Ideally, we should have a solution which understands serialised data and knows how to diff rows (order does not matter).
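A minimal sketch of such an order-insensitive row diff, using dplyr (the columns here are made up for illustration):

```r
library(dplyr)

old <- tibble::tribble(
  ~doi,     ~year,
  "10.1/a", 2019,
  "10.1/b", 2020
)
new <- tibble::tribble(
  ~doi,     ~year,
  "10.1/b", 2020,
  "10.1/a", 2019,
  "10.1/c", 2021
)

# rows added since the old dump, irrespective of row order
added <- anti_join(new, old, by = names(old))
# rows that disappeared
removed <- anti_join(old, new, by = names(old))
```

Reordering alone produces an empty diff here; only the genuinely new row ("10.1/c") shows up in `added`.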

Anyway, this shouldn't be too complicated and we might start with something small.

I'm going to look into this when I get a chance. I think this could save us all a lot of time.

maxheld83 commented 4 years ago

Among other things, the repeated downloads of the big dumps via download.file() should be transparently cached.
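A minimal sketch of what that could look like, assuming a simple file-based cache keyed by file name (`cached_download` and the cache location are hypothetical, not existing metar API):

```r
# transparently cached drop-in for download.file():
# only hits the network if the file is not in the local cache yet
cached_download <- function(url,
                            cache_dir = tools::R_user_dir("metar", "cache")) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  destfile <- file.path(cache_dir, basename(url))
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, mode = "wb")
  }
  destfile  # return the cached path for downstream readers
}
```

This naive version never invalidates the cache; something like memoise or the pins package might handle staleness and collaborative use more robustly.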

maxheld83 commented 4 years ago

This would actually also be a feature for a lot of users, who might face the same problem when running this in CI or collaboratively.