subugoe / metar

Documentation and suggested best practices for data analysis at WAG
https://subugoe.github.io/metaR
MIT License

streamline data situation #53

Open · maxheld83 opened 4 years ago

maxheld83 commented 4 years ago

We seem to be running into a similar problem in several projects, including http://github.com/subugoe/hoad/, http://github.com/subugoe/openairegraph/, and the Crossref dump situation (http://github.com/njahn82/cr_dump/):

There's big-ish (>1 MB) serialised data, usually JSON, CSV, or the same compressed, which is either/or

(I'm not talking about databases here; that's a separate concern.)

These files cause several problems and run into several limitations:

Some straightforward solutions might be:

I think we need something else that neatly abstracts all of this away. There's probably a good solution out there already.

One avenue to pursue would be git lfs.
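For illustration, a minimal sketch of what that could look like in .gitattributes, assuming we track the compressed dumps by extension (the exact glob patterns are assumptions, not what any of our repos currently use):

```
# store big serialised dumps as LFS pointers, not as git blobs
*.json.gz filter=lfs diff=lfs merge=lfs -text
*.csv.gz  filter=lfs diff=lfs merge=lfs -text
```

One caveat: git lfs only deduplicates storage and transfer; it knows nothing about rows or serialisation formats, so it wouldn't help with diffing.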

Ideally, we should have a solution which understands serialised data and knows how to diff rows (order does not matter).
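A minimal sketch of such an order-insensitive row diff, using dplyr (the columns here are made up for illustration):

```r
library(dplyr)

old <- tibble::tribble(
  ~doi,     ~year,
  "10.1/a", 2019,
  "10.1/b", 2020
)
new <- tibble::tribble(
  ~doi,     ~year,
  "10.1/b", 2020,
  "10.1/a", 2019,
  "10.1/c", 2021
)

# rows added since the old dump, irrespective of row order
added <- anti_join(new, old, by = names(old))
# rows that disappeared
removed <- anti_join(old, new, by = names(old))
```

Reordering alone produces an empty diff here; only the genuinely new row ("10.1/c") shows up in `added`.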

Anyway, this shouldn't be too complicated and we might start with something small.

I'm going to look into this when I get a chance. I think this could save us all a lot of time.

maxheld83 commented 4 years ago

Among other things, the repeated downloads of the big dumps via download.file() should be transparently cached.
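A minimal sketch of what that could look like, assuming a simple file-based cache keyed by file name (`cached_download` and the cache location are hypothetical, not existing metar API):

```r
# transparently cached drop-in for download.file():
# only hits the network if the file is not in the local cache yet
cached_download <- function(url,
                            cache_dir = tools::R_user_dir("metar", "cache")) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  destfile <- file.path(cache_dir, basename(url))
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, mode = "wb")
  }
  destfile  # return the cached path for downstream readers
}
```

This naive version never invalidates the cache; something like memoise or the pins package might handle staleness and collaborative use more robustly.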

maxheld83 commented 4 years ago

This would actually also be a feature for a lot of users, who might face the same problem when running this in CI or collaboratively.