This issue was closed by terryf82 5 years ago.
Adding on:
strengths:
weaknesses:
There is a new branch in the repo, lfs_data, which has the entire /data folder under version control using GitLFS.
Steps to configure this were:
I committed the files from data-latest-2018-08-26.zip from data.world, which contains the preds_final.json files for each of the 3 cities - https://github.com/Data4Democracy/crash-model/tree/lfs_data/data
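For reference, Git LFS records which paths it manages in a .gitattributes file at the repo root. A sketch of what tracking the /data folder could look like (the exact pattern is an assumption; it depends on the `git lfs track` invocation that was used):

```
data/** filter=lfs diff=lfs merge=lfs -text
```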
You can see the "Stored with Git LFS" message when viewing files in the repo, e.g. https://github.com/Data4Democracy/crash-model/blob/lfs_data/data/boston/raw/crashes/vzopendata.csv
Next step is for someone to install LFS on their machine and then check out the lfs_data branch (if you check it out prior to installing LFS, you won't get the files, according to the documentation).
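The checkout steps above can be sketched as follows, assuming the remote is named origin (the commands are standard Git LFS; run them against your own clone):

```shell
# One-time: register the LFS filters in your git config
git lfs install

# Fetch and switch to the LFS-backed branch
git fetch origin
git checkout lfs_data

# If the branch was checked out before LFS was installed,
# this pulls down the actual file contents for the current branch
git lfs pull
```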
One last note: when I tried to push the commit to GitHub using the GUI client SourceTree on Mac, I hit an error about file locking. After some research I ran the command git config lfs.locksverify false, which adds an entry to the git config file, and the error went away. But I then removed that new config entry and pushed another commit, this time from the console, and it worked without error. So I suggest giving it a go without this step first, and only using it if necessary. Thanks.
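The locking workaround above just sets a plain git config key, so it is easy to apply and revert:

```shell
# Disable the LFS lock-verification check for this repo only
git config lfs.locksverify false

# Verify the setting took effect
git config lfs.locksverify    # prints "false"

# Revert it once a push succeeds without the workaround
git config --unset lfs.locksverify
```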
@alicefeng @bpben @j-t-t @mtjones330
This seems to work, although yeah, that was really slow. Maybe we should just have the zipped file under version control instead of the whole directory structure? It's just going to keep getting bigger as we add more cities...
If we create an archive, that'll mean the whole thing will need to be downloaded whenever a single file changes. Or are you talking about zipping individual files?
Now that you have the files locally from the repo, the worst of the delays should be behind us anyway?
I see your point about having to download them, although that's what we're currently doing. I think typically we won't see only a single file change though: when we update one map, we're going to update most or all of them. I'm just worried that git isn't going to be happy with us using this much storage, and it's going to double soon at the rate we're going. I'm willing to give this a try as-is, but the sheer size may end up being something that causes problems.
From the github.com documentation:
We recommend repositories be kept under 1GB each. This limit is easy to stay within if large files are kept out of the repository. If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down. In addition, we place a strict limit of files exceeding 100 MB in size. For more information, see "Working with large files."
Using GitLFS removes the constraint of 100MB per file (I got an error message about this prior to having LFS working properly). If/when GitHub contacts us about removing the big files, we may need to adjust our plan, but until that happens I don't know of a better solution. We could potentially remove the /data/[city]/processed folder from version control, as that is probably the worst offender, but that would mean running the pipeline locally for whatever city you wanted to predict on whenever changes are made to the process... not the worst outcome, I guess.
@bpben @alicefeng @mtjones330 @shreyapandit any thoughts on this?
I think we should continue to store the processed data somewhere, because otherwise different people might end up developing on different maps. I still think zipping the files up, while not ideal, might be the best option here. We can just zip each city individually though, so that when people add new cities, or update data sources for new cities, we wouldn't need to get the whole zip. But when we make code changes that require rerunning the pipeline, that would require a full update, yes.
@j-t-t I think your idea of city-specific zips is the best strategy for now. This lets us get both input & output files under version control, both of which need compression if we're going to keep a reasonable repo footprint (without compression /data for our 3 initial cities is 1.41GB; with bzip2 compression it drops to 127MB). I've gone ahead and run each city's folder through tar & bzip2 and pushed the changes to the lfs_data branch.
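The per-city bundling step can be scripted rather than run by hand. A minimal sketch, assuming the repo's data/&lt;city&gt;/ layout and an output directory named zipped-data/ (that name is illustrative here):

```shell
# Bundle each city's data folder into its own tar.bz2 archive.
# Run from the repo root; -C data keeps paths relative, so
# extraction recreates <city>/... rather than data/<city>/...
mkdir -p zipped-data
for city in data/*/; do
    name=$(basename "$city")
    tar -cjf "zipped-data/${name}.tar.bz2" -C data "$name"
done
```

Unzipping a single city is then `tar -xjf zipped-data/boston.tar.bz2 -C data`.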
Do we have consensus on using this approach now? The only real change to the workflow required is unzipping the archive if you want to run a city, and then rezipping if you make changes. Could / would that be something we want to add to the pipeline?
I'm not sure about adding it to the pipeline. Well, unzipping, maybe. But I'm not sure that this is the best way to ensure that the zipped files are up to date, because if we make code changes that will affect all the cities' maps, you'd have to rerun all those cities, and at the moment the pipeline doesn't support that.
I guess we could have another script that runs all cities for which we have config files, and if you run it with the --forceupdate flag, then rebundle everything.
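A sketch of what that run-everything script might look like. Everything here is hypothetical, not the repo's actual interface: the conf/config_*.yml layout, the pipeline.py entry point, and the --forceupdate flag name are all assumptions for illustration.

```shell
# Hypothetical wrapper: re-run the pipeline for every city that has a
# config file, then rebundle that city's data folder.
for config in conf/config_*.yml; do
    # e.g. conf/config_boston.yml -> boston
    city=$(basename "$config" .yml | sed 's/^config_//')
    python pipeline.py -c "$config" --forceupdate
    tar -cjf "zipped-data/${city}.tar.bz2" -C data "$city"
done
```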
If we go with the zip-each-city model, we'd also want to add each city's subdirectory to .gitignore. Although that means each newly added city would also have to be added to .gitignore, so perhaps we'd want a data/ directory that things get unzipped into, which is ignored, and a zipped-data/ directory that is under version control.
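With a single ignored data/ directory, the .gitignore entry stays fixed no matter how many cities are added. A sketch of that entry (directory names follow the suggestion above):

```
# Unzipped working copies, regenerated from the per-city archives
data/
```

The zipped-data/ directory needs no entry, so it stays under version control.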
PR to merge in our new city-based archiving strategy - https://github.com/Data4Democracy/crash-model/pull/188
@j-t-t I made some updates to the documentation that should cover how to use GitLFS, let me know if you spot any problems. Thanks.
PR has been merged to master and this is now complete.
@j-t-t this PR re-introduces GitLFS into master with the setup we were discussing - https://github.com/Data4Democracy/crash-model/pull/199
Our current approach of building local data files (segments, standardized crashes & concerns, predictions and so on) and using data.world for storage is starting to become a development issue.
I thought it might be worthwhile for everyone to list ideas about where they think the current setup has strengths & weaknesses, and some ideas of alternatives.
My initial thoughts: strengths:
weaknesses:
alternatives:
moving to Git Large File Storage (LFS - https://www.youtube.com/watch?v=uLR1RNqJ1Mw):
a network database:
Please add your ideas below, thanks.