Implement GitLFS - Githubissues

terryf82 commented 5 years ago

Our current approach of building local data files (segments, standardized crashes & concerns, predictions and so on) and using data.world for storage is starting to become a development issue.

I thought it might be worthwhile for everyone to list ideas about where they think the current setup has strengths & weaknesses, and some ideas of alternatives.

My initial thoughts: strengths:

allows for rapid development, files can be created as needed
reading local files means cities can be developed & tested without the need for access to a remote service
data.world is free, like all of the other tools we're using

weaknesses:

reading local files is likely to have performance issues, especially once we start enabling viz-based filtering / searching
data.world has storage limits which we are close to hitting
there isn't any structure around data revisioning, 'data-latest.zip' just gets replaced continuously which will become problematic if different branches require different file states

alternatives:

moving to Git Large File Storage (LFS - https://www.youtube.com/watch?v=uLR1RNqJ1Mw):
1. would mean we could keep building & using local data files for the time being, though that itself is likely to become a performance issue eventually
2. would give us access to data file revisions in the same way we do for our source code
a network database:
1. would be a step towards a formal deployment infrastructure
2. would allow for more sophisiticated interrogation of the data and discovery of relationships / trends

Please add your ideas below, thanks.

alicefeng commented 5 years ago

Adding on:

strengths:

part of a larger data hosting ecosystem (i.e., other people looking for open data might discover our project)

weaknesses:

lack of flexibility. it's incredibly annoying to have to upload/download the entire zipfile just to update one single dataset in it, especially given how large the zipfile is
no way to [easily] track what in data-latest.zip got modified from the previous version

terryf82 commented 5 years ago

There is a new branch in the repo _lfsdata which has the entire /data folder under version control using GitLFS.

Steps to configure this were:

install git lfs (https://git-lfs.github.com/)
remove /data from .gitignore
enable LFS tracking of all files in /data by running git lfs track "data/**" (this creates a .gitattributes file which is now also under version control)
commit /data, .gitignore and .gitattributes and push to the _lfsdata branch
wait half an hour while 1gb of files is uploaded =)

I committed the files from data-latest-2018-08-26.zip from data.world, which contains the preds_final.json files for each of the 3 cities - https://github.com/Data4Democracy/crash-model/tree/lfs_data/data

You can see the "Stored with Git LFS" message when viewing files in the repo, e.g https://github.com/Data4Democracy/crash-model/blob/lfs_data/data/boston/raw/crashes/vzopendata.csv

Next step is for someone to install LFS on their machine, and then checkout the _lfsdata branch (if you check it out prior to installing LFS, you won't get the files, according to the documentation).

One last note - when I tried to push the commit to GitHub using the GUI client SourceTree on Mac, I hit an error about file locking. After some research I ran the command git config lfs.locksverify false which adds an entry to the git config file, and the error went away. But then I removed that new config and pushed another commit, this time using the console, and it worked without error. So I guess give it a go without running this step first and use it if necessary. Thanks.

@alicefeng @bpben @j-t-t @mtjones330

j-t-t commented 5 years ago

This seems to work, although yeah, that was really slow. Maybe we should just have the zipped file under version control instead of the whole directory structure? It's just going to keep getting bigger as we add more cities...

terryf82 commented 5 years ago

If we create an archive, that'll mean the whole thing will need to be downloaded whenever a single file changes. Or are you talking about zipping individual files?

Now that you have the files locally from the repo, that should be the worst of the delays done anyway?

j-t-t commented 5 years ago

I see your point about having to download them, although that's what we're currently doing. I think typically we won't see only a single file change though: when we update one map, we're going to update most or all of them. I'm just worried that git isn't going to be happy with us using this much storage, and it's going to double soon at the rate we're going. I'm willing to give this a try as-is, but the sheer size may end up being something that causes problems.

terryf82 commented 5 years ago

From the github.com documentation:

We recommend repositories be kept under 1GB each. This limit is easy to stay within if large files are kept out of the repository. If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down. In addition, we place a strict limit of files exceeding 100 MB in size. For more information, see "Working with large files."

Using GitLFS removes the constraint of 100mb per file (I got an error message about this prior to having LFS working properly). If/when GitHub contact us about removing the big files, then we may need to adjust our plan, but until that happens I don't know of a better solution? We could potentially removed the /data/[city]/processed folder from version control as that is probably the worst offender, but that would mean running the pipeline locally for whatever city you wanted to predict on when changes are made to the process.. not the worst outcome I guess.

@bpben @alicefeng @mtjones330 @shreyapandit any thoughts on this?

j-t-t commented 5 years ago

I think we should continue to store the processed data somewhere, because otherwise different people might end up developing on different maps. I still think zipping the files up, while not ideal, might be the best option here. We can just zip each city individually though, so that when people add new cities, or update data sources for new cities, we wouldn't need to get the whole zip. But when we make code changes that require rerunning the pipeline, that would require a full update, yes.

terryf82 commented 5 years ago

@j-t-t I think your idea of city-specific zips is the best strategy for now. This lets us get both input & output files under version control, both of which need compression if we're going to stick to a reasonable repo footprint (without compression /data for our 3 initial cities is at 1.41gb, with bzip2 compression it drops to 127mb). I've gone ahead and run each city's folder through tar & bzip2 and pushed the changes to the _lfsdata branch.

Do we have consensus on using this approach now? The only real change to the workflow required is unzipping the archive if you want to run a city, and then rezipping if you make changes. Could / would that be something we want to add to the pipeline?

j-t-t commented 5 years ago

I'm not sure about adding it to the pipeline. Well, unzipping, maybe. But I'm not sure that this is the best way to ensure that the zipped files are up to date, because if we make code changes that will affect all the cities' maps, you'd have to rerun all those cities, and at the moment the pipeline doesn't support that.

I guess we could have another script that runs all cities for which we have config files, and if you run it with the --forceupdate flag, then rebundle everything.

If we go with the zip each city model, we'd also want to add each city's subdirectory into .gitignore. Although that means that each new added city would also have to be added to .gitignore, so perhaps we'd want to have a data/ directory that things get unzipped into that is ignored, and a zipped-data/ directory that is under version control.

terryf82 commented 5 years ago

PR to merge in our new city-based archiving strategy - https://github.com/Data4Democracy/crash-model/pull/188

terryf82 commented 5 years ago

@j-t-t I made some updates to the documentation that should cover off how to use GitLFS, let me know if you spot any problems. Thanks.

terryf82 commented 5 years ago

PR has been merged to master and this is now complete.

terryf82 commented 5 years ago

@j-t-t this P.R re-introduces GitLFS into master with the setup we were discussing - https://github.com/Data4Democracy/crash-model/pull/199

insight-lane / crash-model

Implement GitLFS #179