Classifier data tracking in git leads to large git history

propublica / facebook-political-ads

Monitoring Facebook Political Ads

MIT License


Classifier data tracking in git leads to large git history #71

Closed imalsogreg closed 6 years ago

imalsogreg commented 6 years ago

The initial clone is 2.4 GB, because updated models are tracked in the same repo as the source code.

Would it be possible to host the models in another repository or on S3, and fetch the most recent ones when appropriate (during the build of a release, or at runtime)? I'm not sure yet which is more appropriate for the project.
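To make the "fetch at runtime" option concrete, here is a minimal sketch of a download-if-missing helper. It is only an illustration under stated assumptions: the function name `fetch_model`, its parameters, and any URL passed to it are hypothetical and do not come from this repo. It uses only the Python standard library, so the models could sit behind any plain HTTP(S) URL (for example a public S3 object).

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch_model(url: str, dest: Path, force: bool = False) -> bool:
    """Download `url` to `dest` unless `dest` already exists.

    Hypothetical helper for illustration only.
    Returns True if a download happened, False if the cached file was kept.
    """
    dest = Path(dest)
    if dest.exists() and not force:
        return False  # reuse the copy that is already on disk

    dest.parent.mkdir(parents=True, exist_ok=True)

    # Download into a temporary file in the same directory, then rename it
    # into place, so an interrupted download never leaves a partial model.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as resp:
            shutil.copyfileobj(resp, out)
        os.replace(tmp_name, dest)
    except BaseException:
        # Clean up the partial file before re-raising the original error.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return True
```

Calling this at the start of the classifier job would keep the models out of git entirely. Calling it from a release build instead would trade a larger build artifact for no network access at runtime.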

jeremybmerrill commented 6 years ago

Yes, that's certainly true. It's on my todo list, along with a lot of other things...

(The webapp that serves the database and interacts with the extension doesn't need the models. That's just the classifier task that runs on a cron. So the models absolutely don't need to be in this repo, but I haven't had a chance to excise them.)

imalsogreg commented 6 years ago

@jeremybmerrill I'm happy to be assigned issues from the tracker, rather than putting them on your personal TODO list :) If I were to take the task on, I would open a WIP PR with a plan for where to host the files. But feel free to close the issue if it's not appropriate for an outsider.

jeremybmerrill commented 6 years ago

Hey Greg --

This particular task is probably one that'd be better for me to do, since it's a question of how to get it integrated into our infrastructure. (And I have a workflow from another project to follow.)

If you're interested in picking up some tasks, I will write up some issues and tag you in the comments.

jeremybmerrill commented 6 years ago

I did this. It's all set up now. See https://github.com/propublica/facebook-political-ads/blob/master/backend/classifier/classifier/commands/get_models.py

imalsogreg commented 6 years ago

FYI, the initial clone is still 1.8 GB due to the git history, but that only affects new clones.
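For anyone who wants to see which files are responsible, here is a rough sketch that lists the largest blobs still reachable from any ref. It shells out to `git rev-list --objects --all` and `git cat-file --batch-check`; the function names `parse_blob_sizes` and `largest_blobs` are made up for this example. The sizes are uncompressed object sizes, so they overstate what is actually stored in `.git`, but the ranking is usually enough to spot old model files.

```python
import subprocess
from typing import List, Tuple


def parse_blob_sizes(batch_check_output: str) -> List[Tuple[int, str]]:
    """Parse lines of the form '<type> <sha> <size> <path>' and return
    (size_in_bytes, path) for blobs only, largest first."""
    blobs = []
    for line in batch_check_output.splitlines():
        parts = line.split(" ", 3)
        if len(parts) < 3 or parts[0] != "blob":
            continue  # skip commits, trees, tags, and malformed lines
        path = parts[3] if len(parts) > 3 else ""
        blobs.append((int(parts[2]), path))
    return sorted(blobs, reverse=True)


def largest_blobs(repo: str = ".", limit: int = 10) -> List[Tuple[int, str]]:
    """Return the `limit` largest blobs reachable from any ref in `repo`."""
    # List every object reachable from all refs, with its path if it has one.
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    # Ask git for the type and size of each of those objects.
    sizes = subprocess.run(
        ["git", "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        cwd=repo, input=objects.stdout,
        capture_output=True, text=True, check=True,
    )
    return parse_blob_sizes(sizes.stdout)[:limit]


if __name__ == "__main__":
    for size, path in largest_blobs():
        print(f"{size / 1e6:10.1f} MB  {path}")
```

This only reports what is in the history. Actually shrinking the clone would require rewriting history to drop those blobs, which is not shown here.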

jeremybmerrill commented 6 years ago

Hmm, damn. I thought I'd fixed that. I'll take another look at it. Thanks!

imalsogreg commented 6 years ago

Fixed!

[greghale@p51:~/code/facebook-political-ads]$ du -h -d1
34M     ./.git
4.2M    ./extension
81M     ./backend
119M    .