quaquel / EMAworkbench

workbench for performing exploratory modeling and analysis

Remove large files from repo #70

jackjackk closed this issue 3 years ago

jackjackk commented 4 years ago

The current repository size is ~1.2GB. This seems to be mostly due to old binary/data files in the history.

For example, you can use BFG (see https://help.github.com/en/articles/removing-sensitive-data-from-a-repository for a somewhat similar use case) to delete cPickle, tar.gz, bz2 and csv files from the history (you might want to add more extensions; I have also seen some jars!):

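# Work on a bare mirror clone so that all branches and tags are rewritten
git clone --mirror git@github.com:quaquel/EMAworkbench.git EMAworkbench-small.git
# Remove matching files from the history (BFG leaves the HEAD commit untouched)
bfg --delete-files '*.{cPickle,tar.gz,bz2,csv}' EMAworkbench-small.git
cd EMAworkbench-small.git
# Drop the now-unreferenced objects and repack
git reflog expire --expire=now --all && git gc --prune=now --aggressive
# Report the resulting size on disk
du -hcs .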

With this, the repository goes down to ~227MB.

In particular, the deleted files would be:

dc4b7438040916526196fbc59972dfdaeb8fa968 4096 ._graph.cPickle
5fae6e63c0ad3a4dc0f60a8ede25535d9b2b8247 556459 100 flu cases no policy.bz2
8df114e26a275980b4b67039b2d4e58c75af64b9 3760117 100 flu cases.cPickle
54558e877add0daafaf24b212c2359deb74635b0 5332974 1000 flu cases no policy.bz2
30ed92441c5c0a10b9ab7af6df12aa320b97a76d 5348084 1000 flu cases no policy.bz2
3784f9cf15fc20959c33f35f34a8d9acab0724e4 56002642 1000 flu cases no policy.cPickle
cb7c679db36a258a25cef0a2f48b7e88616c6370 18659638 1000 flu cases no policy.cPickle
c55082d9ce986011a5a3e10bd165a75659961885 10532970 1000 flu cases no policy.tar.gz
d285e6f37e4e63ed7eeb639595203472e7416101 10528691 1000 flu cases no policy.tar.gz
31bd8492fc43234e9091038291037718390ddbee 12253114 1000 flu cases no policy.tar.gz
b8bd17b6d493b413fb1dc59228d4c0054862c48f 5429770 1000 flu cases no policy.tar.gz
64bfdc05833086cb6342a2fc9cea4075986f2590 10534065 1000 flu cases no policy.tar.gz
078441d3594a67c29673de0f9c65bfccfeafb446 16174361 1000 flu cases with policies.tar.gz
1205859b4598a9d376d13441992b80754889c6c1 31349391 1000 flu cases with policies.tar.gz
db8335beb207c6d5a40ad2457c566269a6a5571d 31352231 1000 flu cases with policies.tar.gz
6e0b222377bfde4920bce3b2d364f9f3df4128e9 31341337 1000 flu cases with policies.tar.gz
d5bff49a0ac760d43f6dc88953a145dc56d0ec8b 15403952 1000 flu cases.bz2
39de77dbc26a9274b87356be9f33cf40fecf0582 15336281 1000 flu cases.bz2
c8e1ca9e0da0d9f27179360c611dfa9339cd0ecb 56002642 1000 flu cases.cPickle
ce1e5033277c3a0a4fcdc73f6fada71da728f3ca 31826531 1000 flu cases.tar.gz
7b5a1ef6dbac6ef296b8e40d160be52657f904ce 35960636 1000 flu cases.tar.gz
e7ab66045399c730c51db67398f1ecdb8c36b229 31167804 1000 flu cases.tar.gz
2bc990b79640dbe18c4c85cc29b0967c0dc46653 36487127 1000 flu cases.tar.gz
fd925b366e4ce9f5670c530ed844a5881122d05d 36530068 1000 flu cases.tar.gz
905389040ff9d59ca2cfce2f9ec6a42ae8fa0a76 39819531 1000 runs scarcity.tar.gz
655fb5f26860f689bdb51e2247b83410486c6020 20208460 1000 runs scarcity.tar.gz
936fe287b3a078ebbe27c03b3cd7151eb1c569c4 39764926 1000 runs scarcity.tar.gz
23988dd2f179152db0ad270e1290e25dec0c2ad9 1591716 100Flu.cPickle
4a13cf0ac7ce521ba8cbd8c877a2de8c0386d6c3 10719415 2000 flu cases no policy.bz2
6d81af966ec83afe83cd2f219ac3b422c59d4a1a 914439 5000 runs WCM 64 bit.bz2
2d926fc155e86211e9a53093f3a5e47e00aefd66 1120117 5000 runs WCM.tar.gz
942a44cc59dd511d5f9c1168523aed1ee05c7eb0 1017269 5000 runs WCM.tar.gz
0a89e45234ac013be4130f73d83631945c4104fc 1016901 5000 runs WCM.tar.gz
a0af28bd4678a46d0182f0b64910b3b0a8dfc5a8 1015536 5000 runs WCM.tar.gz
ae10ad0e7819a2c9b68ba9e53a016527e8ec5014 10265167 CESUN_optimized_1000.cPickle
df87267fa5a18481204eecc8e41ba7ed30f619c5 41322918 CESUN_optimized_1000_new.cPickle
3bc86dc5101ff4004c2eae0f9236f6174738e499 20330640 IWA.cPickle
8d079b16f40b83196f165226a6d65cd3cf653ec8 20025000 Installed Extraction Capacity.csv
7e6319c51f42a2268babe5c6d64de9b2f02a45da 20025000 Installed Recycling Capacity.csv
735157c56e215fdce858df879e81e4db44919482 93595988 JotKE 50000.bz2
914ec17f7e16a1d0912911eb492f33b9972e0c83 125503178 TFSC_corrected.bz2
379e9dcaca0d290713a059dc9957c707e4acbf1b 19225000 TIME.csv
e5c2f080cad7b94126d861c6a667676c8e096fce 1515000 TIME.csv
5eed97a55f8cc0b437cd16d9a3d4cb172364b730 57691790 TIME.csv
00b1287c18411bd78cf6494448261b75dc4038d1 424200 TIME.csv
034d771b1c31dde29ce6a4fca0c046f6e9d69e51 178079 Tester.cPickle
b26806c3eade31237bbb20f87ed7679706fdae32 19270308 a_0.csv
6d377f544b44bfed5744f1f318c866f8ac387cde 19270334 a_1.csv
bc432ef9fa6f29dcf565529bf38c347852208fbd 19269181 a_2.csv
41584949cb709f43df99336dd5ffa8fbbf0a72f6 19270008 a_3.csv
8a233b52655e216d9c0bc6048ca076f1c269f0a3 19268979 a_4.csv
16ec3b6a42ba699a940b708c52e176f5da004d6d 19270172 a_5.csv
f3adc4222118e3e3c8e0b50f79f9fce9055db60c 19268816 a_6.csv
1b67abab671f26c83d996edbd3024cdcd4a5f272 19271312 a_7.csv
93df87b90f59c9a4653cdc500b00ce53f742fb76 19271100 a_8.csv
4269b5b598e66146f4864ff9b7db2013125ae0ce 19270930 a_9.csv
b5e8a22dcdae1cc8cb0c9111f79c8d757d644f2d 51498518 base.cPickle
c5b90a3b1cf6e1bdfc64453ae0ee5fa8bb5610e1 113274 bryant et al 2010 data.csv
c7303b947a3434c407a44cde2daca755492ae9b9 31843015 clustering.cPickle
e49d5b5b8514e0fc68cf266de2b69bfc61eaa422 31863063 clustering.cPickle
0e4291a89ef583322799c8665b0a45acbb21b2ef 8897814 clusters.cPickle
9d58fa0c9547dfc02c3a87159498f87c6c16d4c2 8196152 clusters20.cPickle
81405dabe253e5001e18752b85c8e2206fee5994 3983065 data.cPickle
c4750688bd5bfa1d2979dcb49e82985521a08bd0 19225000 deceased population region 1.csv
10f13491e165472a39055caeb17b1f5845fae739 57691790 deceased population region 1.csv
c4750688bd5bfa1d2979dcb49e82985521a08bd0 19225000 deceased_pop.csv
68338af2537912be94f03ba1e1462c85ae501514 5613179 energy trans 1000 experiments.bz2
d86b7b62ca56d7d1a93c53c8a5c7146463502086 988953 eng_trans.tar.gz
058dee461925c86d6fcf6180522c96efa3c9aad2 1014833 eng_trans.tar.gz
efef718966057fb711ab7ff59c354cef41fd5352 1764513 eng_trans_100.cPickle
3fc5747c93223381a93d400f17a9956c80c383f2 632353 eng_trans_100.cPickle
4ea84ad56f2cfc247142c923bd9e3fb9bcd9bace 1354 experiments metadata.csv
f94a47411f46f303329966a777e0214ff0560e26 779 experiments metadata.csv
acc1bfe9630c40651aa77c1ed22209b61aa0a3f8 11 experiments metadata.csv
06f7213406e0c30c891bd5c223788f579c25dcbb 391891 experiments.csv
9e88d82dd2b8673736e909fda5b3cb9412a6bc05 1199170 experiments.csv
5c94ca002a0678760a6351a6b878f72d2055ad1e 469875 experiments.csv
6a4907b8f354e821b29cc538c24f95660a78f777 611117 experiments.csv
26ea8d433d4c5bec6aa912d7b76a39529f97df68 80004 experiments.csv
0da36b346219f64bd3f2b5556d17ad4628672ddf 391891 experiments.csv
0da36b346219f64bd3f2b5556d17ad4628672ddf 391891 flu_experiments.csv
e96b4e736b3b70b4d48f851d10745922675ebbed 862 flu_uncertainties.csv
bf4daa53bde9f7727f682eab2d717ee8b712aac0 1852 graph.cPickle
828d105c1419999afa869874c26a130881e77c51 57691790 infected fraction R1.csv
4d40bebf304a2b640c85bf9053ea68b2210aa5fd 19225000 infected fraction R1.csv
7adbc90143225c370c4303480f8660269bfbf980 15712 minerals and metals network.cPickle
c7370740b8c77198776e339c899f65c593dd1e76 12750 model of Oliva.cPickle
6a5b6c669226a3ba19626b0e1e19faaf6b236fee 94 outcomes metadata.csv
b65f2892e7866aa36439b3854afa7b77ec6375a3 18 outcomes metadata.csv
401ce78e89c81ec34e836dfc6573dbeabf25036b 387750 predatorPreyNetlogotemp.csv
f301b4e0c95912e9fb8b872621ddec20aeede7f9 382729 predatorPreyNetlogotemp.csv
2b56ec21d32f469052ec71c908024983604da451 352593 prim data 100 cases.bz2
3f4f40010da38339e836a3455dbcf2a5ca14fbdf 1018400 prim data 100 cases.cPickle
2680d38994c958fb3afcbba69ab978eb785ea9bf 20025000 produced of intrinsically demanded.csv
80813bc041fcf4a5d0785886004f2d243267685a 20025000 real annual demand.csv
74d4969eb0d6be28b975f8eb44e2efc3a1501d57 20025000 relative market price.csv
74d4969eb0d6be28b975f8eb44e2efc3a1501d57 20025000 relative_market_price.csv
4d6d30cba564427cb27e75c09e5357abdaa7edeb 3044090 result100copperBottumUp.cPickle
3837cd8e0b48ec64d0b15aeb9489d881e3fbe144 26647 robust optimization results.cPickle
e40d7d67de154f3403ee22a9ad25adca86c842f5 27000 robust optimization results.cPickle
4d32567126c2918137ffd0509956712e558d4468 3149 robust test.bz2
d51948e1c822ce26f4631327f78a131fa8cf5579 19702750 scarcity 1000.bz2
e5ad750957c1758f7215e689d277f4a90b474796 21111243 scarcity 1000.bz2
fc80eb6b4a1ae5e79def3f21911e11546f9cf4fc 56552 storedresults.cPickle
c161fe3aedb41e86a220ca80869083f53d5de548 1713816 storedresults_2.cPickle
7973e4c7ebc2e485d5de3f36a8ed7cfe25d538e8 2870003 storeduncertainties.cPickle
38ccfa81aa121639058517a2c14feeb71f78c212 20025000 supply demand ratio.csv
e304eb4f8fe9b40194a0d14d0837fc582531d2d0 20025000 supply.csv
a4bf3f8870e842f80ab5ab56e15a91a87bbb19a2 2030 test optimization save.bz2
3d107b0c7b8b884aa4ea80200944c5594e918222 1114 test optimization save.bz2
121b4790e99513314b1d7d3e1b3984eab4e7dce2 29 test.tar.gz
eee64e3e837616c4fbe3a1f4281c89997b4eae31 29 test.tar.gz
8c317280d5a6eb54c011f560fb18a1405c4986ec 29 test.tar.gz
4b71632ed465195c511172a559e62c60499b1e75 29 test.tar.gz
26ad2702a8c0b0fbb5ee330a6c1a95618d4048b2 86016 test.tar.gz
a3b7b733b42b4c06ef3a119b8ff7c762134d0cdb 29 test.tar.gz
16e4ec6ad47d9bc8ea41e512956b2cbb075d5be4 29 test.tar.gz
a7b2d31f8b9de909a9e5e09313af6919c2ad3b0b 516924 test_cluster.cPickle
379e9dcaca0d290713a059dc9957c707e4acbf1b 19225000 time.csv
35069bf3e11dc241e8232f23b0a145f813aae5a8 20025000 time.csv
155e5a193bc37e423a69ad337cd246a977a56a4e 1515000 total capacity installed.csv
79d4a51cff8b9b5b08442ab1fb0cd7d02829bfbf 791012 total capacity installed.csv
51e912d43766b19ee0628d0a441fcdd668650e0b 904247 total fraction new technologies.csv
ac9021cccd432ef70a0d868c580f610223c650fb 1515000 total fraction new technologies.csv
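
To see which extensions account for most of the removed bytes, a listing in the format above (blob id, size in bytes, file name) can be summarized with a short script. This is only a minimal sketch: the function names are hypothetical, the sizes are taken at face value from the listing, and a blob id that appears under two file names is counted once, since Git stores identical content only once.

```python
from collections import defaultdict

# Extensions targeted by the bfg command above; "tar.gz" is checked as a whole.
KNOWN_EXTENSIONS = ("tar.gz", "cPickle", "bz2", "csv")


def parse_line(line):
    """Split 'sha size name with spaces' into (sha, size, name)."""
    sha, size, name = line.strip().split(" ", 2)
    return sha, int(size), name


def extension(name):
    """Return the matching known extension, or 'other'."""
    for ext in KNOWN_EXTENSIONS:
        if name.endswith("." + ext):
            return ext
    return "other"


def bytes_by_extension(lines):
    """Sum listed sizes per extension, counting each blob id once."""
    totals = defaultdict(int)
    seen = set()
    for line in lines:
        if not line.strip():
            continue
        sha, size, name = parse_line(line)
        if sha in seen:
            continue
        seen.add(sha)
        totals[extension(name)] += size
    return dict(totals)


# Three lines copied from the listing above.
sample = [
    "4d32567126c2918137ffd0509956712e558d4468 3149 robust test.bz2",
    "a4bf3f8870e842f80ab5ab56e15a91a87bbb19a2 2030 test optimization save.bz2",
    "e96b4e736b3b70b4d48f851d10745922675ebbed 862 flu_uncertainties.csv",
]
print(bytes_by_extension(sample))  # {'bz2': 5179, 'csv': 862}
```

Feeding the full listing through `bytes_by_extension` would give a rough idea of whether it is worth adding further extensions to the `--delete-files` pattern.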

while the protected files (kept because they belong to the HEAD commit) would be:

b8bd17b6d493b413fb1dc59228d4c0054862c48f,DELETE,regular-file,ema_workbench/examples/data/1000 flu cases no policy.tar.gz,5429770,
078441d3594a67c29673de0f9c65bfccfeafb446,DELETE,regular-file,ema_workbench/examples/data/1000 flu cases with policies.tar.gz,16174361,
942a44cc59dd511d5f9c1168523aed1ee05c7eb0,DELETE,regular-file,ema_workbench/examples/data/5000 runs WCM.tar.gz,1017269,
c5b90a3b1cf6e1bdfc64453ae0ee5fa8bb5610e1,DELETE,regular-file,ema_workbench/examples/data/bryant et al 2010 data.csv,113274,
e96b4e736b3b70b4d48f851d10745922675ebbed,DELETE,regular-file,ema_workbench/examples/models/flu/flu_uncertainties.csv,862,
c55082d9ce986011a5a3e10bd165a75659961885,DELETE,regular-file,test/data/1000 flu cases no policy.tar.gz,10532970,
905389040ff9d59ca2cfce2f9ec6a42ae8fa0a76,DELETE,regular-file,test/data/1000 runs scarcity.tar.gz,39819531,

Use caution when pushing these changes to the GitHub repo! Just in case, make sure you have a backup. If you decide to proceed with a git push, also ask everyone who has push rights to clone the repository again (to avoid the old history being pushed back).

You might also want to check all the binary files currently present in the repository (binary files are not well suited to change tracking unless something like https://git-lfs.github.com/ is used) and do some cleanup before running the commands above. The binary files currently in the repository are:

docs/source/ystatic/boxes_individually.png
docs/source/ystatic/boxes_together.png
docs/source/ystatic/envelopes.png
docs/source/ystatic/envelopes3d.png
docs/source/ystatic/flu-model.png
docs/source/ystatic/lines.png
docs/source/ystatic/lines2.png
docs/source/ystatic/logo.png
docs/source/ystatic/multiplot-flu-adaptive-policy.png
docs/source/ystatic/multiplot-flu-no-policy.png
docs/source/ystatic/multiplot-flu-static-policy.png
docs/source/ystatic/multiplot.png
docs/source/ystatic/prim_flu_example.png
docs/source/ystatic/prim_visual.png
docs/source/ystatic/simpleVensimModel.png
docs/source/ystatic/tutorial-lines.png
ema_workbench/examples/data/1000 flu cases no policy.tar.gz
ema_workbench/examples/data/1000 flu cases with policies.tar.gz
ema_workbench/examples/data/5000 runs WCM.tar.gz
ema_workbench/examples/models/burnout/BURNOUT.vpm
ema_workbench/examples/models/burnout/Current.vdf
ema_workbench/examples/models/excelModel/excel example.xlsx
ema_workbench/examples/models/flu/Current.vdf
ema_workbench/examples/models/flu/FLUvensimV1basecase.vpm
ema_workbench/examples/models/flu/FLUvensimV1dynamic.vpm
ema_workbench/examples/models/flu/FLUvensimV1static.vpm
ema_workbench/examples/models/scarcity/Current.vdf
ema_workbench/examples/models/scarcity/MetalsEMA.vpm
ema_workbench/examples/models/vensim example/Current.vdf
ema_workbench/examples/models/vensim example/model.vpm
test/data/1000 flu cases no policy.tar.gz
test/data/1000 runs scarcity.tar.gz
test/data/BURNOUT.vpm
test/data/eng_trans.tar.gz
test/models/CESUN_optimized_new.vpm
test/models/Current.vdf
test/models/FLUvensimV1basecase.vpm
test/models/MetalsEMA.vpm
test/models/lookup_model.vpm
test/models/model.vpm
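
A list like the one above can be regenerated at any time by scanning a checkout. The following is a minimal sketch under stated assumptions: the function names are hypothetical, a file is treated as binary if a NUL byte occurs in its first 8000 bytes (a common heuristic, not necessarily the exact method used to produce the list above), and every file in the working tree is scanned, including untracked ones.

```python
import os


def looks_binary(path, probe=8000):
    """Heuristic: a NUL byte near the start of the file suggests binary content."""
    with open(path, "rb") as fh:
        return b"\0" in fh.read(probe)


def binary_files(root="."):
    """Return sorted paths (relative to root) of files that look binary."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if looks_binary(path):
                found.append(os.path.relpath(path, root))
    return sorted(found)

# Usage, from the root of a checkout:
#     for path in binary_files("."):
#         print(path)
```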

In general, you might want to consider a separate repository for larger datasets/binary files, or other web/db hosting to refer to, as this will keep the code history more maintainable.

quaquel commented 4 years ago

I have been wondering about this for a while, so thanks a lot for the suggestion. Do you know how this would affect any branches? Or can you simply merge master into, say, your new development branch to benefit from the smaller size?

Regarding binaries etc., I agree in principle, but having the data files for the examples in the tree is much more convenient for novice users. Otherwise, they would have to download a separate set of data files and copy them to the right location (or do something like Altair, where you have a separate repo with the data files, see https://altair-viz.github.io/getting_started/installation.html).
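
One way to soften the inconvenience of a separate data location would be a small helper that downloads an example data file the first time it is needed and reuses the local copy afterwards. This is only a sketch of the idea, not something the workbench provides: `fetch_example_data`, `DATA_URL` and the cache directory are hypothetical names, and the URL is a placeholder.

```python
import os
import urllib.parse
import urllib.request

# Placeholder for a separate data repository or file host; not a real location.
DATA_URL = "https://example.org/ema-data/"


def fetch_example_data(filename, cache_dir, base_url=DATA_URL):
    """Return a local path to filename, downloading it only if it is missing."""
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, filename)
    if not os.path.exists(target):
        # File names in this repository contain spaces, so quote them for the URL.
        url = base_url + urllib.parse.quote(filename)
        urllib.request.urlretrieve(url, target)
    return target

# Usage (would download on first call only):
#     path = fetch_example_data("1000 flu cases no policy.tar.gz", "./data")
```

With something like this, an example script could keep referring to a data file by name while the file itself lives outside the code repository.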

quaquel commented 3 years ago

Finally got around to doing this. Worked like a charm. All old large files are now removed and the repo size is down tremendously.