lab-cosmo / i-pi-dev_archive

Development version of i-PI

Repository bloat #178

Open OndrejMarsalek opened 7 years ago

OndrejMarsalek commented 7 years ago

Because the repository never forgets, it easily bloats with data that is checked in and then removed again. Currently, it is around 220 MB, while the working tree is only around 35 MB. I tried looking for some resources that could help and found this:

https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/

Running the script, I get a list that starts with the entries below. I think we should try to filter most of these from the history. The entries at the end of this list, sorted by size, are around 1 MB, so looking even further might still make sense. If we don't maintain a separate repository for examples, we need to be a bit careful so that we don't make people download hundreds of MBs if they want to run a simple simulation.

All sizes are in kB. The pack column is the size of the object, compressed, inside the pack file.
size   pack   SHA                                       location
30565  8618   0c12d4a625b53bca605d080b3ef03514d614a9c6  examples/ppi/qtip4pf/qtip4pf.pos_7.xyz
30565  8617   19aea3e92732eb3721a38cfcf909a5372ad8c7bc  examples/ppi/qtip4pf/qtip4pf.pos_5.xyz
30565  8618   2a1c03bc59f33d6095a663541892668fd93ae724  examples/ppi/qtip4pf/qtip4pf.pos_2.xyz
30565  8617   3041cf398fb10fa955b34971bb88664f334f8965  examples/ppi/qtip4pf/qtip4pf.pos_1.xyz
30565  8618   72ecdcbb1d92a474a12411ffee7097495a40895a  examples/ppi/qtip4pf/qtip4pf.pos_4.xyz
30565  8619   c1e61916a1dda9c95133da4c734a4e1d64cdf54a  examples/ppi/qtip4pf/qtip4pf.pos_0.xyz
30565  8618   c590357735fca50c993f3973e3bd54ea8010d02f  examples/ppi/qtip4pf/qtip4pf.pos_6.xyz
30565  8617   ffc3f0c99c22b9a5dc6c5bffc8de767c19b7d219  examples/ppi/qtip4pf/qtip4pf.pos_3.xyz
30564  9198   12e029354bca5043f93fb8436b5a1ccbfea65c81  examples/ppi/qtip4pf/qtip4pf.force_3.xyz
30564  9197   31d1f02d9a1ce3f0224bcc221e0769b1c51df38d  examples/ppi/qtip4pf/qtip4pf.force_0.xyz
30564  9197   6e1ba45f6efc329e1eeae1e90caba4dc5efd45ff  examples/ppi/qtip4pf/qtip4pf.force_5.xyz
30564  9196   a11d414c934f211410b606ec6a10b4c6e15ab460  examples/ppi/qtip4pf/qtip4pf.force_6.xyz
30564  9196   a4311bf1b268d6d58e25f259a364d4f7438cda4e  examples/ppi/qtip4pf/qtip4pf.force_2.xyz
30564  9197   b9b993ac2df51a380ae5b65da42cc7da389cac13  examples/ppi/qtip4pf/qtip4pf.force_7.xyz
30564  9198   c55b9aba770369b4c6516441b553834202b519d3  examples/ppi/qtip4pf/qtip4pf.force_4.xyz
30564  9197   e7281c1b170259d7f71bacb2ec2c05fe5e6e27d3  examples/ppi/qtip4pf/qtip4pf.force_1.xyz
17721  3605   a79efe62102e38b34e29fed219f00afda8a1892d  examples/ppi/qtip4pf/benchmark/qtip4pf.energies.dat
15500  4339   71032e78a592a092937c30b2a6ce771506e8c5ff  examples/lammps/h2o-mts/MTS-Ensemble/trial-01/rpc.pos_0.xyz
13785  13162  992bda057bd0e8dd4b9fb3fafa1b39d2f0f5e2f6  data/diss-zurich10.pdf
8995   1411   25347623730ac47702369201032ad73f7aa80cda  test/ph2/test_ph2.pdb
6608   6392   fd3d30f6b2a249968f88e2157aebda584e837de2  movies/ice-cage.flv
4345   1760   cdceced9914f6622f14a1878eca1d14791b33a39  examples/lammps/paracetamol-einstein/input.xml
3374   3179   1d1fd08788b781080445763eb970b1c9eb6b5dd5  data/ceri14psik-highlight.pdf
3224   1463   34d41828d417b38a91534326c4acdd7270f4f3f9  examples/lj/nst/reference/lj-nst.pos_0.pdb
3122   2993   cff1c905a7651caa25aac7f44f2e775a50096794  data/i-pi_1.0.zip
2902   2769   8d650c408757239e49ab82138bb63c5ec124028a  data/tut-lugano10.pdf
2085   1959   9c8e64ebb9605c8f3849c71cd83ea450a9a703e1  data/lugano10.pdf
1972   266    49a85bc43c83197f25d01fb6d17f9c3638dcf93a  examples/lammps/newdyn/nst-ice.xc.pdb
1913   333    3b8aea8aae32e049510f9f41c5ffe8951182bd59  examples/cp2k/basis/dftd3.dat
1859   158    4416e8913a2d64baaa44c126ce9cc43c21ce16bf  examples/lammps/h2o-mts/MTS-Ensemble/trial-04/log.lammps
1787   1003   1ae7d88e2ac652f70a6587c768447c0d2e61c25e  examples/lj/nst/reference/lj-nst.pos_0.pdb
1736   1011   037343786bd33f7d9e74af45b0ad2512e6b0c7e1  examples/lj/nst/reference/lj-nst.pos_0.pdb
1556   431    57652c560fac09ed25128b1db0058ee993a31202  examples/lammps/h2o-mts/MTS-Ensemble/trial-04/rpc.pos_0.xyz
1344   814    14266545d35648a107bbe12aba9852e466f2a5e9  examples/lj/nst/reference/lj-nst.pos_0.pdb
1228   1197   3e63eeeddbb442bb3040a33195a5a23de7668d64  images/header-homepage.jpg
1213   1183   b82671699bbf75cbe79fcab1dbc88924612282b4  data/i-pi_hands-on.zip
1034   419    21e92a4ed33f15d863ce7798449a506eb53da4f8  examples/lammps/paracetamol-phonons/simulation-fd.dynmat
996    411    5bcaf53662751881fd0f3a261046992301e4aab8  examples/lammps/paracetamol-debye/hessian.data
989    409    da9da9f135199c5f408cd06d1d526f4cf893c843  examples/lammps/paracetamol-phonons/simulation-fd.hess
950    404    2119abc2cd589b225fac326fb5e99cd3d720bc5c  examples/lammps/paracetamol-phonons/simulation-fd.mode
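
To see where the bulk of the packed history sits, a listing like the one above could be summed per top-level directory. The following is only a minimal Python sketch: the function name `summarise_listing` is hypothetical, the sample rows are a small excerpt of the listing, and it assumes four whitespace-separated columns (size, pack, SHA, location) with sizes in kB.

```python
from collections import defaultdict


def summarise_listing(text):
    """Sum the 'pack' column (kB) per top-level directory of 'location'.

    Assumes rows of the form: size pack SHA location.
    """
    totals = defaultdict(int)
    for line in text.splitlines():
        fields = line.split(None, 3)
        # Skip blank lines and the header row ("size pack SHA location").
        if len(fields) != 4 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        _size, pack, _sha, location = fields
        top_level = location.split("/", 1)[0]
        totals[top_level] += int(pack)
    # Largest contributors first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# A few rows copied from the listing, used here only as sample input.
sample = """\
size   pack   SHA                                       location
13785  13162  992bda057bd0e8dd4b9fb3fafa1b39d2f0f5e2f6  data/diss-zurich10.pdf
6608   6392   fd3d30f6b2a249968f88e2157aebda584e837de2  movies/ice-cage.flv
3374   3179   1d1fd08788b781080445763eb970b1c9eb6b5dd5  data/ceri14psik-highlight.pdf
"""

for directory, pack_kb in summarise_listing(sample):
    print(f"{directory:10s} {pack_kb:8d} kB packed")
```

Feeding the full output of the script into the same function would give a per-directory total, which may help decide which paths are worth filtering first.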

ipoltavskyi commented 7 years ago

We can completely delete that benchmark folder.

Best regards Igor

OndrejMarsalek commented 7 years ago

That would certainly be useful and will make the working tree slimmer, but the trickier part is removing it and other large deleted files from the repository. Because this means rewriting history, I want to be careful. Does anyone have experience with filter-branch?

tomspur commented 7 years ago

I did this a few times and it worked just fine. It is just unclear to me whether you then need to delete this whole repository or whether the older branches can stay. So far, I was the only user of the repositories where I did this, so this wasn't an issue back then.

ceriottm commented 7 years ago

Let's do this carefully, but I am all in favor of rewriting history and cleaning up the repo.

OndrejMarsalek commented 7 years ago

I just tried using this tool:

https://rtyley.github.io/bfg-repo-cleaner/

and it seems to work great. After filtering out all files larger than 1 MB (the specifics can be tweaked for the production run, of course), I get a much more acceptable 25 MB .git directory instead of 220 MB. You can try it yourself locally; just make sure that you don't push anything. Note that I found it better to work without the --mirror option, so that you can run git more easily, including the large-file script I posted before.
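
For anyone who wants to repeat the local test, the steps could look roughly like the sketch below. It only builds and prints the commands rather than running them, and it deliberately contains no push. The helper name `bfg_dry_run_commands`, the working directory name, the location of `bfg.jar` and the repository URL are all assumptions made for illustration. The `--strip-blobs-bigger-than` option and the reflog/gc cleanup follow the usage shown on the BFG page, with the clone made without `--mirror` as described above.

```python
import shlex


def bfg_dry_run_commands(repo_url, work_dir="i-pi-bloat-test",
                         bfg_jar="bfg.jar", threshold="1M"):
    """Return the command lines for a local-only BFG trial (nothing is pushed).

    All default values are placeholders chosen for this sketch.
    """
    return [
        # Plain (non-mirror) clone, so ordinary git commands work in it.
        ["git", "clone", repo_url, work_dir],
        # Strip every blob above the threshold from the history.
        ["java", "-jar", bfg_jar, "--strip-blobs-bigger-than", threshold, work_dir],
        # Drop the now-unreferenced objects so the .git directory shrinks.
        ["git", "-C", work_dir, "reflog", "expire", "--expire=now", "--all"],
        ["git", "-C", work_dir, "gc", "--prune=now", "--aggressive"],
    ]


# Assumed repository URL; adjust to whichever remote you actually clone from.
for argv in bfg_dry_run_commands("https://github.com/cosmo-epfl/i-pi-dev.git"):
    print(shlex.join(argv))
```

Comparing the size of the `.git` directory before and after these steps should show whether the chosen threshold is sensible.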

It will be the push to GitHub that will be the most sensitive part of this operation. Once that is done, everyone with write access must update their local clones and never push from a clone of the old bloated repository. We need to find a way to coordinate this. I suggest setting a date and time well ahead of time, sending a big fat warning to everyone with push access and getting explicit agreement that they know about it and will not push the old repository.

grhawk commented 7 years ago

The main problem is that once we rewrite the history, everyone must delete their local repo and download the one with the new history... [EDIT] exactly as Ondrej said at the end of his last post :)

OndrejMarsalek commented 7 years ago

It requires some coordination, but unless we plan to turn it into a weekly activity, I think it is worth it. The best way to ensure that it is rare is to be careful when pushing stuff to the repo.

grhawk commented 7 years ago

This could also be the right moment to separate the examples from the repo of the actual code. It would make code revision much, much simpler...