materialsproject / atomate2

atomate2 is a library of computational materials science workflows
https://materialsproject.github.io/atomate2/

BUG: Sphinx is caching docs data that increases your `.git` folder size #345

Open Andrew-S-Rosen opened 1 year ago

Andrew-S-Rosen commented 1 year ago

I discovered this in my repos and found it in yours too, which isn't surprising because I've basically copied your docs building process... :)

When Sphinx builds your docs, it writes a cache file named .doctrees/environment.pickle that's ~5 MB in size and gets committed to your history each time you rebuild the docs. This cache file is not needed in your git history and only adds to the cloned repo size.

Here are some examples from your atomate2 repo:

12c86022b55f  5.0MiB .doctrees/environment.pickle
df99d6cf47ab  5.0MiB .doctrees/environment.pickle
017e8f108ab6  5.2MiB .doctrees/environment.pickle
c58ef06ef889  5.2MiB .doctrees/environment.pickle
3fc372356e04  5.2MiB .doctrees/environment.pickle
4693e7b8b4f1  5.3MiB .doctrees/environment.pickle
df223b93979c  5.3MiB .doctrees/environment.pickle
330457357bcc  5.3MiB .doctrees/environment.pickle
eb2f9fc502e9  5.4MiB .doctrees/environment.pickle
6a4f7368be42  5.4MiB _build/.doctrees/environment.pickle

I'd recommend adding *.doctrees* to your .gitignore and (optionally) using BFG Repo-Cleaner to clean them from your history. Some details of this pickle file are here.
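
A minimal sketch of the .gitignore step, run in a throwaway repository so it is safe to try. The scratch directory and the use of `git check-ignore` to confirm the pattern are illustrative assumptions, not part of the atomate2 setup.

```shell
# Try the ignore pattern in a throwaway repo (hypothetical scratch directory).
scratch=$(mktemp -d)
cd "$scratch" && git init -q

# The pattern suggested above; it matches a .doctrees directory at any depth.
printf '%s\n' '*.doctrees*' >> .gitignore

# Simulate the Sphinx cache in both locations seen in the history listing.
mkdir -p .doctrees _build/.doctrees
touch .doctrees/environment.pickle _build/.doctrees/environment.pickle

# check-ignore exits 0 and names the matching rule when a path is ignored.
git check-ignore -v .doctrees/environment.pickle _build/.doctrees/environment.pickle
```

Note that an ignore rule only affects untracked files. If a cache file is already tracked on some branch, it would also need `git rm --cached` there before the rule takes effect.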

Andrew-S-Rosen commented 1 year ago

Tagging @janosh just because I feel like he appreciates these kinds of things.

janosh commented 1 year ago

Oh snap! I had a chat about Sphinx with @jic198. I've personally not used Sphinx yet, but this seems like something Jianli might want to be aware of.

utf commented 1 year ago

Hi @arosen93. Thanks for this, but I can't seem to find this file anywhere in the repo. Can you let me know the command you used to display the above information? Also, do you have an example command to use with BFG Repo-Cleaner to filter these files?

Andrew-S-Rosen commented 1 year ago

Sure! To generate a sorted list of files in the commit history (based on this stack exchange post):

git clone https://github.com/materialsproject/atomate2.git
cd atomate2
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

That produces the following (truncated):

12c86022b55f  5.0MiB .doctrees/environment.pickle
df99d6cf47ab  5.0MiB .doctrees/environment.pickle
017e8f108ab6  5.2MiB .doctrees/environment.pickle
c58ef06ef889  5.2MiB .doctrees/environment.pickle
3fc372356e04  5.2MiB .doctrees/environment.pickle
4693e7b8b4f1  5.3MiB .doctrees/environment.pickle
df223b93979c  5.3MiB .doctrees/environment.pickle
330457357bcc  5.3MiB .doctrees/environment.pickle
eb2f9fc502e9  5.4MiB .doctrees/environment.pickle
6a4f7368be42  5.4MiB _build/.doctrees/environment.pickle
bee7019c3cd2  7.9MiB tests/test_data/vasp/GaN_Mg_defect/bulk_relax/outputs/LOCPOT.gz
c19719afe3dd   12MiB .doctrees/environment.pickle
98da941b0575   12MiB tests/test_data/vasp/Si_config_coord/launcher_2022-04-11-21-57-18-132969/WAVECAR.gz
fd297184f16a   12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=0/outputs/LOCPOT.gz
5162fb49b9b9   12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=1/outputs/LOCPOT.gz
c717b8cc7d22   12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=-1/outputs/LOCPOT.gz
92d23b600e50   12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=-2/outputs/LOCPOT.gz
14155704594b   31MiB tests/test_data/vasp/BaTe_static/outputs/LOCPOT
4f1f09372be8   36MiB tests/test_data/lobster/lobsteroutputs/mp-754354/projectionData.lobster.gz
cac7cc7b512f   37MiB tests/test_data/lobster/lobsteroutputs/mp-2534/projectionData.lobster.gz
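
To put a single number on the cache files, the same `git cat-file --batch-check` output can be summed per file name. This is a sketch only: `sum_blob_bytes` is a hypothetical helper name, and it assumes the `<type> <hash> <size> <path>` line format used in the pipeline above.

```shell
# Sum the bytes of every blob whose path ends in the given file name.
# Reads "<type> <hash> <size> <path>" lines on stdin.
sum_blob_bytes() {
  awk -v name="$1" '
    $1 == "blob" {
      path = $0
      sub(/^[^ ]+ [^ ]+ [^ ]+ /, "", path)   # drop type, hash, size; keep the path
      n = split(path, parts, "/")
      if (n > 0 && parts[n] == name) total += $3
    }
    END { printf "%d\n", total }
  '
}

# Usage inside a clone (commented out; needs a git repository):
# git rev-list --objects --all |
#   git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#   sum_blob_bytes environment.pickle
```

The result is the uncompressed total in bytes, so it overstates what the blobs cost inside a packfile, but it gives a rough upper bound on what a cleanup could recover.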

You likely don't see the Sphinx cache files because they are only present when your docs are being deployed, so they aren't visible in the repo. Adding the relevant gitignore pattern will resolve the issue going forward.

To remove the files using BFG:

# make a backup first of the repo just in case
git clone https://github.com/materialsproject/atomate2.git
cd atomate2

java -jar bfg-1.14.0.jar --delete-files environment.pickle

Then rerun the git history search command, which returns:

e8f4e9846190  3.1MiB tests/test_data/vasp/Si_CCD.bk/Si_CCD.tar
6dcba2466c58  3.8MiB tests/test_data/lobster/NaCl_lobster_run_0/outputs/projectionData.lobster.gz
b221def44971  4.3MiB tests/test_data/vasp/Si_optics/static/outputs/CHGCAR.gz
4a2d959c78de  4.3MiB tests/test_data/vasp/Si_band_structure/static/outputs/CHGCAR.gz
e0177fe689d6  4.5MiB tests/test_data/vasp/NaCl_static_relax_lobs/additional_static/outputs/CHGCAR.gz
a45a39d9b083  4.6MiB tests/test_data/vasp/Si_hse_optics/hse_static/outputs/CHGCAR.gz
d066574fca12  4.6MiB tests/test_data/vasp/Si_hse_band_structure/hse_static/outputs/CHGCAR.gz
9b05371a1fd6  4.8MiB tests/test_data/vasp/NaCl_static_relax_lobs/relax_1/outputs/LOCPOT.gz
b327a041092f  4.8MiB tests/test_data/vasp/NaCl_static_relax_lobs/relax_2/outputs/LOCPOT.gz
a6c977a2566f  4.8MiB tests/test_data/vasp/NaCl_static_relax_lobs/static_run/outputs/LOCPOT.gz
7a8776e3fe3a  4.9MiB tests/test_data/vasp/NaCl_static_relax_lobs/additional_static/outputs/LOCPOT.gz
bee7019c3cd2  7.9MiB tests/test_data/vasp/GaN_Mg_defect/bulk_relax/outputs/LOCPOT.gz
98da941b0575   12MiB tests/test_data/vasp/Si_config_coord/launcher_2022-04-11-21-57-18-132969/WAVECAR.gz
fd297184f16a   12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=0/outputs/LOCPOT.gz
5162fb49b9b9   12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=1/outputs/LOCPOT.gz
c717b8cc7d22   12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=-1/outputs/LOCPOT.gz
92d23b600e50   12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=-2/outputs/LOCPOT.gz
14155704594b   31MiB tests/test_data/vasp/BaTe_static/outputs/LOCPOT
4f1f09372be8   36MiB tests/test_data/lobster/lobsteroutputs/mp-754354/projectionData.lobster.gz
cac7cc7b512f   37MiB tests/test_data/lobster/lobsteroutputs/mp-2534/projectionData.lobster.gz
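
Rather than scanning the listing by eye, the cleanup can be checked by counting how many reachable objects still carry the file name. A sketch, with `count_in_history` as a hypothetical helper name; it has to be run from inside the rewritten clone.

```shell
# Count objects reachable from any ref whose path ends in the given file name.
count_in_history() {
  git rev-list --objects --all | awk -v name="$1" '
    {
      path = $0
      sub(/^[0-9a-f]+ ?/, "", path)          # strip the object hash
      n = split(path, parts, "/")
      if (n > 0 && parts[n] == name) hits++
    }
    END { printf "%d\n", hits }
  '
}

# Usage inside the cleaned clone (commented out; needs a git repository):
# count_in_history environment.pickle
```

A result of 0 means no ref reaches the file any more. As far as I know, BFG leaves the contents of the latest commit untouched by default, so a non-zero count may just point to a copy that is still present at a branch tip.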

An optional but recommended final step that takes a little while:

git reflog expire --expire=now --all && git gc --prune=now --aggressive
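
To see what the cleanup actually saved, the packed size can be read before and after the command above. A sketch: `pack_size_kib` is a hypothetical helper that parses the `size-pack` field of `git count-objects -v`, which git reports in KiB.

```shell
# Report the packed size of the current repository in KiB.
pack_size_kib() {
  git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}

# Usage inside the cleaned clone (commented out so nothing is rewritten by accident):
# before=$(pack_size_kib)
# git reflog expire --expire=now --all && git gc --prune=now --aggressive
# echo "pack size: ${before} KiB -> $(pack_size_kib) KiB"
```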

Then force push the modified repo.

Note: I'm not sure if the files will return if you merge a PR that has the old git history, so just keep an eye out.

janosh commented 1 year ago

Since it involves a force push, it might be good to merge as many PRs as possible beforehand. Any PRs open at the time of the force push will have to be rebased, which some contributors might be unfamiliar with.

utf commented 8 months ago

I'm going to close this. I think it is too late to do anything about it now.

janosh commented 8 months ago

I would keep this open. We just need to find a good time to do a force push. It shouldn't even cause any merge conflicts, given that no one edits this file. The only issue is divergent histories, meaning contributors need to run git fetch && git reset --hard origin/main to get back on the main branch afterwards. We would add that command in a big warning at the top of the readme and leave it there for 2-3 months.

Old PRs won't be mergeable without doing this, and so won't reintroduce those files as @Andrew-S-Rosen feared.
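
The re-sync step for contributors could be wrapped so that the old local branch is kept under a backup name before the hard reset. This is only a sketch: `resync_branch` and the `backup/...` naming are hypothetical, and it assumes the remote is called `origin`.

```shell
# Re-sync a local branch with a rewritten remote branch, keeping a backup ref.
resync_branch() {
  branch=${1:-main}
  backup="backup/${branch}-$(date +%Y%m%d%H%M%S)"
  git fetch origin &&
    git branch "$backup" "$branch" &&       # remember the pre-rewrite history
    git checkout -q "$branch" &&
    git reset --hard "origin/$branch" &&    # discards local changes on the branch
    echo "previous $branch saved as $backup"
}

# Usage inside a contributor's clone (commented out; it resets the working tree):
# resync_branch main
```

Keeping the backup branch means any unpushed commits can still be cherry-picked onto the new history. It also keeps the old objects alive locally until that branch is deleted.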

utf commented 8 months ago

Ok, I re-opened. Hopefully we can whittle down the open PRs over the next month or so.