Open Andrew-S-Rosen opened 1 year ago
Tagging @janosh just because I feel like he appreciates these kinds of things.
Oh snap! I had a chat about Sphinx with @jic198. I've personally not used Sphinx yet, but this seems like something Jianli might want to be aware of.
Hi @arosen93. Thanks for this, but I can't seem to find this file anywhere in the repo. Can you let me know the command you used to display the above information? Also, do you have an example command to use with BFG Repo-Cleaner to filter these files?
Sure! To generate a list of files in the commit history, sorted by size (based on this Stack Exchange post):
git clone https://github.com/materialsproject/atomate2.git
cd atomate2
git rev-list --objects --all |
git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
sed -n 's/^blob //p' |
sort --numeric-sort --key=2 |
cut -c 1-12,41- |
$(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
That produces the following (truncated):
12c86022b55f 5.0MiB .doctrees/environment.pickle
df99d6cf47ab 5.0MiB .doctrees/environment.pickle
017e8f108ab6 5.2MiB .doctrees/environment.pickle
c58ef06ef889 5.2MiB .doctrees/environment.pickle
3fc372356e04 5.2MiB .doctrees/environment.pickle
4693e7b8b4f1 5.3MiB .doctrees/environment.pickle
df223b93979c 5.3MiB .doctrees/environment.pickle
330457357bcc 5.3MiB .doctrees/environment.pickle
eb2f9fc502e9 5.4MiB .doctrees/environment.pickle
6a4f7368be42 5.4MiB _build/.doctrees/environment.pickle
bee7019c3cd2 7.9MiB tests/test_data/vasp/GaN_Mg_defect/bulk_relax/outputs/LOCPOT.gz
c19719afe3dd 12MiB .doctrees/environment.pickle
98da941b0575 12MiB tests/test_data/vasp/Si_config_coord/launcher_2022-04-11-21-57-18-132969/WAVECAR.gz
fd297184f16a 12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=0/outputs/LOCPOT.gz
5162fb49b9b9 12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=1/outputs/LOCPOT.gz
c717b8cc7d22 12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=-1/outputs/LOCPOT.gz
92d23b600e50 12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=-2/outputs/LOCPOT.gz
14155704594b 31MiB tests/test_data/vasp/BaTe_static/outputs/LOCPOT
4f1f09372be8 36MiB tests/test_data/lobster/lobsteroutputs/mp-754354/projectionData.lobster.gz
cac7cc7b512f 37MiB tests/test_data/lobster/lobsteroutputs/mp-2534/projectionData.lobster.gz
The reason you likely don't see the Sphinx cache files is that they are only present when your docs are being deployed, so they aren't visible in the repo. Adding the relevant gitignore pattern will resolve the issue going forward.
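The gitignore step could look roughly like the sketch below. The function name ignore_doctrees is made up for illustration, and it assumes it is run from the root of a working clone. It only stops the cache from being tracked in future commits; it does not touch the existing history.

```shell
# Hypothetical helper (not part of atomate2): ignore Sphinx doctree caches
# and stop tracking any that are already committed on the current branch.
ignore_doctrees() {
    # Append the pattern only if .gitignore does not already contain it.
    grep -qxF '*.doctrees*' .gitignore 2>/dev/null || echo '*.doctrees*' >> .gitignore
    # Drop matching files from the index but leave them on disk.
    git rm -r --cached --quiet --ignore-unmatch -- '*.doctrees*'
    # Stage the updated ignore file alongside the removals.
    git add .gitignore
}

# Usage, from the repository root:
#   ignore_doctrees && git commit -m "Ignore Sphinx doctree cache"
```

The quoted '*.doctrees*' is passed to git as a pathspec, so it should match both .doctrees/environment.pickle and _build/.doctrees/environment.pickle.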
To remove the files using BFG:
# make a backup first of the repo just in case
git clone https://github.com/materialsproject/atomate2.git
cd atomate2
java -jar bfg-1.14.0.jar --delete-files environment.pickle
Then rerun the git history search command, which returns:
e8f4e9846190 3.1MiB tests/test_data/vasp/Si_CCD.bk/Si_CCD.tar
6dcba2466c58 3.8MiB tests/test_data/lobster/NaCl_lobster_run_0/outputs/projectionData.lobster.gz
b221def44971 4.3MiB tests/test_data/vasp/Si_optics/static/outputs/CHGCAR.gz
4a2d959c78de 4.3MiB tests/test_data/vasp/Si_band_structure/static/outputs/CHGCAR.gz
e0177fe689d6 4.5MiB tests/test_data/vasp/NaCl_static_relax_lobs/additional_static/outputs/CHGCAR.gz
a45a39d9b083 4.6MiB tests/test_data/vasp/Si_hse_optics/hse_static/outputs/CHGCAR.gz
d066574fca12 4.6MiB tests/test_data/vasp/Si_hse_band_structure/hse_static/outputs/CHGCAR.gz
9b05371a1fd6 4.8MiB tests/test_data/vasp/NaCl_static_relax_lobs/relax_1/outputs/LOCPOT.gz
b327a041092f 4.8MiB tests/test_data/vasp/NaCl_static_relax_lobs/relax_2/outputs/LOCPOT.gz
a6c977a2566f 4.8MiB tests/test_data/vasp/NaCl_static_relax_lobs/static_run/outputs/LOCPOT.gz
7a8776e3fe3a 4.9MiB tests/test_data/vasp/NaCl_static_relax_lobs/additional_static/outputs/LOCPOT.gz
bee7019c3cd2 7.9MiB tests/test_data/vasp/GaN_Mg_defect/bulk_relax/outputs/LOCPOT.gz
98da941b0575 12MiB tests/test_data/vasp/Si_config_coord/launcher_2022-04-11-21-57-18-132969/WAVECAR.gz
fd297184f16a 12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=0/outputs/LOCPOT.gz
5162fb49b9b9 12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=1/outputs/LOCPOT.gz
c717b8cc7d22 12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=-1/outputs/LOCPOT.gz
92d23b600e50 12MiB tests/test_data/vasp/GaN_Mg_defect/relax_Mg_Ga-0_q=-2/outputs/LOCPOT.gz
14155704594b 31MiB tests/test_data/vasp/BaTe_static/outputs/LOCPOT
4f1f09372be8 36MiB tests/test_data/lobster/lobsteroutputs/mp-754354/projectionData.lobster.gz
cac7cc7b512f 37MiB tests/test_data/lobster/lobsteroutputs/mp-2534/projectionData.lobster.gz
An optional but recommended final step that takes a little while:
git reflog expire --expire=now --all && git gc --prune=now --aggressive
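To see whether the cleanup actually shrank the repository, one option is to compare the packed size before and after running BFG and the gc step. A small sketch using git count-objects; the helper name pack_size_kib is hypothetical:

```shell
# Hypothetical helper: print the on-disk size of the packed objects in KiB,
# as reported on the "size-pack" line of `git count-objects -v`.
pack_size_kib() {
    git count-objects -v | awk '/^size-pack:/ {print $2}'
}

# Usage, inside the clone:
#   before=$(pack_size_kib)
#   ...run BFG and the reflog/gc step...
#   after=$(pack_size_kib)
#   echo "pack size: ${before} KiB -> ${after} KiB"
```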
Then force push the modified repo.
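For a regular (non-mirror) clone like the one above, the force push might look like the following sketch. The helper name is made up, and the remote name origin is an assumption. Note that --all only pushes local branches, so branches that exist only on the remote would keep their old history; the BFG documentation works from a git clone --mirror copy for that reason.

```shell
# Hypothetical helper: overwrite every local branch and tag on the remote
# with the rewritten history. The remote defaults to "origin" (an assumption).
force_push_rewritten() {
    local remote="${1:-origin}"
    git push --force --all "$remote" && git push --force --tags "$remote"
}

# Usage, inside the cleaned clone:
#   force_push_rewritten origin
```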
Note: I'm not sure if the files will return if you merge a PR that has the old git history, so just keep an eye out.
Since it involves a force push, might be good to merge as many PRs as possible beforehand. Any PRs open at the time of force push will have to be rebased which some contributors might be unfamiliar with.
I'm going to close this, I think it is too late to do anything about it now.
i would keep this open. we just need to find a good time to do a force push. it shouldn't even cause any merge conflicts given no one edits this file. the only issue is divergent histories, meaning contributors need to do a git fetch && git reset --hard origin/main to get back on the main branch afterwards. we would add that command in a big warning at the top of the readme and leave it there for 2-3 months.
old PRs won't be able to be merged without doing this and so won't reintroduce those files as @Andrew-S-Rosen feared.
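For contributors, the resync described above could be wrapped up as in the sketch below. The helper name and the backup branch name are made up, and it assumes the remote is origin and the default branch is main, as in the command above. Keep in mind that git reset --hard discards uncommitted changes in the working tree.

```shell
# Hypothetical helper: move a local clone onto the rewritten main branch,
# keeping a pointer to the old local state first.
resync_main() {
    git checkout --quiet main || return 1
    # Save the pre-rewrite tip under a throwaway branch name (assumed).
    git branch --force backup-before-rewrite main || return 1
    git fetch --quiet origin || return 1
    # Point local main at the rewritten remote history.
    git reset --hard --quiet origin/main
}

# Usage, inside an existing clone:
#   resync_main
#   git branch -D backup-before-rewrite   # once nothing from it is needed
```

The backup branch still references the old history, so the large blobs stay in the local clone until that branch is deleted and garbage collection runs.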
Ok, I re-opened. Hopefully we can whittle down the open PRs over the next month or so.
I discovered this in my repos and found it in yours too, which isn't surprising because I've basically copied your docs building process... :)
When Sphinx builds your docs, it makes a cache file named .doctrees/environment.pickle that's ~5 MB in size and is committed to your history each time you rebuild the docs. This cache file is not needed in your git history and only adds to the cloned repo size. Here are some examples for your atomate2 repo:
I'd recommend adding *.doctrees* to your .gitignore and (optionally) using BFG Repo-Cleaner to clean them from your history. Some details of this pickle file are here.