kellykochanski / rescal-snow

A model of dunes and snow-waves
GNU General Public License v3.0

Eliminate large files from git history #5

Open r-barnes opened 5 years ago

r-barnes commented 5 years ago

The repo contains a number of large files that you likely wanted to ignore - the largest are listed below. This collectively means that the repo is a 100MB download.

41e6f427c11b  7.7MiB analysis/output_files/ALT_DATA2_OUT/fft/fft_results.gif
a8267f9be190  7.9MiB analysis/output_files/results_1/xcor/cross-correlations.txt
e271bcab6381   11MiB analysis/output_files/results_1/fft/fft_results.gif
669261e09a05   21MiB analysis/output_data/ALT_DATA1_OUT/xcor/cross-correlations.txt
36cbe3d82cf2   36MiB scripts/core.45511
4ac01836f00a   36MiB scripts/core.53132
9c2bb6f1759f   36MiB scripts/core.171982
a6cecc16b57b   57MiB analysis/output_data/ALT_DATA1_OUT/fft/fft_analysis_animation.gif
6def6506d3f7   66MiB scripts/GENESIS.log

These can be removed with the BFG Repo-Cleaner (https://rtyley.github.io/bfg-repo-cleaner/) using the following commands:

git clone --mirror https://github.com/kellykochanski/rescal-snow.git
java -jar ~/Downloads/bfg-1.12.13.jar --delete-folders 'output_files'  rescal-snow.git
java -jar ~/Downloads/bfg-1.12.13.jar --delete-folders 'output_data'  rescal-snow.git
java -jar ~/Downloads/bfg-1.12.13.jar --delete-files 'core.*'  rescal-snow.git
java -jar ~/Downloads/bfg-1.12.13.jar --delete-files 'GENESIS.log'  rescal-snow.git
java -jar ~/Downloads/bfg-1.12.13.jar --delete-files '*.o'  rescal-snow.git
java -jar ~/Downloads/bfg-1.12.13.jar --delete-files '*.py~'  rescal-snow.git
#Perhaps the `scripts/DUN.csp` file is also a temporary? It takes up 10MB.

after which you should check to make sure things look alright and then

cd rescal-snow.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive

The upside is that this reduces the repo size to either 11MB (with DUN.csp) or 1MB (without DUN.csp), which saves bandwidth and space for users.
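A listing like the one above can be produced with plain git. The following is a minimal sketch, not necessarily how the original list was generated; the function name `largest_blobs` is hypothetical, and sizes are reported as uncompressed bytes rather than MiB.

```shell
#!/bin/sh
# Print the N largest blobs reachable from any ref,
# one per line as "<bytes> <object id> <path>".
largest_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        awk '$1 == "blob"' |
        cut -d' ' -f2- |
        sort -n |
        tail -n "${1:-10}"
}

# Only run when invoked from inside a git repository.
if git rev-parse --git-dir >/dev/null 2>&1; then
    largest_blobs 10
    # Summarise the on-disk size; compare the numbers before and after the gc step.
    git count-objects -vH
fi
```

Running it once before the BFG pass and once after `git gc` gives a rough check that the large objects are really gone.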

kellykochanski commented 5 years ago

Q: several of those large files are graphics useful for examples and documentation.

How can I leave them in, say, the online readme without requiring readers to download them?

r-barnes commented 5 years ago

They must be in the repo to appear in the readme, unless you host them elsewhere.

However, I don't think any of the files I've suggested purging are currently used by the repo. I believe they are all large files that were mistakenly committed in the past. Removing files with git rm doesn't remove them from the history, so the repo only ever grows in size unless you rewrite history.
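To illustrate the point about git rm, here is a small sketch in a throwaway repository. The file name `core.12345` and the commit messages are made up for the demonstration; nothing here touches rescal-snow itself.

```shell
#!/bin/sh
# Create a scratch repository with a local identity for committing.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

# Commit a stand-in for an accidentally added large file.
head -c 1000000 /dev/urandom > core.12345
git add core.12345
git commit -q -m "Accidentally commit a large file"
blob=$(git rev-parse HEAD:core.12345)

# Remove it the ordinary way.
git rm -q core.12345
git commit -q -m "Remove the large file"

# The file is gone from the working tree...
if [ ! -e core.12345 ]; then
    echo "working tree: file removed"
fi

# ...but its blob is still reachable from the first commit.
still_in_history=no
if git rev-list --objects --all | grep -q "$blob"; then
    still_in_history=yes
fi
echo "blob still in history: $still_in_history"
```

Every clone still downloads that blob, which is why a history rewrite is needed to shrink the repo.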

The files you show on the readme are stored in example_images and take only 3.2MB. They should be unaffected by the commands I suggest above.

r-barnes commented 5 years ago

@kellykochanski: I thought we were fixing this prior to JOSS?

kellykochanski commented 5 years ago

I haven't had time to get to it, and don't want to rush into messing with the git history.

r-barnes commented 5 years ago

Okay. Can we chat about it prior to JOSS acceptance?

kellykochanski commented 5 years ago

@r-barnes I used bfg as you suggested, and the repo is now 14MB (including the removal of DUN.csp - I think some additional docs with figures have been added since you opened this).

r-barnes commented 5 years ago

Doing this before merging outstanding PRs could make doing so impossible or difficult...

kellykochanski commented 5 years ago

bfg warned me... Any issue with just repeating the bfg calls after accepting the PRs?

zbeekman commented 5 years ago

I just went through a similar process with another repository, although the issue was more related to pruning and relocating sensitive information prior to open-sourcing a software package. I discovered that GitHub has write-protected refs for PRs. This means that you cannot prune data from these by default.

However, I think I have special settings in my git config to fetch these PR refs that most users do not have, so this may not be a real issue (at least not if you're only concerned about repo file size; it certainly is when you're removing sensitive info).
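For readers wondering what such settings might look like: one common way to make `git fetch` pull PR refs is an extra fetch refspec. This is only a guess at the kind of configuration meant here, not necessarily the actual settings; the `refs/pull/*/head` source and the `origin/pr/*` destination are assumptions based on GitHub's usual layout. The sketch only writes the config in a scratch repository and does not contact the network.

```shell
#!/bin/sh
# Scratch repository with a remote that is never fetched in this sketch.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1
git remote add origin https://github.com/kellykochanski/rescal-snow.git

# Assumed refspec: also fetch each PR head into refs/remotes/origin/pr/<number>.
git config --add remote.origin.fetch '+refs/pull/*/head:refs/remotes/origin/pr/*'

# Show both refspecs: the default branch one and the added PR one.
git config --get-all remote.origin.fetch
```

A clone without the extra refspec never downloads the PR refs, which is why they may not matter for the download size most users see.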

If it turns out that the PR refs keep the repository size bloated, then the only solutions are either:

  1. Contacting GitHub support and asking them to delete the old PR refs (I'm not sure if they can/will do this for you)
  2. Deleting and recreating the repository.

Hopefully you won't need to do either and the PR refs won't muck this up for you.

r-barnes commented 5 years ago

@zbeekman: Cool idea! So that cleans the whole repo and associated PRs all at once?

zbeekman commented 5 years ago

@r-barnes

Cool idea! So that cleans the whole repo and associated PRs all at once?

Not 100% sure what you're talking about here. If it's my point 2, "Deleting and recreating the repository", then I need to explain a little bit further.

What I really mean is:

  1. Move/rename the original repository (or at least keep a local backup clone as a copy, in addition to the one you plan to run BFG on, and then delete the original)
  2. See if you have any PR refs in your local bare/mirrored repository with `git show-ref`
  3. Delete the protected GitHub PR refs with `git update-ref -d refs/.../...`
  4. Run BFG to eliminate bloat
  5. Run the `git reflog` and `git gc` commands recommended by BFG
  6. Create a new empty repo
  7. Push to the new repo with `git push --mirror`, or whatever BFG recommends
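Steps 2 and 3 can be sketched as follows in a throwaway mirror. This is an illustration under assumptions, not a tested recipe for GitHub: it assumes the PR refs sit under `refs/pull/`, and it fakes one such ref locally so the script runs without network access. The BFG, gc, and push steps are left out.

```shell
#!/bin/sh
# Build a small source repository and fake a PR ref in it.
work=$(mktemp -d)
git init -q "$work/src"
cd "$work/src" || exit 1
git config user.name "Example"
git config user.email "example@example.com"
echo hello > README
git add README
git commit -q -m "Initial commit"
git update-ref refs/pull/1/head HEAD   # stand-in for a GitHub PR ref

# A mirror clone copies every ref, including the PR ref.
git clone -q --mirror "$work/src" "$work/mirror.git"
cd "$work/mirror.git" || exit 1

# Step 2: list the refs and look for PR refs.
git show-ref

# Step 3: delete each PR ref from the local mirror.
git for-each-ref --format='%(refname)' refs/pull |
while read -r ref; do
    git update-ref -d "$ref"
done

remaining=$(git for-each-ref refs/pull | wc -l | tr -d ' ')
echo "PR refs remaining: $remaining"
```

Deleting the refs locally does not remove them on the GitHub side; it only keeps them out of the mirror that is rewritten and pushed.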

I would not recommend this, unless the repo size stays large after a normal pass with BFG. Even then, it's much easier to contact GitHub support and ask if they can delete the old protected PR refs.

I had to go through this procedure because I realized that upon open sourcing a repository, you could still access old PR refs which included the sensitive information that cannot be made public. If you do not need to do it, then please don't.

Also, if you haven't run BFG yet to prune history, you may want to do it either before the final submission or not at all; I'm not sure if it will mess with JOSS' machinery, DOI process, etc. and it will certainly affect tagging.

kellykochanski commented 5 years ago

@zbeekman I ran bfg on the repository, though the changes were rejected from the then-open PR on kk/JOSS-fixes. Downloading rescal-snow is now down to 14MB from ~100MB.

I expect to have all open PRs closed at the time of JOSS acceptance, and will re-run bfg then - I can do this after finishing the corrections in your review, and merging the kk/JOSS-fixes branch, but before formal JOSS acceptance.

I hope bfg will work smoothly if all PRs are closed... Let me know if you think that it won't.

zbeekman commented 5 years ago

@kellykochanski: Yes it should work fine. IMO, you have images and stuff for the tutorials, and 14MB is probably how much space everything you want to keep takes up. But at the end of the day, I wouldn't bother with any steps that are more complicated than what you are doing. If you get complaints about rejected refs when you try to push due to PR refs, you can just delete them locally then try pushing again. (They will persist on the GitHub side, but I suspect this is fine and most people don't fetch them.)

r-barnes commented 5 years ago

@zbeekman: The issue is that the repo's history contains ~86MB worth of large temporary and output files which we accidentally committed and later removed.

zbeekman commented 5 years ago

[Edited for improved clarity 🤞]

@r-barnes: I'll pipe down and let you guys figure out what you want to do. My point was that it sounds like Kelly had success with BFG and got things down to 14MB. Deleting the entire github repository and re-creating it is (hopefully) beyond the scope of what you want/need to accomplish. At any rate, sorry for the confusion and feel free to ignore my previous comments.

If you run into troubles pushing back up to github after running BFG, let me know, it might be the PR refs issue, and I may know the solution. Either way I'd happily take a look.

r-barnes commented 5 years ago

@zbeekman: No worries, thanks for your help.
