icesat2py / icepyx

Python tools for obtaining and working with ICESat-2 data
https://icepyx.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
213 stars 107 forks source link

large files in git history #114

Closed JessicaS11 closed 4 years ago

JessicaS11 commented 4 years ago

Hi @JessicaS11, can't help but notice that you added two Tandem X DEM files in https://github.com/icesat2py/icepyx/pull/95/commits/aa6b5fb49ef5635e0428290ed5ba2cc943fabab8 to the repository which are used for the 'ICESat-2_DEM_comparison_Colombia_working.ipynb' tutorial. However, they're rather big files (25.7MB and 8.65MB), and git isn't usually meant for storing data files like this, so I just wanted to make sure (before you make any new commits) that this wasn't accidental/intentional?

Originally posted by @weiji14 in https://github.com/icesat2py/icepyx/pull/95#issuecomment-659309089

Thanks for pointing this out, @weiji14. I think these files have been present for awhile, but they were moved in the commit you point out. In looking into it, there are a few hdf5 files that are much larger that were accidentally uploaded early on and obviously now show up in the git history too. From my experience, the only way to get rid of them is to rewrite the git history. While I think that would ultimately be a good move (maybe we could even clean up the early commit history in the process), I'm hesitant to do so because of the downstream effects if every user does not rebase and remove them from their history, risking reintroduction (I had to remove some large files from another repo, and this was a challenge with a much smaller group). Or if I force change the master and development branches would we be able to catch this at the PR level if someone submits a PR with a branch that contains the large files (though presumably this would require an extra reviewer step to check the git history of the submitted PR)? Do you know if there are any alternative options or other ways to proceed given the group nature of the project (and entry-level git usage)?

JessicaS11 commented 4 years ago

@scottyhq Any ideas, since you helped solve this issue in the hackweek repos?

scottyhq commented 4 years ago

@JessicaS11 - yes I'd recommend scrubbing big files from the commit history, which would mean contributors need to re-sync with this upstream repository. It's a bit disruptive, but better than a bloated repository.

I see the following big files (using second answer here https://stackoverflow.com/questions/10622179/how-to-find-identify-large-commits-in-git-history)

84124d0a539d  3.9MiB doc/examples/ICESat-2_ATL08Parse_and_Visualize.ipynb
4ab8cf835487  8.0MiB dev-notebooks/ICESat-2Hackweek_tutorial_locations_draft.ipynb
c17285052462  8.6MiB examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea_hs.tif
01b7705584a9  8.9MiB dev-notebooks/ICESat-2Hackweek_tutorial_locations_draft.ipynb
19372def82df   26MiB examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea.tif
caf019c3c4de   42MiB icepyx/data/ATL06_20190101001723_00540210_209_01.h5
5b45ad2b7edb   46MiB icepyx/data/ATL06_20190101003047_00540212_209_01.h5
72cc69254ca8   76MiB icepyx/data/ATL06_20190101002504_00540211_209_01.h5

You'll have to follow this documentation to remove the files https://docs.github.com/en/github/authenticating-to-github/removing-sensitive-data-from-a-repository

A couple other suggestions:

  1. Note that you can add *tif and *h5 to .gitignore to avoid this happening in the future and remove output cells before saving jupyter notebooks.

  2. A versatile solution is to use a git "pre-commit hook" to check file sizes before committing (https://github.com/pre-commit/pre-commit-hooks#hooks-available).

  3. If people create pull requests with large files you'll see them in the list of files, and can therefore have them removed before merging. You could also consider adding a GitHub Action that ensures PR files are <X MB, which has to pass before merging. That way creators of a PR would see the issue in a more automated way.

JessicaS11 commented 4 years ago

Thanks @scottyhq. These excellent resources are all open tabs for me (that's how I found the big files to begin with!) and I was just hoping for a less potentially disruptive solution. I'm guessing this means that any existing branches will need to be closed or rebased to prevent reintroduction, right? Is the best way to do this going to be fix the master branch directly locally, force push the changes to GH, and then force push the changes up through the other branches, resolving conflicts as we go? I like the idea of a pre-commit hook or GitHub Action to prevent future issues (and I thought that h5 and tif were in .gitignore, but they're not!).

weiji14 commented 4 years ago

There's currently a movement towards replacing the 'master' branch name with a more neutral name like 'main' (see https://github.com/github/renaming), and this ties in with our code of conduct to be mindful of our language. Such changes are always disruptive, but I think it presents an opportunity for us to 'tidy' up our history here. This is one way it would work:

  1. Create a new branch called 'main' (or some other name) that would be a copy of the current 'master' branch without the big files.
  2. Rebase our 'development' branch (and other work in progress branches) on top of this new 'main' branch
  3. Once the migration is done, remove the 'master' branch (or we can keep it to preserve history).

Theoretically this would be a win-win, since we get the new 'main' branch name, and also prevent a force push from happening. I haven't tested it out yet though, but I think it is a good idea to consider :smile:

JessicaS11 commented 4 years ago

Great idea @weiji14. I'd been thinking about when/how to switch to the use of "main", so this seems like a great opportunity to make several big changes at once (and even make some of them easier). I've got a few other boxes to check and then hopefully I can get started on this!

weiji14 commented 4 years ago

Been having trouble finding the *.h5 files in the git history, does anyone know when they were inserted? Specifically the commit date and/or hash.

I can isolate the two Tandem X DEM .tif files, but was wondering why the .h5 files seem so hard to track down. Will need to find that commit (let's call it commit abcdef) so that we can branch off right before those big files were added (i.e. keep the git history intact before commit abcdef).

Just to be careful, I'll create an empty 'main' branch right at commit abcdef, and a mirror main-clean branch which is based on the current master branch (but without the large files). That way, I can open up a Pull Request from main-clean to main and everyone can double-check to make sure things look alright.

Edit: Nevermind, I've found the *.h5 files, they were sitting on a side branch so we don't have to remove them. Also created a draft PR to handle the 'master' to 'main' migration at #130 (ready for comments).

JessicaS11 commented 4 years ago

@scottyhq I get a fatal error (Cannot open existing pack file 'objects/pack/*.idx') when I run the join one-liner from the SO post you linked, but I've been using the "blazingly fast shell one-liner" answer with success. I've modified the first line of it to run on specific branches (e.g. git rev-list --objects --branches remotes/origin/master | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sed -n 's/^blob //p' | sort --numeric-sort --key=2 | cut -c 1-12,41- | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

I'm having trouble understanding the output though. I've isolated the h5 files to the get-variables branch, so they (as expected) don't show up if I run the above command as written (i.e. on master). However, the tif files show up on the main branch, which @weiji14 created (see #130) from a commit prior to their addition. A couple questions are:

  1. If the command is only showing all files (including history) for a given branch, why do those large files show up on the main branch?
  2. Is our assumption that we can use migrating to main as a way to "rewrite" (via rebasing, essentially) the history by branching it from a commit prior to the addition of the large files and ensuring that the commits that included adding the large files are squash merged with a commit removing them (i.e. not included in the history), then ultimately deleting the master branch (and any other branches that have the files in their history) a valid approach? Or are we missing something in how git stores history for deleted branches.
JessicaS11 commented 4 years ago

UPDATE: I've rebased the current master branch (so up to tag v.0.3.0) onto the main branch (currently sitting at commit deebe98 (origin/main) fix version, without the large .tif files (see #130). I updated .gitignore to include .tif and other large data files. Now, git diff origin/jsmain origin/master shows

diff --git a/.gitignore b/.gitignore
index ea72257..83468ea 100644
--- a/.gitignore
+++ b/.gitignore
@@ -109,11 +109,3 @@ venv.bak/

 # directory files
 .DS_Store
-
-# data files
-*.h5
-*.hdf
-*.hdf5
-*.csv
-*.zarr
-*.tif
\ No newline at end of file
diff --git a/examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea.tif b/examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea.tif
new file mode 100644
index 0000000..19372de
Binary files /dev/null and b/examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea.tif differ
diff --git a/examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea_hs.tif b/examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea_hs.tif
new file mode 100644
index 0000000..c172850
Binary files /dev/null and b/examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea_hs.tif differ

as I'd expect. What I'd like to do is force push the jsmain branch to origin/master, essentially rewriting the history. Then I could rebase the development branch onto the updated master and force push those changes to origin/development. Then, any open PRs to the current development branch, as long as they are squash merged, shouldn't have any issues (excepting any files on which there are merge conflicts, which we'll have to resolve as we go).

When GitHub finishes developing tools to make it easy to transition from master to main, we'll be the first in line. I'd prefer to make that transition now (as previously discussed), but if the plan I've outlined to remove the large files from the git history will work, not manually making the switch now will save a lot of work as far as having to [re]set up branch rules, permissions, controls, etc. It will also make the transition more seamless for users, since GitHub plans to include warnings/instructions. @weiji14, how does that plan sound? @scottyhq @jhamman @fperez, am I creating a bigger problem than I'm solving with this approach?

weiji14 commented 4 years ago

as I'd expect. What I'd like to do is force push the jsmain branch to origin/master, essentially rewriting the history.

After taking a quick look, it seems like you've done a pretty good job here (essentially rewriting only the history on 'master' after June 2020 but keeping the rest intact), but someone else should double check this. Since you've informed everyone already (via email or otherwise) that this is happening, I'm +1 for going ahead with the rewrite.

Then I could rebase the development branch onto the updated master and force push those changes to origin/development.

Yep, this is ok.

When GitHub finishes developing tools to make it easy to transition from master to main, we'll be the first in line. I'd prefer to make that transition now (as previously discussed), but if the plan I've outlined to remove the large files from the git history will work, not manually making the switch now will save a lot of work as far as having to [re]set up branch rules, permissions, controls, etc. It will also make the transition more seamless for users, since GitHub plans to include warnings/instructions.

I'm happy with this (separating the 'large *.tif file removal' task from the 'master -> main rename task), the other suggestion (combining the two) was really just a throwaway idea. Let's just get on with this as it's been taking up far too much time already :no_mouth:

JessicaS11 commented 4 years ago

I've updated the master and development branches as indicated. For anyone with a current branch open (@annavalentine @bidhya @CharlieHewitt1 @wallinb @fspaolo @trevorskaggs @kelseybisson), so long as your PR to development is SQUASH MERGED, it should not reintroduce the large tif files to the history. However, your PR may show a lot of changes relative to the development branch due to the history rewriting (especially if you have merge conflicts), making it difficult to review (until you've dealt with the merge conflicts). To see what's actually changed and would be included in your PR commit, run git diff origin/<yourbranch> origin/development. Another strategy would be to rebase your branch to the development branch using git rebase -i <your branch name> development and then submit a PR. As always, please get in touch if you have any questions or issues!