newren / git-filter-repo

Quickly rewrite git repository history (filter-branch replacement)

_parse_user aborts if line does not match expected regex #413

Closed larsks closed 1 year ago

larsks commented 1 year ago

I was trying to remove all the release archives from a clone of http://smarden.org/git/runit.git with the following command:

git-filter-repo --invert-paths --path-glob '*.tar.gz' --force

But that fails with:

Traceback (most recent call last):
  File "/home/lars/bin/git-filter-repo", line 4006, in <module>
    main()
  File "/home/lars/bin/git-filter-repo", line 4003, in main
    filter.run()
  File "/home/lars/bin/git-filter-repo", line 3938, in run
    self._parser.run(self._input, self._output)
  File "/home/lars/bin/git-filter-repo", line 1412, in run
    self._parse_tag()
  File "/home/lars/bin/git-filter-repo", line 1291, in _parse_tag
    (tagger_name, tagger_email, tagger_date) = self._parse_user(b'tagger')
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lars/bin/git-filter-repo", line 1078, in _parse_user
    (name, email, when) = user_regex.match(self._currentline).groups()
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'groups'
fatal: stream ends early
fast-import: dumping crash report to .git/fast_import_crash_1855502

Modifying the code to add a breakpoint() just before that line, we see:

(Pdb) p self._currentline
b'tagger Gerrit Pape <pape@smarden.org>\n'

Whereas user_regex is looking for:

(Pdb) p user_regex
re.compile(b'tagger (.*?) <(.*?)> (.*)\n$')

Rewriting the regex to match is trivial, but it wouldn't provide a value for when, so I'm not sure of the correct fix.
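
To make the mismatch concrete, here is a minimal sketch that reproduces it outside of filter-repo. The strict pattern is copied from the pdb output above; `relaxed_regex` is a hypothetical variant (not anything filter-repo ships) that makes the date optional, which shows why `when` would come back empty.

```python
import re

# Pattern exactly as printed by pdb in the report above.
user_regex = re.compile(b'tagger (.*?) <(.*?)> (.*)\n$')

# Hypothetical relaxed pattern: the " <date>" suffix is optional.
relaxed_regex = re.compile(b'tagger (.*?) <(.*?)>(?: (.*))?\n$')

# The offending line from the runit repository.
line = b'tagger Gerrit Pape <pape@smarden.org>\n'

strict_match = user_regex.match(line)    # None: no "> " before a date
relaxed_match = relaxed_regex.match(line)

name, email, when = relaxed_match.groups()
print(strict_match)
print(name, email, when)                 # `when` is None here
```

The relaxed pattern parses the name and email, but any caller would still have to decide what date to use when `when` is `None`, which is the open question raised above.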

newren commented 1 year ago

Thanks for the report.

I'm not sure I want filter-repo to try to handle all forms of repository corruption, and a tag without a tagger date is malformed. You can verify by just running git fsck on the repo. I manually created a repo with such a problem (with a tag named 1.0.0), and it reports as follows:

$ git fsck
error in tag b8c23fecae8bb7925ef4fb18203872adb4a0727f: missingSpaceBeforeDate: invalid author/committer line - missing space before date
Checking object directories: 100% (256/256), done.
Checking objects: 100% (3/3), done.
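
If a repository has many tags, it may help to pull the affected object IDs out of the fsck output rather than reading it by eye. The following is only a sketch: `tags_missing_date` is a hypothetical helper, and it assumes the line format matches the `error in tag ...: missingSpaceBeforeDate:` line shown above.

```python
import re

# Matches lines shaped like the fsck error quoted above (assumed format).
FSCK_LINE = re.compile(r'^error in tag ([0-9a-f]{40}): missingSpaceBeforeDate:')

def tags_missing_date(fsck_output: str) -> list:
    """Return the IDs of tag objects that fsck flags as lacking a date."""
    found = []
    for line in fsck_output.splitlines():
        m = FSCK_LINE.match(line)
        if m:
            found.append(m.group(1))
    return found

# Sample text copied from the fsck run above.
sample = (
    "error in tag b8c23fecae8bb7925ef4fb18203872adb4a0727f: "
    "missingSpaceBeforeDate: invalid author/committer line - missing space before date\n"
    "Checking object directories: 100% (256/256), done.\n"
    "Checking objects: 100% (3/3), done.\n"
)
print(tags_missing_date(sample))
```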

I can manually fix this repository corruption as follows:

$ git cat-file -p 1.0.0 >tmptag

Now, edit tmptag to give it a date (look at the output for other tags to see examples) and save your changes.

$ git update-ref refs/tags/1.0.0 $(git hash-object -w -t tag tmptag)
$ git prune

Granted, if that tag contains a GPG signature, then your editing will invalidate the signature... but running filter-repo is going to strip GPG signatures anyway, so it doesn't really matter for this use case.

Now, for your case, you'd have to replace 1.0.0 in both commands with the actual name of your tag, but once you have, fsck should be clean and filter-repo should work.
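
The manual edit of tmptag could also be scripted. This is a rough sketch under stated assumptions, not part of filter-repo: `add_tagger_date` is a hypothetical helper, and the object ID, tag message, and date in the example are made-up placeholders. It only touches the header (everything before the first blank line) and only a tagger line that ends right after the email.

```python
import re

# A tagger header line that stops at the closing '>' of the email.
DATELESS_TAGGER = re.compile(rb'^(tagger .*? <[^>\n]*>)$', re.MULTILINE)

def add_tagger_date(tag_text: bytes, when: bytes) -> bytes:
    """Append `when` to a date-less tagger line; leave everything else alone."""
    header, sep, body = tag_text.partition(b'\n\n')
    fixed_header = DATELESS_TAGGER.sub(
        lambda m: m.group(1) + b' ' + when, header, count=1)
    return fixed_header + sep + body

# Placeholder tag content, shaped like `git cat-file -p <tag>` output.
broken = (
    b'object 0000000000000000000000000000000000000000\n'
    b'type commit\n'
    b'tag 1.0.0\n'
    b'tagger Gerrit Pape <pape@smarden.org>\n'
    b'\n'
    b'release 1.0.0\n'
)
# The timestamp is an arbitrary placeholder, not the real tag date.
fixed = add_tagger_date(broken, b'1000000000 +0000')
print(fixed.decode())
```

The result would still need to be written to tmptag and stored with the hash-object and update-ref commands shown above.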

larsks commented 1 year ago

Thanks for taking a look! I wasn't sure if this was corruption or if it reflected the behavior of git at some point in the distant past.

Since it's just a corrupt repository, I guess go ahead and close this issue (and the related PR).