gitpython-developers / GitPython

GitPython is a python library used to interact with Git repositories.
http://gitpython.readthedocs.org
BSD 3-Clause "New" or "Revised" License
4.61k stars 906 forks source link

GitPython can stage changes for commit that git won't commit #1185

Open LinuxMercedes opened 3 years ago

LinuxMercedes commented 3 years ago

Steps to reproduce:

  1. Check out the active branch.
  2. Make an IndexFile object
  3. Stage changes by calling .add() on that object
  4. Write that object's state using .write()
  5. Make a commit using the git binary on the system (instead of using .commit() on the IndexFile object)

Here is a minimal working example:

from git.repo import Repo

r = Repo.init('repo')

# Create an initial commit and active branch
with open('repo/test.txt', 'w+') as fh:
    fh.write("initial text\n")

r.index.add('test.txt')
r.index.commit('initial commit')

# Make some changes to commit
with open('repo/test.txt', 'a+') as fh:
    fh.write("next commit text\n")

# Bug doesn't occur if we don't do this
r.active_branch.checkout()  ## creates TREE extension data in the index

# Note the reference to r.index
our_index = r.index
our_index.add('test.txt')  ## implicit .write() call removes TREE extension from the index
our_index.write()  ## writes cached, invalid TREE extension data to the index

print("Status: should show changes to commit")
print(r.git.status())

r.git.commit(m='direct git call')  ## confused by improperly invalidated TREE extension data

print("Status: should show nothing to commit but doesn't")
print(r.git.status())

### At this point, there are changes staged for commit that will *not* be committed by `git commit` ###

# Direct use of r.index works, though:
# Both the .add and .write calls are necessary here
r.index.add('test.txt')  ## implicit .write() call removes invalid TREE extension data from the index
r.index.write()  ## re-creating the IndexFile object forgets the cached extension data, so .write() does not write it

r.git.commit(m='direct git call 2')  ## no confusing TREE extension data; everything's fine

print("Status: actually shows nothing to commit")
print(r.git.status())

I've reproduced this on both Python 3.9 and 3.8, git 2.30, gitpython 3.0.8 and 3.1.13 across several machines, both Windows and Linux.

It seems the only difference between the calls to r.git.commit is that r.index.add and r.index.write are calls made to two different IndexFile objects. Using the same object for both calls results in the changes not being committed.

Added twists:

This is truly bewildering to me. I'm happy to try to debug more, but any sort of speculation on where to start would be helpful!

LinuxMercedes commented 3 years ago

Further investigation shows the .git/index file written varies between the our_index and r.index methods. Dumped via xxd:

Method our_index, giving staged changes that are not committed:

00000000: 4449 5243 0000 0002 0000 0001 6039 a53e  DIRC........`9.>
00000010: 08a8 be94 6039 a53e 08a8 be94 0001 0305  ....`9.>........
00000020: 0000 0244 0000 81a4 0000 03e8 0000 03e8  ...D............
00000030: 0000 001e 8a6f 2818 3214 fa46 80e8 52e5  .....o(.2..F..R.
00000040: 3696 4f25 bb2b f66a 0008 7465 7374 2e74  6.O%.+.j..test.t
00000050: 7874 0000 5452 4545 0000 0019 0031 2030  xt..TREE.....1 0
00000060: 0ac7 f5fb 6398 be92 0b61 2035 af0d 7fb3  ....c....a 5....
00000070: bd00 d3e9 76c7 0f28 27fc 52b3 10b2 00aa  ....v..('.R.....
00000080: 69ee 212c 5ed0 87fb 1e                   i.!,^....

Method r.index, giving staged changes that are committed:

00000000: 4449 5243 0000 0002 0000 0001 6039 a57e  DIRC........`9.~
00000010: 0b83 1a79 6039 a57e 0b83 1a79 0001 0305  ...y`9.~...y....
00000020: 0000 0244 0000 81a4 0000 03e8 0000 03e8  ...D............
00000030: 0000 001e 8a6f 2818 3214 fa46 80e8 52e5  .....o(.2..F..R.
00000040: 3696 4f25 bb2b f66a 0008 7465 7374 2e74  6.O%.+.j..test.t
00000050: 7874 0000 2bd6 566b 6686 f31c e140 2558  xt..+.Vkf....@%X
00000060: 8d62 2144 092a ae24                      .b!D.*.$

(index-format docs)

The differences in bytes 0x0C – 0x1B is a difference in file modification times and doesn't seem to be important.

The critical difference is the insertion in the our_index index file at 0x54 – 0x64 (the start of the TREE). This section is a "Cached tree" extension entry. As far as I can tell, referencing this for help, the entry looks valid and contains the right tree hash:

$ git show HEAD
commit 6f75bc303bc051ccbf0ecc15027ec1830d7cae1a (HEAD -> trunk)
$ git cat-file -p 6f75bc303bc051ccbf0ecc15027ec1830d7cae1a
tree c7f5fb6398be920b612035af0d7fb3bd00d3e976

And, lo and behold, changing our_index.write() to our_index.write(ignore_extension_data=True) fixes my bug.

Byron commented 3 years ago

First off, let me thank you for sharing this detective story, I am amazed by the skill that was needed to figure this one out!

It looks like simply maintaining the extension as is isn't the right way of doing things, maybe because it tells git that nothing has changed by just looking at the last bytes of the index.

Probably it would have been wise to ignore extension data by default, but changing it now would be a breaking change so probably won't be doable either.

Maybe there can be a compromise and GitPython could detect the TREE extension and ignore that selectively. This should fix this kind of issue for everyone and save a lot fo time.

Do you think that could work or do you have other ideas? It would be great if this conversation could lead to a PR to improve the situation for everyone. Thanks so much!

LinuxMercedes commented 3 years ago

First, thank you for the kind comment — would that everyone put such thought into their replies!

I don't know much about git index extensions, so take these opinions with a grain of salt.

It seems reasonable to try to preserve the extension data more; right now index.add() by default removes all extension data from the index, which would mess up things, for example, if the repo were in the middle of resolving a merge conflict.

I agree that parsing the extension data and removing or invalidating the TREE extension when necessary is the right move. I've yet to find documentation on exactly when TREE data needs to get invalidated, but the cache tree unit tests have a bunch of examples on when things need to get invalidated.

At this point, there are two approaches that I think we could take: 1) Whenever we invalidate TREE data, just remove all the TREE data from the extension data. Easy and correct, but slow on big repositories — but exactly as slow as the way things are now. 2) Replicate the TREE invalidation algorithm and use the cache tree extension's invalidation features. A bit more work to get right, but should make things much faster.

In addition, we'd need to decide if, after we get invalidation working, we should have index.add() and friends default to writing extension data.

I'm happy to work on any of this!

(Also: I'm going to edit the minimal working example to annotate how each piece of it contributes to the weird behavior.)

LinuxMercedes commented 3 years ago

On further looking, it seems we might also need to:

1) Invalidate or remove the Untracked cache (UNTR). 2) Update the End of Index Entry (EOIE). 3) Do something with the Index Entry Offset Table (IEOT). Not sure what caches this tracks or if it's only used while git is in the middle of an operation (in which case I don't think we'd need to care about it too much).

Byron commented 3 years ago

My apologies for the late reply. Here is my commentary :D

Whenever we invalidate TREE data, just remove all the TREE data from the extension data. Easy and correct, but slow on big repositories — but exactly as slow as the way things are now.

Furthermore this looks like it would fix the bug seen if the index is cached with 'minimal' work required. This might be a good candidate for a first PR which should also be well testable with the current unit test capabilities.

Invalidate or remove the Untracked cache (UNTR).

That should be as simple as removing TREE, but I wonder if git would respond just as confused as in the example above. Without knowing that it might be better to not touch it.

Update the End of Index Entry (EOIE).

Yes, definitely!

Do something with the Index Entry Offset Table (IEOT). Not sure what caches this tracks or if it's only used while git is in the middle of an operation (in which case I don't think we'd need to care about it too much).

In order to enable multi-threaded parsing of the cache, I expect various blocks of offsets there which should match the amount of entries in the index. At first, this one should definitely be removed if the amount of entries changed. Initially it should be easiest to unconditionally remove it.

To me it looks like only the UNTR extension type may stay without altering/adjusting the other extensions upon editing the tree. This seems like it would be a quick win over the current representation and enough to close this issue.

In any case, I thank you for investigating this (I learned a lot), and will be happy to see any PR :).

LinuxMercedes commented 3 years ago

My thought for why we might need to worry about UNTR is if we add() a file that used to be untracked. (I think that's this codepath?)

Also, this comment seems to imply that we can drop any all-caps extension data, though I'd like to be a bit more polite about it when that's not too hard to do.

I'll get started on a PR shortly!

Byron commented 3 years ago

My thought for why we might need to worry about UNTR is if we add() a file that used to be untracked. (I think that's this codepath?)

That's true actually - I guess my hope was that with untracked files out of sync there may be no malfunction, but why risk it. It seems easy enough to drop data once a certain code path was encountered.

Also, this comment seems to imply that we can drop any all-caps extension data, though I'd like to be a bit more polite about it when that's not too hard to do.

I love this attitude! Mine seems a bit sloppy at times so I am happy to see this countered :).

I'll get started on a PR shortly!

🎉🙏