Open LinuxMercedes opened 3 years ago
Further investigation shows the .git/index
file written varies between the our_index
and r.index
methods. Dumped via xxd
:
Method our_index
, giving staged changes that are not committed:
00000000: 4449 5243 0000 0002 0000 0001 6039 a53e DIRC........`9.>
00000010: 08a8 be94 6039 a53e 08a8 be94 0001 0305 ....`9.>........
00000020: 0000 0244 0000 81a4 0000 03e8 0000 03e8 ...D............
00000030: 0000 001e 8a6f 2818 3214 fa46 80e8 52e5 .....o(.2..F..R.
00000040: 3696 4f25 bb2b f66a 0008 7465 7374 2e74 6.O%.+.j..test.t
00000050: 7874 0000 5452 4545 0000 0019 0031 2030 xt..TREE.....1 0
00000060: 0ac7 f5fb 6398 be92 0b61 2035 af0d 7fb3 ....c....a 5....
00000070: bd00 d3e9 76c7 0f28 27fc 52b3 10b2 00aa ....v..('.R.....
00000080: 69ee 212c 5ed0 87fb 1e i.!,^....
Method r.index
, giving staged changes that are committed:
00000000: 4449 5243 0000 0002 0000 0001 6039 a57e DIRC........`9.~
00000010: 0b83 1a79 6039 a57e 0b83 1a79 0001 0305 ...y`9.~...y....
00000020: 0000 0244 0000 81a4 0000 03e8 0000 03e8 ...D............
00000030: 0000 001e 8a6f 2818 3214 fa46 80e8 52e5 .....o(.2..F..R.
00000040: 3696 4f25 bb2b f66a 0008 7465 7374 2e74 6.O%.+.j..test.t
00000050: 7874 0000 2bd6 566b 6686 f31c e140 2558 xt..+.Vkf....@%X
00000060: 8d62 2144 092a ae24 .b!D.*.$
The differences in bytes 0x0C – 0x1B is a difference in file modification times and doesn't seem to be important.
The critical difference is the insertion in the our_index
index file at 0x54 – 0x64 (the start of the TREE
). This section is a "Cached tree" extension entry. As far as I can tell, referencing this for help, the entry looks valid and contains the right tree hash:
$ git show HEAD
commit 6f75bc303bc051ccbf0ecc15027ec1830d7cae1a (HEAD -> trunk)
$ git cat-file -p 6f75bc303bc051ccbf0ecc15027ec1830d7cae1a
tree c7f5fb6398be920b612035af0d7fb3bd00d3e976
And, lo and behold, changing our_index.write()
to our_index.write(ignore_extension_data=True)
fixes my bug.
First off, let me thank you for sharing this detective story, I am amazed by the skill that was needed to figure this one out!
It looks like simply maintaining the extension as is isn't the right way of doing things, maybe because it tells git
that nothing has changed by just looking at the last bytes of the index.
Probably it would have been wise to ignore extension data by default, but changing it now would be a breaking change so probably won't be doable either.
Maybe there can be a compromise and GitPython could detect the TREE
extension and ignore that selectively. This should fix this kind of issue for everyone and save a lot fo time.
Do you think that could work or do you have other ideas? It would be great if this conversation could lead to a PR to improve the situation for everyone. Thanks so much!
First, thank you for the kind comment — would that everyone put such thought into their replies!
I don't know much about git index extensions, so take these opinions with a grain of salt.
It seems reasonable to try to preserve the extension data more; right now index.add()
by default removes all extension data from the index, which would mess up things, for example, if the repo were in the middle of resolving a merge conflict.
I agree that parsing the extension data and removing or invalidating the TREE
extension when necessary is the right move. I've yet to find documentation on exactly when TREE
data needs to get invalidated, but the cache tree unit tests have a bunch of examples on when things need to get invalidated.
At this point, there are two approaches that I think we could take:
1) Whenever we invalidate TREE
data, just remove all the TREE
data from the extension data. Easy and correct, but slow on big repositories — but exactly as slow as the way things are now.
2) Replicate the TREE
invalidation algorithm and use the cache tree extension's invalidation features. A bit more work to get right, but should make things much faster.
In addition, we'd need to decide if, after we get invalidation working, we should have index.add()
and friends default to writing extension data.
I'm happy to work on any of this!
(Also: I'm going to edit the minimal working example to annotate how each piece of it contributes to the weird behavior.)
On further looking, it seems we might also need to:
1) Invalidate or remove the Untracked cache (UNTR
).
2) Update the End of Index Entry (EOIE
).
3) Do something with the Index Entry Offset Table (IEOT
). Not sure what caches this tracks or if it's only used while git
is in the middle of an operation (in which case I don't think we'd need to care about it too much).
My apologies for the late reply. Here is my commentary :D
Whenever we invalidate TREE data, just remove all the TREE data from the extension data. Easy and correct, but slow on big repositories — but exactly as slow as the way things are now.
Furthermore this looks like it would fix the bug seen if the index
is cached with 'minimal' work required. This might be a good candidate for a first PR which should also be well testable with the current unit test capabilities.
Invalidate or remove the Untracked cache (UNTR).
That should be as simple as removing TREE
, but I wonder if git
would respond just as confused as in the example above. Without knowing that it might be better to not touch it.
Update the End of Index Entry (EOIE).
Yes, definitely!
Do something with the Index Entry Offset Table (IEOT). Not sure what caches this tracks or if it's only used while git is in the middle of an operation (in which case I don't think we'd need to care about it too much).
In order to enable multi-threaded parsing of the cache, I expect various blocks of offsets there which should match the amount of entries in the index. At first, this one should definitely be removed if the amount of entries changed. Initially it should be easiest to unconditionally remove it.
To me it looks like only the UNTR
extension type may stay without altering/adjusting the other extensions upon editing the tree. This seems like it would be a quick win over the current representation and enough to close this issue.
In any case, I thank you for investigating this (I learned a lot), and will be happy to see any PR :).
My thought for why we might need to worry about UNTR
is if we add()
a file that used to be untracked. (I think that's this codepath?)
Also, this comment seems to imply that we can drop any all-caps extension data, though I'd like to be a bit more polite about it when that's not too hard to do.
I'll get started on a PR shortly!
My thought for why we might need to worry about UNTR is if we add() a file that used to be untracked. (I think that's this codepath?)
That's true actually - I guess my hope was that with untracked files out of sync there may be no malfunction, but why risk it. It seems easy enough to drop data once a certain code path was encountered.
Also, this comment seems to imply that we can drop any all-caps extension data, though I'd like to be a bit more polite about it when that's not too hard to do.
I love this attitude! Mine seems a bit sloppy at times so I am happy to see this countered :).
I'll get started on a PR shortly!
🎉🙏
Steps to reproduce:
IndexFile
object.add()
on that object.write()
git
binary on the system (instead of using.commit()
on theIndexFile
object)Here is a minimal working example:
I've reproduced this on both Python 3.9 and 3.8, git 2.30, gitpython 3.0.8 and 3.1.13 across several machines, both Windows and Linux.
It seems the only difference between the calls to
r.git.commit
is thatr.index.add
andr.index.write
are calls made to two differentIndexFile
objects. Using the same object for both calls results in the changes not being committed.Added twists:
our_index.write_tree()
and usegit commit-tree
, the changes staged get committed.our_index.commit()
, the changes staged get committed.r.active_branch.checkout()
, the changes staged get committed.This is truly bewildering to me. I'm happy to try to debug more, but any sort of speculation on where to start would be helpful!