git-for-windows / git

A fork of Git containing Windows-specific patches.
http://gitforwindows.org/
Other
8.33k stars 2.53k forks source link

Index File integrity fragile. Easily corrupted after unexpected shutdown/kernel panic. Lack of documentation. #1811

Closed ghost closed 4 years ago

ghost commented 6 years ago

Setup

$ git --version --build-options

git version 2.17.0.windows.1
cpu: x86_64
built from commit: e7621d891d081acff6acd1f0ba6ae0adce06dd09
sizeof-long: 4
Microsoft Windows 10.0.14393
Editor Option: Notepad++
Path Option: BashOnly
SSH Option: OpenSSH
CURL Option: OpenSSL
CRLF Option: CRLFAlways
Bash Terminal Option: MinTTY
Performance Tweaks FSCache: Enabled
Use Credential Manager: Enabled
Enable Symlinks: Disabled

Details

Bash
git rebase --interative
... kernel panic / BSOD
git rebase --abort

Explanation

The nature of the BSOD was more severe than usual, it had to do with storage devices, the very SSD which contained the OS and everything about Git. The BSODs were not related to git or any of it's components in any way.

After a reboot, and opening Bash, navigating to the same repository, it was still showing me being in the middle of rebase, I didn't check or ran any other command or anything else, I have simply out of precaution aborted the rebase immediately, after that the index errors appeared, however the index might have been corrupted before executing that command, I have no way to know that anymore.

While being in the middle of an interactive rebase, however, no obvious filesystem activity was happening, it was idle, there was no code documents opened, I was not actively editing files.

Kowledable people on IRC claim that index file is not critically important, even tho they say it should be an easy fix with "read-tree" I still wanted to dig deeper into this from my own curiosity at least but why not document my research on top of it here.

Every time the index file got corrupt, experienced it on Win7 as well as on Win10, which were on different storage devices, the name and the filesize of the index file remained the same, while contents were all simply filled with zeroes, NULNULNULNUL, I think this raises a question how this file is handled, if it was an unexpected shutdown wouldn't the contents simply be mangled instead of replaced by zeroes, it couldn't have had the time to write them? I suspect the file is being opened in a way and being left open, and gets written at a later stage, while the shutdown must have happened between that.

Some of the under the hood behavior

Contrary to what knowledgable and popular IRC people have been saying that mintty and bash has nothing to do with git, by looking at the process tree, everything about GitForWindows runs under a single git-bash.exe process, under it is mintty.exe and git.exe and other smaller utilites are all childrens.

This doesn't confirm one of the original theories, but at least maybe supports it (having Bash window opened at moment of BSOD), it could have affected and/or delayed the actual writing of the data to disk, the OS or the SSD holding it in it's cache.

Furthermore, I was told on IRC that git only touches the index file on 3 or 4 commands not including git status, I have done a scan with Process Monitor and found out even the commands that don't change the contents it, still rewrite it, I don't know how many as I mentioned I haven't done more testing yet, but this is what git status does (tested on an unaffected repo): gfw_issue_indexfilecorruption_procmon_renameindexlock_1sep2018

AFAIK it's a 1:1 copy, nothing was changed in practice. I have checked with hash and the before-after match, the file metadata for "date of modification" also stays the same as before, so from outside it gives a false appearance. The use of such .locks and rewriting may be perfectly normal for all I know may be used elsewhere by git when doing writing, but maybe there's some occasion where it may not work right when it comes to the index file, it may be handled a bit different than other stuff, I don't have an idea yet as I haven't done any more such testing due to lack of time, I've written this issue like a month after the whole thing began but most of the time I was troubleshooting the PC hardware/OS it self. It may be something the developers could look into.

The Win10 is configured to NOT auto-rebot after a - no power loss actually occured, but if it wasn't able to write any crashdumps (including minidumps), it probably couldn't flush any of the write caches as well due to the nature of the what looked like to be a hardware problem (I'm still not completely sure, I've cloned the same exact Win10 to another very similar SSD which is simply an 1 year earlier model of the same exact size and brand series, with all new sata cables, it is working so far, I was not able to recreate the issue after 3 weeks of troubleshooting, smart tests, combinations)

I hope people don't jump on this in a way that I somehow found the culprit, no, I have simply shown where the culprit could be in, none of these things could have had anything to do with it, but at least this may eventually lead to finding something in the code, or if that's all fine, then it's just some extra code to strengthen the integrity against such events, even if it's not not git fault at all,

At the end of the day the git index file might be a small and trivial thing, maybe this leads to something bigger in terms of data integrity robustness for overall git system, and I know there's other layers of securing the data, but here's the point, if git doesn't have actual rescue/repair tools to at least salvage things then even if you have enterprise data solutions, you'd still have to use the earlier snapshot which may not be most optimal, for example, even if you have data duplication, you'd still have a broken duplicate if it synced before you discovered the issue.

Justifications come from the reasons that this is not entertainment software, it's probably not justifiable in some kind of video game, but this is the basis of a lot of serious code, even if the index file is a small non-important thing, the mere interruption can cause delays in various businesses because of the time it may take to figure it out if it happens out of the blue and there's no mention of it in the docs.

Another is safety-critical projects, there will be more and more of these in the future, in an event of some data corruption it may delay the release of a vital patch for some real-life machinery in a wild coincidence, you have Intel developing a whole Linux distro just for safety-critical coding in mind.

One more reason it is justified to add official repairability/maintenance for this is because .git folder is by default hidden, making it harder for manual servicing for the average users (GUI) which would need additional step in some cases (deleting it, inspecting contents)

Lack of git documentation sorrounding this issue

There is no mention of index file corruption and how to repair/restore it in any of the documentation I have searched, I've used grep, correct me if I have missed it.

For example according to the git docs for fsck,

If no objects are given, git fsck defaults to using the index file

..etc.

There's many mentions of corruption of objects, but nothing about the index file. In the book v2 docs at https://git-scm.com/book/en/v2/Git-Internals-Maintenance-and-Data-Recovery there is again no mention of index file it self, only about objects in the index package

And about the "read-tree" command it is again not mentioning any repairing/rebuilding of the index file in case of corruption, the thing appears to be mainly for something else, not part of maintenance, and none of the suboptions have seem to have anything specifically to do with recovery of a broken index file specifically for that purpose. https://git-scm.com/docs/git-read-tree

Checklist

ghost commented 6 years ago

Update1:

git fsck --lost-found does not produce any lost-found when the NUL corrupt index is present, after checking it gives the usual index file error:

$ git fsck --lost-found Checking object directories: 100% (256/256), done. Checking objects: 100% (321716/321716), done. error: bad signature fatal: index file corrupt

It does produce some stuff if the index file doesn't exist (renamed)

I don't know whether the index file contents has any impact on such fsck actions, whether or not is it recovering something falsely which isn't a problem but the lack of the index file would confuse fsck to think it is?

Next, after removing the corrupt index file, running git reset, recreated the index and running git fsck --lost-found the output was the same as before, 17 dangling commits and 32 dangling blobs.

ghost commented 6 years ago

Update2:

It appears that the abortion of rebase wasn't fully successful as I have predicted, it took a hard reset and cleanup of the untracked files, it all took manual figuring out, no such detection in git exists AFAIK, I've also spent chatting with others which guided and explained some of the stuff along the way.

This is what I have previously pointed out, I do understand serious development projects have knowledable people dealing with this and it could be a piece of cake with little time lost.

It's about understanding as well, even if the solution was given and is trivial, it's not helpful if the user doesn't understand all the sorroundings of these actions and the consequences, depending on various circumstances, might take more or less, so in my case I'm not a git expert and wanted to be extra sure and also record all the steps so I write it all down into this report, it took me combined several days to deal with it (if I don't count the PC hardware trobleshooting).

I simply don't agree that the most computer hardware issues that could damage git files are to be treated as a niche that is okay to be burried in some book and community-knowhow and not be properly dealth with by git systems and documentation. Some kind of a CPU Core cache bug that just happened to mess up git is probably a lot less justifiable, than something as basic as BSOD/Power loss.

In addition, the git rebase it self and other git commands could learn the ability to use a backup index file instead, when the main one is corrupt, instead of brokenly aborting the rebase like it happend in this case.

More effectively, and before any other idea, everything in git that uses the index file or any such metadata files, should first check it's integrity before doing anything, that should be a global rule for all such commands, not just rebase IMO, in this case git rebase proceeded to do something under the hood, and only then reported the corrupt index file error, and I know it did something, because the indicators in Bash for being inside a rebase were undone, so visually it falsely appeared that things were back to normal.

I would have to recreate the whole BSOD scenario to see what happens under the hood with git rebase --abort, if I get the time I might just do that, with the developer option in Windows to manually trigger a BSOD or simply unplug the PC power cord.

ghost commented 6 years ago

Update3: Process Monitor shows bash.exe and it's git.exe (and others) children as it's own process, not under git-bash.exe or mintty.exe, same for Process Explorer now, I'm not sure why this has changed now I was using real and dry-run commands during the day researching and solving this index file issue, without closing git bash for many hours.

Update4: Further info to the cleanup from update 2, after doing a hard reset to the commit I wanted (last one I was already brokenly on) the untracked files that the rebase supposably put in were still showing up in git status the clean command that was suggested by the community git clean -dfX did not appear to do what I needed it to do, initially I was using --dry-run to determine this, it was removing things like .vcxproj.user and .vs/build/binary folders which is what I don't want, using the dry run through trial and error I've seen the git clean -dfe --dry-run do exactly what I wanted. However according to the docs the -e suboption needs a pattern but I haven't specified one. The user who suggested the -dfX command was claiming it's not valid and shouldn't work, the --dry-run output shows the things it would remove and they match exactly to what git status shows as untracked.

So there's the new issues right now:

So I would have to go actually dig into this with procmon again, ... I did a bit until I got more and more weird results with git clean ... until I noticed something

The command "git clean -dfe --dry-run" is hot, it runs for real, it already removed those untracked files/entries, I have simply not noticed with my eyes that it was saying "removing ...." and not "Would remove .." It should either really be a dry run or abort due to syntax error, right!?!

So disregard my previous mistaken findings including that "the -e command did it", probably it would have removed those untracked files without -e command, but no more testing with from me unless I purposelly recreate the conditions, which I don't think is necessary at this time, it's not the primary issue.

I've opened a separate issue on this. #1812

PhilipOakley commented 6 years ago

Hi @Zexaron, Yes, Git can be confusing, especially as it's method of operation is 'the Linux way', rather than the Windows way, so may things just feel 'wrong'. Plus the Git internals do appear to need a CS degree to get one's head around all the fancy tricks in place.

The "Index" is one of those things mentioned above. It is an efficiency feature, rather than a core part of the data storage (the repository).

As I am sure you are aware Git uses a Directed Acyclic Graph (DAG) (a fancy 'tree' structure), so as long as the tips of the branches are known, you can follow the linked list that forms the commit graph/tree to see all your commit history, and each commit has a tree of the file meta data and their contents (blobs) taken at that time. None of that needs the Index.

During a rebase, Git creates a set of patches equivalent to the changes in each commit snapshot, and then re-applies those patches to (on top of) the new location. As the patches are applied, the updated branch pointer (the sha1/oid) is updated each time (IIRC) and stored in the .git directory as file .git/REBASE_HEAD and/or .git/rebase-merge/done (e.g. see https://github.com/Microsoft/vscode/issues/57299 and https://stackoverflow.com/questions/3921409/how-to-know-if-there-is-a-git-rebase-in-progress, randomly found by a quick search knowing what to look for..).

It is the presence IIRC of the .git/REBASE_HEAD that determines if Git thinks a rebase is in progress.

Also Git for Windows has to pull a few tricks to to make it look like it is both a bash terminal, and also run the Git program (which is of a run and finish style). Explaining all those bits about the git.exe is probably too much for a quick reply.

Hope that helps

drizzd commented 6 years ago

Note also that the "rebase in progress" message does not indicate that there is some process running in the background. On the contrary, it describes the situation that Git has paused the rebase and is waiting for user interaction. Aborting a rebase just happens to result in a similar state.

PhilipOakley commented 6 years ago

@Zexaron Can this be closed now?

And if not, could you summarise the outstanding issue. (s)

ghost commented 6 years ago

@PhilipOakley Just a few days ago someone on git IRC mentioned he had some corruption of some other ref files when the computer locked up and had to do a hard reset, the files were empty (which is very similar to this) ... he wasn't on Windows platform apparently.

What if some of the GTF components are holding on the Win OS write cache, if there's no git cache, or what, from flushing out, immediately. If git Bash is sitting idle for like more than 10 seconds it should have flushed and closed down FS stuff completely.

And again the system OS including git and git repo are all on a SSD, there's less reasons to hold on the cache or whatever it's doing, it's not like the SSD is saturated by some other process either. With a SSD it's far less likely that cache flush would be interrupted

Or, the file is not properly closed and in the BSOD event maybe some kind of command is sent and the disk firmware might have some kind of emergency actions or gets confused and simply writes NULs, what on earth would have caused a write 750 KB writes of zeroes ? it has to be some kind of deliberate action or the fact the file was was already nulled out before BSOD, and GTF loaded actual contents into RAM/Cache and had nulls temporairly in FS, I doubt it's doing that, it must be something lower level.

I open git files up regularly due to just my interest and learning git, while I have Bash open, I've never saw any contents missing like that during use.

This happened the same on multiple SSDs, and multiple OS, Win10 and Win7 which were on different SSDs. There's no bad block on the SSDs, no other OS filesystem issues whatsoever.

I had so many hard reboots, while playing games, using browsers, in lifetime, and such things simply don't happen, there wasn't a case when some kind of config file for a browser would just be empty or nulled out. Then my games drive would have to be half corrupt by now, but it isn't. How many times games freeze and a reboot is necessary, and it still doesn't do such corruption.

This issue just might be so elusive, I'm hesitant to call it resolved but if there's no progress I may be closed, however, while I was busy, I did actually plan to try to recreate the scenarios on purpose and try to see in more detail what's going on.

You might just keep it open for a while, as I might just ... recreate something, ofcourse if it turns out it's a global git issue then it would be closed and bug report opened on the git-scm mailing list AFAIK.

EDIT: I never had core.fscache enabled, unless it's enabled by default on ... i've updated to git 2.19 a few days ago, from 2.17 - I don't remember seeing a prompt for cache in the installer on this occasion, but I do remember the installer did mention this in the past maybe or I'm imagining things.

PhilipOakley commented 6 years ago

@Zexaron sorry to sound dumb, but what is the acronym 'GTF' in this case?

In terms of potential file errors, the Linux approach does memory map the file, and then later write a fresh temporary version before doing a 'swap in place', so if thinks go wrong the old file should still be available, or the new file be right there. At least that is my understanding.

Regarding the core.fscache, this is detected by probing the FS at install time and is set to the dominant system FS (typ C: on Windows ;-), so you may find that you have it enabled. (goes and looks at Issue reporting template at top): Performance Tweaks FSCache: Enabled So yes, you have it enabled ;-)

We can keep the Issue open for a little longer, if you report back as to any progress.

ghost commented 5 years ago

Oh sorry I meant GFW as in Git For Windows ... but that already clashes with Games For Windows anyway.

I'll try to recreate the scenario approximately and report back.

dscho commented 5 years ago

I'll try to recreate the scenario approximately and report back.

Anything to report yet?

dscho commented 5 years ago

Or maybe now?