msysgit / git

msysGit-based Git for Windows 1.x is now superseded by Git for Windows 2.x
http://github.com/git-for-windows/git
Other
1.01k stars 317 forks source link

Git does not see that file in working directory differs from HEAD #312

Closed j66st closed 9 years ago

j66st commented 9 years ago

I found a strange problem in msysGit, I am wondering if it's a bug. I already discussed it here: https://groups.google.com/forum/#!topic/msysgit/6XLoSPH26kc You can download my Demo.zip attachment there. It seems not to happen in Git for Linux or OS-X. The problem occurs in msysGit 1.9.0 and 1.9.5 (running x86 version on a Windows 7 x64 system).

Essentially, the problem seems that Git for Windows assumes that two files are the same when both the timestamp and the file size match. Obviously the file contents is not inspected nor the hash recalculated.

As mentioned I created a minimal demo package to easily prove the issue. Simply download the zip file, put it in a clean directory and run the enclosed script from bash. Below is a transcript of what happens when I run the script in my situation.

The issue is a real show-stopper in my automated migration from Visual SourceSafe to Git (in a test transferring 20,000 source files the problem caused loss of roughly 0.1% of the files from the history), so I hope for a quick solution.

Welcome to Git (version 1.9.5-preview20141217)

Run 'git help git' to display the help index.
Run 'git help <command>' to display help for specific commands.

joostadm@MUPC20 ~
$ cd /d/proj/try/git

joostadm@MUPC20 /d/proj/try/git
$ ls -l
total 15
-rw-r--r--    1 joostadm Administ    26968 Jan 23 18:43 DixiLink2.zip
-rwxr-xr-x    1 joostadm Administ     1848 Feb  5 14:05 weird-git-demo

joostadm@MUPC20 /d/proj/try/git
$ weird-git-demo
Note: No argument supplied, using DemoRepo by default.
+ mkdir DemoRepo
+ cd DemoRepo
+ unzip ../DixiLink2.zip
Archive:  ../DixiLink2.zip
   creating: .git/
 extracting: .git/COMMIT_EDITMSG
  inflating: .git/config
  inflating: .git/description
  inflating: .git/gitk.cache
 extracting: .git/HEAD
   creating: .git/hooks/
  inflating: .git/hooks/applypatch-msg.sample
  inflating: .git/hooks/commit-msg.sample
  inflating: .git/hooks/post-update.sample
  inflating: .git/hooks/pre-applypatch.sample
  inflating: .git/hooks/pre-commit.sample
  inflating: .git/hooks/pre-push.sample
  inflating: .git/hooks/pre-rebase.sample
  inflating: .git/hooks/prepare-commit-msg.sample
  inflating: .git/hooks/update.sample
  inflating: .git/index
   creating: .git/info/
  inflating: .git/info/exclude
   creating: .git/logs/
  inflating: .git/logs/HEAD
   creating: .git/logs/refs/
   creating: .git/logs/refs/heads/
  inflating: .git/logs/refs/heads/master
   creating: .git/objects/
   creating: .git/objects/29/
 extracting: .git/objects/29/c51bbb9ada43dbe98cbd5dbedbab56586b24c3
   creating: .git/objects/68/
 extracting: .git/objects/68/b1f129eb06b91fc6c9a3885fc0d24ad0cdaa50
   creating: .git/objects/7e/
 extracting: .git/objects/7e/8d94849a38370141b3ca2aab5c3cadc27934da
   creating: .git/objects/cf/
 extracting: .git/objects/cf/13dbb766d4aff04ad51d3cdac84fda67dc6f50
   creating: .git/objects/da/
 extracting: .git/objects/da/eca6b50a218a78e4cca5566965381e2d384b7f
   creating: .git/objects/de/
 extracting: .git/objects/de/3643596c73be4f6112d027616c8df31acd1b09
   creating: .git/objects/ec/
 extracting: .git/objects/ec/88fb155c5e9ebd8b3ead39345618517c0af6cf
   creating: .git/objects/info/
   creating: .git/objects/pack/
   creating: .git/refs/
   creating: .git/refs/heads/
 extracting: .git/refs/heads/master
   creating: .git/refs/tags/
 extracting: .git/refs/tags/2011-11-24
  inflating: dixilinkerr.h
+ echo 'This repository now contains version 1 and 2 of dixilinkerr.h:'
This repository now contains version 1 and 2 of dixilinkerr.h:
+ git log
commit cf13dbb766d4aff04ad51d3cdac84fda67dc6f50
Author: Joost <joost@localhost>
Date:   Tue Sep 11 09:57:32 2007 +0000

    @VSS 11-09-2007 11:08:35 [Edit] dixilinkerr.h

commit de3643596c73be4f6112d027616c8df31acd1b09
Author: Joost <joost@localhost>
Date:   Fri Jul 6 12:27:57 2007 +0000

    @VSS 01-06-2007 01:54:53 [Create] dixilinkerr.h
+ echo 'Version 3 is in our working directory:'
Version 3 is in our working directory:
+ head -n 12 dixilinkerr.h
// DixiLinkErr.h

#pragma once

#ifndef EXCPCAT_DIXILINK
#define EXCPCAT_DIXILINK 2000
#endif

#ifndef __DixiLinkErr_H_INCLUDED__
#define __DixiLinkErr_H_INCLUDED__

#ifndef IDL_ENUM
+ echo 'The file in our HEAD'
The file in our HEAD
+ git show HEAD:dixilinkerr.h
+ head -n 12
// DixiLinkErr.h

#pragma once

#ifndef CAT_DIXILINK_ERR
#define CAT_DIXILINK_ERR 2000
#endif

#ifndef __DixiLinkErr_H_INCLUDED__
#define __DixiLinkErr_H_INCLUDED__

#ifndef IDL_ENUM
+ echo 'You see? Working copy differs from HEAD.'
You see? Working copy differs from HEAD.
+ echo 'So the working directory is dirty, right? Ask Git:'
So the working directory is dirty, right? Ask Git:
+ git status
On branch master
nothing to commit, working directory clean
+ git diff
+ echo 'In my situation, Git sees no difference here, I THINK THIS IS WRONG!'
In my situation, Git sees no difference here, I THINK THIS IS WRONG!
+ echo 'So we are unable to add version 3 of our file. Let'\''s try it once again:'
So we are unable to add version 3 of our file. Let's try it once again:
+ git add dixilinkerr.h
+ git status
On branch master
nothing to commit, working directory clean
+ echo 'In my situation at this point there is nothing to add or commit.'
In my situation at this point there is nothing to add or commit.
+ echo 'End of demo'
End of demo

joostadm@MUPC20 /d/proj/try/git
$ 
sschuberth commented 9 years ago

I just manually unzipped your DixiLink2.zip and opened a Git Bash in the created directory:

Welcome to Git (version 1.9.5-preview20141217)

Run 'git help git' to display the help index.
Run 'git help <command>' to display help for specific commands.

~/Downloads/Demo/DixiLink2 master
$ git st
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   dixilinkerr.h

no changes added to commit (use "git add" and/or "git commit -a")

~/Downloads/Demo/DixiLink2 master
$ git diff
diff --git a/dixilinkerr.h b/dixilinkerr.h
index daeca6b..b5edf0e 100644
--- a/dixilinkerr.h
+++ b/dixilinkerr.h
@@ -2,8 +2,8 @@

 #pragma once

-#ifndef CAT_DIXILINK_ERR
-#define CAT_DIXILINK_ERR 2000
+#ifndef EXCPCAT_DIXILINK
+#define EXCPCAT_DIXILINK 2000
 #endif

 #ifndef __DixiLinkErr_H_INCLUDED__
@@ -16,7 +16,7 @@
 #ifndef BASE_DIXILINK_ERR
        #define BASE_DIXILINK_ERR 0
        #ifdef USE_DEFAULT_CAT
-               #define BASE_DIXILINK_ERR CAT_DIXILINK_ERR
+               #define BASE_DIXILINK_ERR EXCPCAT_DIXILINK
        #endif
 #endif

As you can see, the file shows up as modified for me.

EDIT: But interestingly, I can reproduce the issue with your script. I've compared the full directory contents from manually unzipping DixiLink2.zip vs. your script, and although they look equal, the Git statuses differ. Indeed odd.

j66st commented 9 years ago

I think it depends on which unzip tool you use. If possible, please try again unpacking the zip file with Info-ZIP v3.0. What's relevant here is that the timestamps of the files are preserved. We could reproduce the problem on more than one PC. The problem disappears when touching the unpacked file.

sschuberth commented 9 years ago

I noticed there is a off-by-one difference in the seconds of the creation time when the archive is unpacked via "unzip" vs "WinRAR". That is one thing.

On the other hand, the issue disappears after deleting .git/index and git add dixilinkerr.h. I'm not sure about the index file format, but if it contains the timestamp of dirty files, and "unzip" restores the wrong time, Git might get confused.

The fact that the issue disappears when using "WinRAR" instead of "unzip" makes me believe that the bug is rather in Unzip than in Git.

Did this issue already show up in the original repo before zipping / unzipping it?

j66st commented 9 years ago

Yes the issue showed up in the original repo. (And of course also when pulling the repo directory over to another PC over the LAN.) I am migrating Visual SourceSafe repositories using the tool vss2git (http://code.google.com/p/vss2git/, it is a .NET tool, running on Windows, it uses git.exe to add files to the git repo, so no black magic). In a test moving 20,000 source files I lost 0.1% of the files. When inspecting the log I found out that the files missing were those of which the last version has the same file size as the previous version. In those cases the git add command failed. EDIT: So no ZIP involved here.

j66st commented 9 years ago

I think Git should not cache file timestamps, or at least there should be an option to turn it off.

dscho commented 9 years ago

I think Git should not cache file timestamps, or at least there should be an option to turn it off.

Git is very conservative when looking at timestamps (it does not even trust the index if it has the very same timestamp as the file it is looking at), so there must be something else going wrong.

sschuberth commented 9 years ago

I also always thought that Git is only tracking file contents, and nothing else. But meanwhile Hannes replied to the corresponding mailing list thread saying something different ...

j66st commented 9 years ago

Now the fog is clearing. In my case the vss2git visits every file in VSS, rewinds the deltas until the origin, then starts replaying the history and adds and commits every version to git. It will do so very quickly. Knowing that the precision of timestamps on some Windows filesystems (FAT for example) is 1 or 2 seconds, and knowing that some files (like icon bitmaps) don't change size in their history it is hard to avoid clashes without inspecting file contents. You can expect other problems because of the different ways timezones and daylight saving time are handled across filesystems.

See my latest post on the https://groups.google.com/forum/#!topic/msysgit/6XLoSPH26kc thread.

So please, dear developers, please let Git to only track file contents (reliably tracked by SHA-1 hashes) and nothing else. If there is a performance penalty, so be it. Any shortcut that comes with a reliability penalty should have a switch to turn it off!

I am not familiar with the internals of the Git index to really understand what is hidden there, and why the file content/hash inspection has been bypassed. But I know the basics of both the Windows and Unix filesystems, so I will follow any discussion here about the details with interest. I am willing to test any fix; the problem is fully reproducible here.

YueLinHo commented 9 years ago

Looks like the old "racy git" problem???

Failed workaround: (as this said) debug3

Is it possible that implement inode via Windows MFT?

t-b commented 9 years ago

The screen short from @YueLinHo shows read-cache.c for all you don't know git's code by heart. @j66st I'm not sure that is_racy_timestamp is really the problem here. But you can try that happens if you change the function to always return 1?

Regarding with the nanoseconds timestamps. On Windows we have timestamps with at most 100ns resolution according to https://msdn.microsoft.com/en-us/library/windows/desktop/ms724290%28v=vs.85%29.aspx.

If I understand the code correctly we also do not use USE_NSEC in msysgit/git-for-windows-sdk.

YueLinHo commented 9 years ago

@j66st I suppose there is a workaround for your vss2git that insert a command git reset --mixed after checkout from VSS and before git commit.

@t-b I guess the root cause is that Git for Windows lacks some file system features. And the workaround for racy git problem is failed, too. so issue here.

See basic logic match_stat_data() first.

Git for Windows has its own lstat() and st_ino, st_gid, st_uid are always 0. (see do_lstat()) (So, I associate the st_ino with MFT file. (but it is NTFS specific.) :P ) And the USE_NSEC is disabled by default. I suppose it is a good idea to enable USE_NSEC. (maybe need some fix about it) ^_^

And I also recall the issue #229, not sure if it is the same root cause.

BTW, I guess such problems will come about often, because of SSD.

YueLinHo commented 9 years ago

if is_racy_timestamp() always return 1, then ce_modified_check_fs() might be called. (see here) And it check file context and calculate the sha-1 when the ce_compare_data() is called. (see here and here)

kblees commented 9 years ago

In your Demo.zip, the dixilinkerr.h file has a modification time of 2009-09-17, and the .git/index file has a modification time of 2015-01-23. Yet you say that the file was changed after committing version 2. So it seems your VSS conversion tool messes up modification timestamps. Make sure it doesn't do that and everything should work as expected.

Btw. I don't think this is a Windows-specific problem (or even Git-specific). Playing around with file times will fool Git on Linux as well (as outlined in racy-git.txt) as well as most backup software.

YueLinHo commented 9 years ago

@kblees Thank you. That makes me think more deeply.

j66st commented 9 years ago

@t-b: Re 100 ns timestamp resolution: that's only in theory. The clock ticks at a slower pace and a file system like FAT can only store seconds. Re changing code: I don't have an environment right now to rebuild git for Windows, nor the time to dive into the code, so I leave that to others. @kblees Indeed I guess that the problem could be caused by extracting files from VSS will restore their original mtime (but don't call that "messing up"). Restore of a backup would do so too. But Git seems not prepared to that: the file in the working directory will be older than Git's index file, probably that's the reason Git looks no further and assumes that the file is stale. I wish there would be an option to turn off this reasoning; I want a command switch (and a config setting) to bypass this caching and force Git to not use the mtime as a factor in determining file change.

I really wished that Git could preserve timestamps too! I'm so used to seeing a file's age in the directory listing and being able to sort out which files were changed this week, etc.. Everyone moving from VSS to Git (or TFS, for that matter) is complaining about the loss of this information. I changed the vss2git tool to keep this information at least in the commit message. So I'm not "playing around" with mtimes, that's what Git is doing. I consider them valuable. Backup software ought to keep an index for itself which files it saved last time in order to do incrementals. Otherwise files copied from another medium would never be backed up.

@YueLinHo Using git reset --mixed after extracting the file and before the git add is not a workaround. I tried it, Git still does not see the file as changed. But just touching the file would be sufficient, I guess. So that the working copy is never older than Git's index.

sschuberth commented 9 years ago

I don't have an environment right now to rebuild git for Windows, nor the time to dive into the code, so I leave that to others.

At least the rebuilding part we have made pretty easy by now. Simply download the latest snapshot of the Git for Windows SDK, run the "Git Development Environment" from the Start Menu, and do cd git && make.

I really wished that Git could preserve timestamps too!

For reference, here's the FAQ entry why it does not. However, I agree there could be an option to still do so.

YueLinHo commented 9 years ago

@j66st Can vss2git keep the changed file list for one commit? If it does not, once lots files in working tree, touching all files is not that smart. So, could you please try the following 2 ways after vss checkout?

  1. git rm --cached -r && git reset --mixed
  2. delete the .git/index, then git reset --mixed
dscho commented 9 years ago

I don't have an environment right now to rebuild git for Windows, nor the time to dive into the code, so I leave that to others.

I hope you understand that a statement like this seems very arrogant because you expect others to spend time helping you but do not see fit to set aside time yourself.

I really wished that Git could preserve timestamps too!

The mtime of a file is the time when it was last modified in this particular spot on the file system. That is why a copy of a file dated January 1st 1980 will get a new time stamp, even if the content technically is really that old.

That is the way Git treats timestamps.

You probably can convince VSS to update files' mtime to the appropriate value, too, i.e. reflecting the time when it wrote the file (as opposed to the time when the file with those contents entered the repository) and that should fix your woes.

In any case, it is well established by now that the problem is not a bug in Git for Windows, it does exactly what it is designed to do. Therefore I close this ticket.

j66st commented 9 years ago

@dscho: Well, I am just getting my feet wet as a Git user, I think it is not fair to expect me to dive into this huge codebase and start tinkering with the heart of the program. At least it would cost me days, if not weeks to understand the code that contains so mush wisdom of many smart people. I already spent days modding and debugging vss2git to find out the reason for this strange behaviour and spent another day setting up a reproducible testcase. After discussing this issue I was supposing this was a serious flaw. Even long-time Git users are surprised. To quote the reaction of your co-"owner" sschuberth on the google group:

Wow, that was totally new to be and sort of destroys my world picture of Git ... I always thought Git is tracking file contents, and nothing else. (Pasted from https://groups.google.com/forum/#!topic/msysgit/6XLoSPH26kc )

These words also reflect my opinion. So I thought this was something to be fixed urgently and that it would be wise to humbly leave this work to the developers who are fully familiar with the codebase and who must be able to fix this in hours, rather than weeks. I am willing to discuss the subject, to assist with my knowledge of filesystems and VSS, and to assist in testing, that's all that would make sense for now, I guess.

Anyway, glad that I did not yet invest more time in coding a solution. I don't know about the hierarchy of the people contributing in this project, but you must be the authority here if you can decide on your own that this strange behavior is by design and does not need further action. So if I would develop a fix it wouldn't get your approval anyway.

Now I know enough to find a suitable workaround to redo my migration project. But I think the least one could do then is adding a section to the FAQ and to the manpages to explain and warn for this surprising behavior as shown in my test script. Maybe I should spend time to do that and add my demo-case as an example.

I truly respect people investing so much time, energy and knowledge in open source projects like this, and I apologize if anyone feels insulted by my posts, it was never my intention. It is up to the readers following this discussion to judge for themselves who seems arrogant here.

j66st commented 9 years ago

@sschuberth :

At least the rebuilding part we have made pretty easy by now. Simply download the latest snapshot of the Git for Windows SDK, run the "Git Development Environment" from the Start Menu, and do cd git && make.

Thanks for your good work! I am MS Visual Studio developer, I would have to set up a Gnu C build environment first, and get used to it, at this moment I don't have time for that, already lost too much time with this vss2git migration, and the show must go on. I hope, some day ...

I really wished that Git could preserve timestamps too!

For reference, here's the FAQ entry why it does not. However, I agree there could be an option to still do so.

I know the reasons. To accommodate those people who are too ignorant or too lazy to add two characters to their make command to force a rebuild-all whenever they import other people's code :-) At the cost of losing valuable file timestamps. Fine with me, we can agree to disagree here. I spent considerable time to have vss2git save the timestamps in the commit messages, I think it was worth it.

sschuberth commented 9 years ago

I am MS Visual Studio developer, I would have to set up a Gnu C build environment first

No you don't. That's exactly what the SDK installer does for you, it's self-contained.

j66st commented 9 years ago

@YueLinHo :

Can vss2git keep the changed file list for one commit? If it does not, once lots files in working tree, touching all files is not that smart.

Why? Touching the files will only refresh the mtime in the directory, so not an expensive operation. Vss2git bundles files having mtimes within a certain time interval like 2 minutes, so normally only a dozen files or so per commit. And vss2git is a batch process, it can run overnight, I don't care about a few minutes more.

So, could you please try the following 2 ways after vss checkout?

  1. git rm --cached -r && git reset --mixed

Yes, this works (just tried it with my Demo-setup, you can also try it yourself if you like):

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git st
On branch master
nothing to commit, working directory clean

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git rm --cached -r dixilinkerr.h
rm 'dixilinkerr.h'

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ ls -l
total 3
-rw-r--r--    1 joostadm Administ     5986 Sep 17  2009 dixilinkerr.h

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git st
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        deleted:    dixilinkerr.h

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        dixilinkerr.h

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git reset --mixed
Unstaged changes after reset:
M       dixilinkerr.h

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git st
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   dixilinkerr.h

no changes added to commit (use "git add" and/or "git commit -a")

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$
  1. delete the .git/index, then git reset --mixed

Yes, that is also a viable solution (didn't know that it is fine to delete the index file).

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git st
On branch master
nothing to commit, working directory clean

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ ls -l
total 3
-rw-r--r--    1 joostadm Administ     5986 Sep 17  2009 dixilinkerr.h

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ rm .git/index

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git st
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        deleted:    dixilinkerr.h

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        dixilinkerr.h

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git reset --mixed
Unstaged changes after reset:
M       dixilinkerr.h

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git st
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   dixilinkerr.h

no changes added to commit (use "git add" and/or "git commit -a")

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$

Your workarounds will also cost some time (more than touch, I guess). Anyway, I only need to apply this workaround in those cases that git add fails, so no problem.

Thanks for your ideas!

YueLinHo commented 9 years ago

Can vss2git keep the changed file list for one commit? If it does not, once lots files in working tree, touching all files is not that smart.

Why?

May I take it back? :)

Touching the files will only refresh the mtime in the directory, so not an expensive operation. Vss2git bundles files having mtimes within a certain time interval like 2 minutes, so normally only a dozen files or so per commit. And vss2git is a batch process, it can run overnight, I don't care about a few minutes more.

Your workarounds will also cost some time (more than touch, I guess).

Agree that.

I only need to apply this workaround in those cases that git add fails

What if there are 2 changed files in one commit, A and B, and git status only show B is modified? Then, git add will be not failed, and A is not commit. I think the better workaround is using the "Set File Time" of VSS's checkout to "Current" (Not "last file modification", nor "last file check-in".), IIRC.

Thanks for your ideas!

Yor're welcome! And thanks for your report, because I got another chance to study index file format and relevent code that I want to do about 2 months ago. ^_^

BTW, I am a git user, too.

dscho commented 9 years ago

Even long-time Git users are surprised.

I fear you misunderstood.

Git – as Visual Studio, or make – needs to figure out which files have changed since the last time it saw them. The reason for Git – as for Visual Studio, or make – is that it would be prohibitively costly to just handle all the files again. Visual Studio would have to make a full build otherwise, even if all you did was to fix a typo. And Git would have to re-checksum all the files again – a really wasteful way to handle your files.

That is why Visual Studio, and make, and Git, and everybody else relies on the mtime as a means to say "this is when the file was last modified here".

By the way, that is exactly the same as with backup systems, because backup systems are not intended to synchronize data between two locations. They are intended to reconstruct the exact state of a given working directory as on that machine. That is why they preserve mtime, so that tools like Visual Studio, make – or Git – can say: "Oh, I see, git.c is older than git.o, I guess I do not have to rebuild it".

It is important to realize, and I encourage you to wrap your head around that concept, too, that the time when a given file revision entered the repository is a very different beast from the mtime of the file of the reconstructed revision.

The puzzlement you saw was of some developers who thought that Git would do something different than what the tools expect, namely to play games with the mtime (i.e. change it to something that does not reflect when that file, in that location, was last modified).

But that puzzlement went away when it was clarified that Git does exactly what it is expected to: record source code revisions (and attach a timestamp to the entire revision instead of recording individual mtimes which it is not supposed to).

Can vss2git keep the changed file list for one commit? If it does not, once lots files in working tree, touching all files is not that smart.

Why? Touching the files will only refresh the mtime in the directory, so not an expensive operation.

No, touching is not. But the consequences are. You basically add a lot of churn for the tools that are now fooled into believing that they did not see the most recent contents of the files. You will cause complete builds – which can easily last several hours for complex C++ projects, something that you should not want to cause – or in the case of Git a full reindexing – which can still last several minutes if you have a large project, and doing so for every single revision you want to import will unduly slow down the entire process.

Well, I am just getting my feet wet as a Git user, I think it is not fair to expect me to dive into this huge codebase and start tinkering with the heart of the program. At least it would cost me days, if not weeks to understand the code that contains so mush wisdom of many smart people.

Well, I, in contrast, think it unrealistic to expect others who do not share your particular problem to solve it for you, in particular when the faulty bit – the wrong mtime modification of VSS – has been identified already and you have everything you need in hand to solve the problem yourself.

And I need to protect the time of the Git for Windows developers: we have a couple of challenges on hand that we need to solve to make Git for Windows a better user experience (and fixing vss2git is not our concern, so it is also a bit unfair to distract us from our very own problems, unless you help us with our problems in return, something that you indicated to be unlikely because of your lack of knowledge of the code base that you did not intend to change). There has been substantial help you received from developers who could not spend that time on Git for Windows as a consequence (and the solution to this problem was outlined as much as people not using VSS could), therefore I closed the issue.

I encourage you to heed the advice – ask VSS to give you a list of files it updated, then touch that set of files, in vss2git – and solve your problem. I am sure you will agree that no change in Git, let alone Git for Windows, is required. If you disagree, you should take the discussion to the Git project proper: git@vger.kernel.org.

In closing: you are welcome for all the help you received.

j66st commented 9 years ago

unrealistic to expect others who do not share your particular problem to solve it for you

fixing vss2git is not our concern

If I only came here to have a problem with vss2git fixed, you would be right. But I think you are missing my point.

To summarize: I discovered failures of vss2git to transfer certain files to Git. That boiled down to git refusing to add a file that was visibly different from the latest revision in the repository. I was surprised to see this happen and after seeing that this only happens if the file size matches, I suspected some kind of bypassing reality for performance reasons. I could not believe this, since Git is designed to be tracking contents and nothing else. So I decided to file an issue here. Apparently Git is making assumptions that are not always valid, especially in a Windows environment. It draws a conclusion that a file is unchanged while it is different. That's bad. The excuse might be performance gain, but the price is reliability loss. Every git user should have the option to choose his preference for reliability over speed. I was stunned to see this issue being interpreted as a vss2git-only problem, and being closed by the product owner without further discussion.

Git developers tend to think from a Unix/POSIX/Linux perspective, that's why the index reflects the Unix file attributes, of which most components do not even exist on Windows. They treated mtimes the way they are commonly used in Unix. But this differs from the Windows environment, where the modification time of a file has traditionally a different meaning. That is easily shown by looking at the basic copy commands: Unix' cp touches the mdate; Windows' COPY or XCOPY does not! This behavior is reflected by many file management tools in each of these environments (file managers, archiving tools, version control systems, backup programs etc.). Both approaches have their advantages. (I think that historically the meaning of ctime and mtime has been confused, but that's another discussion.) You cannot deny this different approach, and you cannot force your Unix view onto the Windows world. So when porting Git to Windows I think you should accommodate this different mindset (the same way that the slash/backslash and LF/CRLF differences must be properly handled).

I encourage you to wrap your head around that concept

I don't need an argumentation how mdates affect build tooling. I am an experienced C++ developer, I grew up in Unix before Visual Studio even existed. There are different workflows possible, and in large projects you always will have to decide intelligently whether to force a full rebuild. Many Windows developers consider the modification time of a file as an important attribute to see the age of the code file. (Yes, I know this information would be better stored in the Creation time field, but it is not. And I also know enough arguments why the mtime often won't reflect file age. But many developers use it that way, you can't deny that.) In our practice, the combination of filename+mtime+size is a pretty accurate way to assume that a file is unchanged without checking its contents. But not a guarantee! Similar to the inode that is used in Unix.

So, again, this is not a vss2git-only problem! The assumption that equal filesize and mtime means that a file is unaltered is going to cause problems under Windows (more likely than under Unix). I cannot estimate how often it will occur in the future, but the hard fact is that Git sometimes ignore diffs and that this may cause loss of files.

I am having a hard time "selling" Git to my team who are used to VSS, because (a) from now on they will lose their valuable mtime information, and (b) since yesterday they don't trust that Git will take good care of their data!

So I think that there should be a command line option (and a config setting) to opt for reliability over speed, which can force file inspection and rehashing over mtime-based assumptions. I don't think I am the only one I would welcome a discussion about this among Windows-hosted developers.

I think it does not cost much time to realize this. It may be as simple as just forcing a function to return TRUE when told so by a config flag. If I could complete it myself within a day I would do so and offer it to the community. (I am also willing to offer a day of my time in exchange for someone else doing it.)

And Git would have to re-checksum all the files again – a really wasteful way to handle your files.

Git would need to do so ONLY if size and mtime match, just to be sure.

Even if a general git diff or git status would not see the changes, these commands should have a simple option to look more thoroughly. And git add, when mentioning a specific path, should never take shortcuts and always rehash the files specified, IMO.

Touching the files will only refresh the mtime in the directory, so not an expensive operation.

No, touching is not.

I think an efficiënt touch tool under Windows should do just that.

In closing: don't bother about my vss2git problem, it has been solved, thanks to all who contributed here. Instead, care about 100% reliability of Git in a Windows environment, in workflows that Windows developers are used to. (Where in the Git manuals is specified that git diff cannot be trusted under circumstances? Under Windows this is not just a "racy" problem!)

YueLinHo commented 9 years ago

^_^ Share my note of git index format: git_index_format

@j66st I got another tricky workaround: attack mtime of .git/index (to a very old date) I suppose it will trig the racy condition and let git to check file content only for the file which the mtime, ctime, file size are all identical. It comes from the code in is_racy_timestamp()

        istate->timestamp.sec <= ce->ce_stat_data.sd_mtime.sec

IOW, making istate->timestamp.sec(mtime of .git/index file) always small than that file's mtime. (but can not be 0)

So, no need to touch files. Just attack one.


Assume git want to support the option, maybe "--no-trust-mtime-of-index-file", I guess the modification to the above code would be

        (no_trust_mtime_of_index_file || istate->timestamp.sec <= ce->ce_stat_data.sd_mtime.sec)

(These code is base on git 1.9.0. The code has been changed in pu branch, but I suppose the logic is not changed after a quick review.)

j66st commented 9 years ago

@YueLinHo Thanks for the explanation and the good ideas. I hope you did not have to reverse-engineer everything. You can find a clear description of the index file structure in msysgit/git/Documentation/technical/index-format.txt And I assume you also have read msysgit/git/Documentation/technical/racy-git.txt.

Indeed, zeroing the mtime in the index is a way to force considering the file racy without changing git itself. Normally the index will contain more entries, we would have to walk the list and check every file's size and clear the mtime only if the size is equal.

The code change you propose does indeed hit the right spot, IMO. You would also have to include the USE_NSEC case:

static int is_racy_timestamp(const struct index_state *istate,
                 const struct cache_entry *ce)
{
    return (!S_ISGITLINK(ce->ce_mode) &&
        istate->timestamp.sec &&
        (ignore_cached_mtime ||
#ifdef USE_NSEC
         /* nanosecond timestamped files can also be racy! */
        (istate->timestamp.sec < ce->ce_stat_data.sd_mtime.sec ||
         (istate->timestamp.sec == ce->ce_stat_data.sd_mtime.sec &&
          istate->timestamp.nsec <= ce->ce_stat_data.sd_mtime.nsec))
#else
        istate->timestamp.sec <= ce->ce_stat_data.sd_mtime.sec
#endif
                  )
         );
}

What is left to do then is to wire this option flag to a command-line switch for all relevant commands (I don't know which of the many non-porcelain commands are involved) and to a config setting.

YueLinHo commented 9 years ago

Normally the index will contain more entries, we would have to walk the list and check every file's size and clear the mtime only if the size is equal.

There are 3 kind of mtime. One is for a file in working tree. Another one is for a entry data in index file. Last one is for index file itself. istate->timestamp.sec stands for mtime of index file. (not the mtime value saved in index file.)

So, the "attack" is not walk the all entries data. the attack means just modify the mtime of index file itself.

I hope you did not have to reverse-engineer everything. You can find a clear description of the index file structure in msysgit/git/Documentation/technical/index-format.txt And I assume you also have read msysgit/git/Documentation/technical/racy-git.txt.

Actually, I read them first, then study source code by MSVC debugger (based on TortoiseGit project), then have that index format note.

j66st commented 9 years ago

Aha, you mean not altering the index file contents, but just altering the mtime in the .git directory, something like:

touch -m --date=01/01/1980 .git/index

? So it is a touch, but only a single one, instead of touching all the files in the work dir. You call this 'attack', is that a command to do so?

YueLinHo commented 9 years ago

Yeah, touch only a single one => .git/index

Sorry for the "attack". :)

YueLinHo commented 9 years ago

That's why @kblees point this out:

the dixilinkerr.h file has a modification time of 2009-09-17, and the .git/index file has a modification time of 2015-01-23.

j66st commented 9 years ago

@sschuberth Tried to install the Git-SDK-v0.4 release downloaded here but one module seems to be missing:

mingw-get.exe: * WARNING * https://dl.bintray.com/git-for-windows/sdk-packages/bash-3.1.17-5-msys-1.0.18-bin.tar.lzma: opened with unexpected status: code = 404 mingw-get.exe: * WARNING * please report this to the mingw-get maintainer mingw-get.exe: * ERROR * Get package: https://dl.bintray.com/git-for-windows/sdk-packages/bash-3.1.17-5-msys-1.0.18-bin.tar.lzma: download failed

EDIT: Then I tried the link you gave Git-SDK-v0.4-37-g533f9aa.exe Same problem.

dscho commented 9 years ago

Wrong issue tracker.

sschuberth commented 9 years ago

@j66st The latest "stable" Git SDK download is outdated, please use the latest snapshot from here.

EDIT: Just saw your edit. Actually the snapshot should work fine except a know error like

mingw-get.exe: *** ERROR *** internal package specification error
mingw-get.exe: *** ERROR *** can't get 'tarname' for non-release element <<<unknown>>>
mingw-get.exe: *** ERROR *** please report this to the package maintainer
mingw-get.exe: *** ERROR *** wget-1.12-1-msys-1.0.13-bin.tar.lzma: requires...
mingw-get.exe: *** ERROR *** msys-libopenssl-*-msys-*-dll-100.tar: unresolved dependency (type 'eq')
mingw-get.exe: *** ERROR *** msys-libopenssl-*-msys-*-dll-100.tar: cannot identify any providing package
mingw-get.exe: *** ERROR *** please report this to the package maintainer

which you can safely ignore.

t-b commented 9 years ago

From my POV the issue here needs a proper fix. The solution outlined by Hannes on the mailing list does sound tempting. Now we just need someone to do the job ;)

For a quick and dirty solution, here is a short and workable example on how to add a new config option to git.

j66st commented 9 years ago

@sschuberth The snapshot install bails out immediately with "No packages found, please report this as an error to the developers."

kblees commented 9 years ago

Indeed I guess that the problem could be caused by extracting files from VSS will restore their original mtime (but don't call that "messing up")

The file modification time is updated by the operating system on a successful write() (POSIX) or WriteFile() (Windows) system call. Both POSIX and Win32 documentation are quite clear on this.

You are obviously looking for a time stamp representing the last logical content change by some user (or VSS check-in time or whatever). However, there is no field with such a meaning in file system stat data (neither on Windows nor POSIX/UNIX).

Just that VSS abuses the modification time for this doesn't make it correct or even smart to do so, as it may interfere with a variety of software that expects the documented behaviour.

So please kindly allow me to use terms like "mess up" or "abuse" for this rogue behaviour of VSS.

I really wished that Git could preserve timestamps too! ... I changed the vss2git tool to keep this information at least in the commit message.

Why not use the commit date (git commit --date=...) rather than the commit message?

I want a command switch (and a config setting) to bypass this caching and force Git to not use the mtime as a factor in determining file change.

Clearing cached work tree stat data is as simple as git read-tree HEAD, this forces Git to re-check the content of all files.

However, it naturally comes with a huge performance penalty (> factor 100), e.g. with the WebKit repo (~200k files, ~2GB data):

$ # cold file system cache, without cached stat data:
$ git read-tree HEAD
$ time git status -s -uno

real    13m32.234s
user    0m0.015s
sys     0m0.000s

$ # hot file system cache, without cached stat data:
$ git read-tree HEAD
$ time git status -s -uno

real    1m35.323s
user    0m0.015s
sys     0m0.000s

$ # hot file system cache, with cached stat data:
$ time git status -s -uno

real    0m0.796s
user    0m0.000s
sys     0m0.000s

From these numbers, its pretty obvious that its just not feasible to have Git look at the content of files that haven't changed.

Adding a command line switch or config option to ignore cached stat data would have to be discussed in the upstream mailing list, but given previous discussions about persisting / restoring mtime, I doubt that such a change would be accepted.

sschuberth commented 9 years ago

@j66st I've triggered a new snapshot build on our CI, which according to the installer test works fine (modulo the mentioned wget error which you can still ignore). You can still use this link to download the new installer as that always points to the latest snapshot build.

kblees commented 9 years ago

@t-b

The solution outlined by Hannes on the mailing list does sound tempting.

As Git did not detect a change in create date either, it seems that VSS just overwrites the file (rather than delete + recreate). So tracking the FileIndex as inode number wouldn't have helped.

AFAIK, only NTFS and ReFS have stable FileIndexes, it makes no sense to track the (generated) FileIndex of FAT volumes or network file systems.

...and _ino_t is just 16 bit on Windows, so we cannot store 64 bit (or 128 bit for ReFS) anyway.

j66st commented 9 years ago

@kblees

The file modification time is updated by the operating system on a successful write() (POSIX) or WriteFile() (Windows) system call. Both POSIX and Win32 documentation are quite clear on this.

Yes, but the CopyFile Windows system call preserves file attributes, including mtime. Clearly documented. Most file managers (Windows Explorer, Total Commander, etc.), unzippers, backup-restore tools will do so. If you consider VSS a tool to restore an old version, it is correct to also restore the file attributes. But this is configurable. Vss2git, though, does not use VSS, it replays history itself by directly accessing the VSS repository at the file level.

But forget about VSS and vss2git. That problem has been solved. I am worried that this problem might occur in daily use of Git in a Windows context, where copying files from flash drives, network drives, ZIP archives always preserves/restores the original mtime. I want an option to bypass the Git-assumption that it thinks to "know" files which are older than the index. This option may then be off by default, so no overhead at all for users who want to accept the small risk.

I really wished that Git could preserve timestamps too! ... I changed the vss2git tool to keep this information at least in the commit message.

Why not use the commit date (git commit --date=...) rather than the commit message?

I do so. But the commit date may differ considerably from the file's mtime, and the mtime differs between files. So I let vss2git in addition save a listing of file mtimes in the commit message. For the future (as soon as all team members want to use Git) the commit timestamps may be indeed sufficient, although loss of the real mtimes is still a pain.

I want a command switch (and a config setting) to bypass this caching and force Git to not use the mtime as a factor in determining file change.

Clearing cached work tree stat data is as simple as git read-tree HEAD, this forces Git to re-check the content of all files. However, it naturally comes with a huge performance penalty

I proposed to check only files of which the size has not been changed. This is normally only a small percentage of the files, and thus the performance overhead would hardly be noticeable.

given previous discussions about persisting / restoring mtime

I would appreciate if someone can point me to relevant discussions about mtimes and Git in a Windows environment, to enlighten me; apart from my vss2git problems I still can't estimate how big a risk ignoring this issue would be for the future.

kusma commented 9 years ago

Clearing cached work tree stat data is as simple as git read-tree HEAD, this forces Git to re-check the content of all files. However, it naturally comes with a huge performance penalty

I proposed to check only files of which the size has not been changed. This is normally only a small percentage of the files, and thus the performance overhead would hardly be noticeable.

Huh, are you saying the files who have changed their size is the common case...?!

j66st commented 9 years ago

Huh, are you saying the files who have changed their size is the common case...?!

Source code (text) files which are edited most of the time will get a different size. Binary files like icon bitmaps normally keep the same size. So by calling "git add" for a file the caller indicates that the file is added or modified; Git should then first compare the file size, only if the size is unchanged the index should not be trusted (because of Git's caching habits) and the file must be inspected.

kblees commented 9 years ago

I proposed to check only files of which the size has not been changed. This is normally only a small percentage of the files, and thus the performance overhead would hardly be noticeable.

You typically only work on a small portion of a project, and the majority of files are unchanged (i.e. same size, modification time, creation time and content). Always looking at the content of those unchanged files has quite noticeable performance impact, as measured above (0.7s vs. 95s).

Git should then first compare the file size, only if different the index should not be trusted and the file must be inspected.

This is what Git already does: if the size is different, Git knows for sure that the content changed (without even looking).

But it seems you're contradicting yourself here? Do you want to check files if the size has not been changed, or if the size is different?

j66st commented 9 years ago

@kblees Sorry for the confusion! I edited a mistake in my message a minute after posting, but you apparently got the text before that change. Please read my last paragraph again:

So by calling "git add" for a file the caller indicates that the file is added or modified; Git should then first compare the file size, only if the size is unchanged the index should not be trusted (because of Git's caching habits) and the file must be inspected.

So only added files have to be inspected, all "sleeping" files are left alone.

t-b commented 9 years ago

@kbless thanks for the pointers regarding FileIndexes. I would be okay if the solution does not enhance behaviour on FAT.

t-b commented 9 years ago

Can we agree on a minimal test case which shows the "unwanted" behaviour?

@j66st Do you think the following should result in git saying that file.txt was modified?

mkdir myrepo
cd myrepo
git init
echo "file contents" > file.txt
touch -t 201501011212 file.txt
git add file.txt
git commit -m "changes"
ed -s file.txt <<< $' 1,$s/file/abcd/g\nw'
touch -t 201501011212 file.txt
git status

The fancy ed line is just a way to change the file without changing its inode number. On my linux box with git 2.1.0 git does not recognize that file.txt has different contents.

It gives:

Initialized empty Git repository in /home/wg/thomas/devel/myrepo/.git/
1261690 file.txt
[master (root-commit) 61398a3] changes
 1 file changed, 1 insertion(+)
 create mode 100644 file.txt
1261690 file.txt
On branch master
nothing to commit, working directory clean
j66st commented 9 years ago

The discussion now is branching, but I hope you keep focus on the user's POV. So please step back for a minute and listen to this user story:

" I ran into the following problem with Git: I wanted to update a file of which a version is already in the repository. So I run the command "git add myfile". Looks OK, no error or warning returned. After adding some more files I do a "git commit". I get the normal success response. A few days later a build behaves strange: it turns out that myfile was never updated! Reconstructing the situation made me to reproduce the problem. I also found out that the problem only occurs if the size of the updated file matches the size of the file in the repository. "git status" and "git diff" see no difference. But with a simple "git show HEAD:myfile" I can easily prove that myfile is different.

This smells like Git is taking a shortcut and does not actually read myfile. After discussing this with a long-time Git user I decided to make a simple repro case and submit a bug report. Got suprised reactions even from some Git developers. The fog started clearing as the "racy git problem" was brought in.

Then the Product Master came in. From his throne in Linux Heaven he looked down to me, humble noob in Windows Hell. He said that such things only happen in Windows where sinners think they can turn back time. But it would be unlikely to occur anyway, and we have to live with it. After preaching extensively about how mtimes affect build tools He concluded with something like "Thou Shalt Not Play With Mtimes" and "This Is How Git Is Designed", arrogantly made clear that I already wasted too much of his and his disciples' valuable time . He pointed His Thumb down and closed the issue.

Then after a short silence slowly the mumbling began in the forum. Some good ideas bubbled to the surface. Several people offered help.

Am I, humble Git noob, asking too much? If I ask "git add myfile" isn't it clear what I want? Is it unlikely that the file has been modified? Is there ANY excuse (apart from a hash collision) for silently NOT updating my file if it is different? Some disciples started arguing about caching, filesize, inodes, mtimes (some of which is irrelevant in Windows hell). Others mention a big performance penalty that would be incurred. Heck, even if the repository is huge and my working tree is huge: I am asking to update only ONE file. Is it too much trouble for Git to even look at the file? Yes. Instead Git tries to be smart by looking at the directory information and draws the wrong conclusion.

So I think "git add" should always read the file(s) that I want to add before concluding that they are unchanged. I can accept that "git status" (which must be fast and has to observe my complete working tree) takes a shortcut and use cached stats and occasionally misses a change. Even a global "git diff" might not notice the change. But also "git diff myfile" should not be lazy and actually read the file. Git may try to be smarter than it is right now and update the cache with new wisdom obtained.

So if I am alone with my "woes", I will have to find my own solution. Git is open source, so I can make my private fix. If other Git users feel affected, then maybe we can join forces to find a suitable solution (like an option --i-confess-i-am-a-sinner-so-please-ignore-mtime) in a way that does not itch too much in heaven and makes hell still a place worth living in. "

OK guys, thanks for your time, back to work.

P.S.: Yes, I know I can go for a full refund.

j66st commented 9 years ago

@t-b

Can we agree on a minimal test case which shows the "unwanted" behaviour?

@j66st Do you think the following should result in git saying that file.txt was modified?

mkdir myrepo cd myrepo git init echo "file contents" > file.txt touch -t 201501011212 file.txt git add file.txt git commit -m "changes" ed -s file.txt <<< $' 1,$s/file/abcd/g\nw' touch -t 201501011212 file.txt git status

I cannot test it with the ed line (ed is not included in msysgit) but if I replace that line with

echo "abcd contents" > file.txt

then I get similar output under Windows. So this is a good testcase.

t-b commented 9 years ago

@j66st If my test case shows in your eyes unwanted behvaiour this is a upstream's git behaviour.

This behaviour is a bigger problem on windows as we have only time stamps with second precision and don't include the inode comparison check. Both things could be changed in windows if someone steps volunteering to do or pay.

YueLinHo commented 9 years ago

So please step back for a minute and listen to this user story: " I ran into...

^_^ Nice story. @j66st you are a good story teller.