Git does not see that file in working directory differs from HEAD

j66st commented 9 years ago

I found a strange problem in msysGit and I am wondering if it's a bug. I already discussed it here: https://groups.google.com/forum/#!topic/msysgit/6XLoSPH26kc (you can download my Demo.zip attachment there). It does not seem to happen in Git for Linux or OS X. The problem occurs in msysGit 1.9.0 and 1.9.5 (running the x86 version on a Windows 7 x64 system).

Essentially, the problem seems to be that Git for Windows assumes that two files are the same when both the timestamp and the file size match. Apparently the file contents are not inspected and the hash is not recalculated.
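To make the suspected condition concrete, here is a minimal Python sketch (not Git's code, and not the demo script from the zip). It writes a file, overwrites it with different content of the same length, and restores the old timestamps. The helper names `stat_signature` and `content_hash` are made up for this illustration, and only size and mtime are compared; ctime and the other fields Git records are ignored.

```python
import hashlib
import os
import tempfile


def stat_signature(path):
    """Return the (size, mtime in ns) pair a purely stat-based check would compare."""
    st = os.stat(path)
    return st.st_size, st.st_mtime_ns


def content_hash(path):
    """Hash the file contents, which is what a full comparison would look at."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dixilinkerr.h")
    with open(path, "wb") as f:
        f.write(b"#define CAT_DIXILINK_ERR 2000\n")
    st = os.stat(path)
    sig_before, hash_before = stat_signature(path), content_hash(path)

    # Different content of the same length, then put the old timestamps back.
    with open(path, "wb") as f:
        f.write(b"#define EXCPCAT_DIXILINK 2000\n")
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))
    sig_after, hash_after = stat_signature(path), content_hash(path)

    looks_unchanged = sig_before == sig_after   # what a size+mtime check concludes
    really_changed = hash_before != hash_after  # what the contents say
    print(looks_unchanged, really_changed)
```

If both values come out true, a check that trusts size and mtime alone would treat the modified file as unchanged.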

As mentioned I created a minimal demo package to easily prove the issue. Simply download the zip file, put it in a clean directory and run the enclosed script from bash. Below is a transcript of what happens when I run the script in my situation.

The issue is a real show-stopper in my automated migration from Visual SourceSafe to Git (in a test transferring 20,000 source files the problem caused loss of roughly 0.1% of the files from the history), so I hope for a quick solution.

Welcome to Git (version 1.9.5-preview20141217)

Run 'git help git' to display the help index.
Run 'git help <command>' to display help for specific commands.

joostadm@MUPC20 ~
$ cd /d/proj/try/git

joostadm@MUPC20 /d/proj/try/git
$ ls -l
total 15
-rw-r--r--    1 joostadm Administ    26968 Jan 23 18:43 DixiLink2.zip
-rwxr-xr-x    1 joostadm Administ     1848 Feb  5 14:05 weird-git-demo

joostadm@MUPC20 /d/proj/try/git
$ weird-git-demo
Note: No argument supplied, using DemoRepo by default.
+ mkdir DemoRepo
+ cd DemoRepo
+ unzip ../DixiLink2.zip
Archive:  ../DixiLink2.zip
   creating: .git/
 extracting: .git/COMMIT_EDITMSG
  inflating: .git/config
  inflating: .git/description
  inflating: .git/gitk.cache
 extracting: .git/HEAD
   creating: .git/hooks/
  inflating: .git/hooks/applypatch-msg.sample
  inflating: .git/hooks/commit-msg.sample
  inflating: .git/hooks/post-update.sample
  inflating: .git/hooks/pre-applypatch.sample
  inflating: .git/hooks/pre-commit.sample
  inflating: .git/hooks/pre-push.sample
  inflating: .git/hooks/pre-rebase.sample
  inflating: .git/hooks/prepare-commit-msg.sample
  inflating: .git/hooks/update.sample
  inflating: .git/index
   creating: .git/info/
  inflating: .git/info/exclude
   creating: .git/logs/
  inflating: .git/logs/HEAD
   creating: .git/logs/refs/
   creating: .git/logs/refs/heads/
  inflating: .git/logs/refs/heads/master
   creating: .git/objects/
   creating: .git/objects/29/
 extracting: .git/objects/29/c51bbb9ada43dbe98cbd5dbedbab56586b24c3
   creating: .git/objects/68/
 extracting: .git/objects/68/b1f129eb06b91fc6c9a3885fc0d24ad0cdaa50
   creating: .git/objects/7e/
 extracting: .git/objects/7e/8d94849a38370141b3ca2aab5c3cadc27934da
   creating: .git/objects/cf/
 extracting: .git/objects/cf/13dbb766d4aff04ad51d3cdac84fda67dc6f50
   creating: .git/objects/da/
 extracting: .git/objects/da/eca6b50a218a78e4cca5566965381e2d384b7f
   creating: .git/objects/de/
 extracting: .git/objects/de/3643596c73be4f6112d027616c8df31acd1b09
   creating: .git/objects/ec/
 extracting: .git/objects/ec/88fb155c5e9ebd8b3ead39345618517c0af6cf
   creating: .git/objects/info/
   creating: .git/objects/pack/
   creating: .git/refs/
   creating: .git/refs/heads/
 extracting: .git/refs/heads/master
   creating: .git/refs/tags/
 extracting: .git/refs/tags/2011-11-24
  inflating: dixilinkerr.h
+ echo 'This repository now contains version 1 and 2 of dixilinkerr.h:'
This repository now contains version 1 and 2 of dixilinkerr.h:
+ git log
commit cf13dbb766d4aff04ad51d3cdac84fda67dc6f50
Author: Joost <joost@localhost>
Date:   Tue Sep 11 09:57:32 2007 +0000

    @VSS 11-09-2007 11:08:35 [Edit] dixilinkerr.h

commit de3643596c73be4f6112d027616c8df31acd1b09
Author: Joost <joost@localhost>
Date:   Fri Jul 6 12:27:57 2007 +0000

    @VSS 01-06-2007 01:54:53 [Create] dixilinkerr.h
+ echo 'Version 3 is in our working directory:'
Version 3 is in our working directory:
+ head -n 12 dixilinkerr.h
// DixiLinkErr.h

#pragma once

#ifndef EXCPCAT_DIXILINK
#define EXCPCAT_DIXILINK 2000
#endif

#ifndef __DixiLinkErr_H_INCLUDED__
#define __DixiLinkErr_H_INCLUDED__

#ifndef IDL_ENUM
+ echo 'The file in our HEAD'
The file in our HEAD
+ git show HEAD:dixilinkerr.h
+ head -n 12
// DixiLinkErr.h

#pragma once

#ifndef CAT_DIXILINK_ERR
#define CAT_DIXILINK_ERR 2000
#endif

#ifndef __DixiLinkErr_H_INCLUDED__
#define __DixiLinkErr_H_INCLUDED__

#ifndef IDL_ENUM
+ echo 'You see? Working copy differs from HEAD.'
You see? Working copy differs from HEAD.
+ echo 'So the working directory is dirty, right? Ask Git:'
So the working directory is dirty, right? Ask Git:
+ git status
On branch master
nothing to commit, working directory clean
+ git diff
+ echo 'In my situation, Git sees no difference here, I THINK THIS IS WRONG!'
In my situation, Git sees no difference here, I THINK THIS IS WRONG!
+ echo 'So we are unable to add version 3 of our file. Let'\''s try it once again:'
So we are unable to add version 3 of our file. Let's try it once again:
+ git add dixilinkerr.h
+ git status
On branch master
nothing to commit, working directory clean
+ echo 'In my situation at this point there is nothing to add or commit.'
In my situation at this point there is nothing to add or commit.
+ echo 'End of demo'
End of demo

joostadm@MUPC20 /d/proj/try/git
$

kblees commented 9 years ago

> Yes, but the CopyFile Windows system call preserves file attributes,

Which is not true, CopyFile always sets the archive attribute.

> including mtime. Clearly documented.

The MSDN documentation doesn't mention file times at all (apart from community additions), so it may change at any time.

The current behaviour of copying just mtime (not ctime/atime) is completely broken, as it creates files that are last modified before they were created, which makes no sense at all.

AFAICT, the behaviour is also file system specific (e.g. last accessed time behaves differently on FAT and NTFS if NtfsDisableLastAccessUpdate=1 in the registry).

> I am worried that this problem might occur in daily use of Git in a Windows context, where copying files from flash drives, network drives, ZIP archives always preserves/restores the original mtime.

Even if CopyFile preserves mtime, you still need a file of the same size and mtime but different content, which IMO is very unlikely to happen in a normal development workflow. E.g. if you copy a file to the network, edit it there and copy it back, the mtime will have changed.

Apart from vss2git, the only currently known way to reproduce the problem requires a special Unix tool (touch -t), for which there isn't even an equivalent on Windows. If there was a use case that triggered the problem (without resetting mtime on purpose), I would agree that it is worth fixing...otherwise it's more of a theoretical exercise.

> Am I, humble Git noob, asking too much? If I ask "git add myfile" isn't it clear what I want? Is it unlikely that the file has been modified? Is there ANY excuse (apart from a hash collision) for silently NOT updating my file if it is different?

You're forgetting that most git commands support wildcards / pathspecs. So there is good reason to check stat data first, otherwise git add . would suffer the same performance penalties as git status.

So, dear humble Git noob, if you took such great care to produce a file with same size, mtime and ctime, yet different content, is it too much to ask to just touch myfile && git add myfile???

kblees commented 9 years ago

> This behaviour is a bigger problem on windows as we have only time stamps with second precision

The one second granularity is due to a former POSIX limitation of the stat structure (changed in 2013 to include nanoseconds). NTFS time stamp resolution is 100ns.

Tracking file times with nanosecond precision would make the "same mtime by chance" case even more unlikely than it already is. It won't fix the "reset mtime on purpose" case, though.
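A small Python sketch of why the granularity matters. This is an illustration only, not the patch mentioned in this thread; the function name `same_mtime` and the sample timestamps are assumptions. Two writes half a second apart compare equal at whole-second precision and differ at full precision.

```python
NS_PER_SECOND = 1_000_000_000


def same_mtime(a_ns, b_ns, use_nanoseconds):
    """Compare two timestamps given in nanoseconds since the epoch."""
    if use_nanoseconds:
        return a_ns == b_ns
    # Whole-second comparison: the sub-second part is thrown away.
    return a_ns // NS_PER_SECOND == b_ns // NS_PER_SECOND


first_write = 1_253_180_597 * NS_PER_SECOND  # an arbitrary whole second
second_write = first_write + 500_000_000     # half a second later

print(same_mtime(first_write, second_write, use_nanoseconds=False))  # True
print(same_mtime(first_write, second_write, use_nanoseconds=True))   # False
```

As noted above, finer precision only helps with accidental matches; a timestamp that was reset on purpose still compares equal.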

> and don't include the inode comparison check.

Tracking the inode number just helps detecting criss-cross renames. AFAICT it won't affect any of the problems discussed here.

That being said, implementing nanosecond precision is reasonably trivial and won't hurt performance, so I gave it a shot.

j66st commented 9 years ago

@kblees

> The MSDN documentation doesn't mention file times at all (apart from community additions), so it may change at any time.
>
> The current behaviour of copying just mtime (not ctime/atime) is completely broken, as it creates files that are last modified before they were created, which makes no sense at all.

I know, I agree it has always been a mess. But it has been this way since the DOS days, I don't expect MS to change it. Ctimes under Windows are hardly used by anyone, I guess. It would be useful to separately keep content modification timestamp and last write timestamps. Fact is that many Windows developers have a workflow where mtime is used as an always visible content modification time. Moving to Git will change that because Git (from their POV) messes up the mtimes. Major pain was, after the vss2git misbehavior, to find out which files were lost, because we couldn't use mtime anymore to identify if the working copy of a file was the most recent. The problem typically occurred with icon bitmaps, where only a few pixels were edited, so a simple diff is useless.

> You're forgetting that most git commands support wildcards / pathspecs. So there is good reason to check stat data first, otherwise git add . would suffer the same performance penalties as git status.

I know. But if I ask to "git add ." I will accept some delay because then I am asking for inspecting every file.

> if you took such great care to produce a file with same size, mtime and ctime, yet different content, is it too much to ask to just touch myfile && git add myfile???

I'm still not convinced that in my vss2git case there ever existed two files with the same mtime and different content. I would only expect such a thing to happen in a "racy" case with mtimes that are current, but then the index in my sample would contain a fresh timestamp. My VSS repository contained 3 versions of the file, with clearly distinct mtimes in the past, so different from the current time. So I still can't understand the content of the Git index in my Demo.zip sample. To find out, I will have to do a vss2git run from the debugger, break before every "git add" and take snapshots of the working directory and git repo. I'm too busy right now, but I will try that soon.

I agree, doing a "touch myfile && git add myfile" would be OK.

YueLinHo commented 9 years ago

> My VSS repository contained 3 versions of the file, with clearly distinct mtimes in the past, so different from the current time. So I still can't understand the content of the Git index in my Demo.zip sample.

1. Assume:
   (1) vss2git checks out the file with the current time
   (2) vss2git converts 2 revisions within 1 second
   Then it should be caught by the final racy-git check (that is, index file mtime <= file mtime). But in your Demo.zip, the file mtime is 2009-09-17. Obviously not the current time.

2. Assume vss2git checks out the file with its modification time or check-in time, and as you say: "clearly distinct mtimes in the past".
   Then it should be caught by the check for a different mtime between the file and the related index entry data. (But in your Demo.zip, the mtimes are the same.)

Something is wrong... :-/ @kblees Am I missing something? :)

@j66st a few questions: What kind of file mtime does vss2git set on checkout? Current time? Modification time? Or ...? Is the Demo.zip exactly the situation you got? How did you make that Demo.zip?

j66st commented 9 years ago

@YueLinHo: Here is the export section of the vss2git log file from the session in which I built Demo.zip. Demo.zip simply contains the resulting directory. You can see every step; all Git commands and their responses have been logged.

Initializing Git repository
Executing: C:\Program Files (x86)\Git\cmd\git.exe init
>Initialized empty Git repository in c:/proj/try/git/clonefault/DixiLink2/.git/

Replaying changeset 1 from 08/23/2010 22:25:07
*# c:\proj\try\git\clonefault\DixiLink2: Create SuniLink

Replaying changeset 2 from 07/06/2007 14:27:57
------------------------------------------------------------
Preprocessing shared file mappings

*+ $SuniLink(RKDAAAAA): Share dixilinkerr.h(ZJCAAAAA)
------------------------------------------------------------
*#File Create dixilinkerr.h(ZJCAAAAA)@1 by Joost at 2007-07-06T14:27:57
c:\proj\try\git\clonefault\DixiLink2\dixilinkerr.h: Create revision 1
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -- dixilinkerr.h
Committing changeset 2 from 07/06/2007 14:27:57
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -A
Generating temp file for comment: @VSS 01-06-2007 01:54:53 [Create] dixilinkerr.h
Executing: C:\Program Files (x86)\Git\cmd\git.exe commit -F C:\Users\joostadm\AppData\Local\Temp\tmpAD3.tmp
>[master (root-commit) de36435] @VSS 01-06-2007 01:54:53 [Create] dixilinkerr.h
> 1 file changed, 139 insertions(+)
> create mode 100644 dixilinkerr.h

Replaying changeset 3 from 09/11/2007 11:57:32
c:\proj\try\git\clonefault\DixiLink2\dixilinkerr.h: Edit revision 2
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -- dixilinkerr.h
Committing changeset 3 from 09/11/2007 11:57:32
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -A
Generating temp file for comment: @VSS 11-09-2007 11:08:35 [Edit] dixilinkerr.h
Executing: C:\Program Files (x86)\Git\cmd\git.exe commit -F C:\Users\joostadm\AppData\Local\Temp\tmpA32C.tmp
>[master cf13dbb] @VSS 11-09-2007 11:08:35 [Edit] dixilinkerr.h
> 1 file changed, 1 deletion(-)

Replaying changeset 4 from 09/17/2009 11:43:39
c:\proj\try\git\clonefault\DixiLink2\dixilinkerr.h: Edit revision 3
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -- dixilinkerr.h
Committing changeset 4 from 09/17/2009 11:43:39
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -A
Generating temp file for comment: @VSS 17-09-2009 11:43:17 [Edit] dixilinkerr.h
Executing: C:\Program Files (x86)\Git\cmd\git.exe commit -F C:\Users\joostadm\AppData\Local\Temp\tmpF207.tmp
>On branch master
>nothing to commit, working directory clean

Replaying changeset 5 from 08/23/2010 22:25:07
*# c:\proj\try\git\clonefault\DixiLink2: Share dixilinkerr.h

Replaying changeset 6 from 11/24/2011 13:28:48
Creating tag 2011-11-24
Generating temp file for comment [...]
Executing: C:\Program Files (x86)\Git\cmd\git.exe tag -F C:\Users\joostadm\AppData\Local\Temp\tmpEDF.tmp -- 2011-11-24
------------------------------------------------------------
Git export complete in 00:35:45
Replay time: 00:00:00
Git time: 00:00:01

The actual timestamps are not logged. I will do a new run to find out.

kblees commented 9 years ago

In your .git/index file, both ctime and mtime are 0x4ab204b5 = 2009-09-17T09:43:17 (see YueLinHo's screenshot above), which seems to be the mtime of V3.

From the vss2git sources (https://code.google.com/p/vss2git/source/browse/Vss2Git/GitExporter.cs#689), at least the ctime should have been that of V1 (2007-06-01... ~= 0x465f604d).

This also means that when git saw the content of V2, the file already had mtime/ctime of V3, so something is seriously messing up your file times here.

The timestamps you dump to the commit message seem to be correct, where do you get these from (file system or vss2git classes)? Is it possible that your 'dump file times' patch screwed things up?
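For anyone who wants to check the two hex values quoted above, a short Python sketch that decodes them as seconds since the Unix epoch. The helper name `decode_index_time` is made up for this illustration.

```python
from datetime import datetime, timezone


def decode_index_time(hex_seconds):
    """Convert a hex seconds-since-epoch value to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(hex_seconds, 16), tz=timezone.utc)


v3 = decode_index_time("0x4ab204b5")  # value found in .git/index
v1 = decode_index_time("0x465f604d")  # approximate value expected for V1

print(v3.isoformat())  # 2009-09-17T09:43:17+00:00
print(v1.isoformat())  # 2007-05-31T23:54:53+00:00
```

The second value is 01:54:53 on 2007-06-01 at UTC+2, which lines up with the "@VSS 01-06-2007 01:54:53" stamp in the commit message, assuming the machine ran at UTC+2 (the two-hour gap between the changeset times in the vss2git log and the +0000 commit dates points that way).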

t-b commented 9 years ago

@kblees said:

> Tracking the inode number just helps detecting criss-cross renames. AFAICT it won't affect any of the problems discussed here.

Hmm yes. Thanks for the nanoseconds patch btw.

YueLinHo commented 9 years ago

The logic as I understand it from the vanilla git code:

for each entry data in index and file stat in working tree
{
    if (is mtime different)
        changed |= MTIME_CHANGED;
    if (is ctime different)
        changed |= CTIME_CHANGED;

    if (is mtime nanosec different)
        changed |= MTIME_CHANGED;
    if (is ctime nanosec different)
        changed |= CTIME_CHANGED;

    if (is uid gid different)
        changed |= OWNER_CHANGED;
    if (is ino different)
        changed |= INODE_CHANGED;
    if (is dev different)
        changed |= INODE_CHANGED;

    if (is size different)
        changed |= DATA_CHANGED;

    // the 1st checking for racy-git problem
    if (entry data file size is zero)
        if (!is_empty_blob_sha1(entry data sha1))
            changed |= DATA_CHANGED;

    // the 2nd checking for racy-git problem
    if (!changed && index_file_mtime <= entry_data_mtime)
        changed |= ce_modified_check_fs();  // this DOES check file content
}
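The pseudocode above can be modelled in a few lines of Python to see why the Demo.zip case slips through. This is a simplified sketch, not Git's implementation: the owner, inode, device and empty-blob checks are left out, the flag values are arbitrary, and the `StatData` class and the sample size are made up for illustration.

```python
from dataclasses import dataclass

# Flag names follow the pseudocode; the numeric values are arbitrary.
MTIME_CHANGED = 0x1
CTIME_CHANGED = 0x2
DATA_CHANGED = 0x4


@dataclass
class StatData:
    mtime_ns: int
    ctime_ns: int
    size: int


def stat_changes(entry: StatData, on_disk: StatData) -> int:
    """Return a bitmask of the stat fields that differ."""
    changed = 0
    if entry.mtime_ns != on_disk.mtime_ns:
        changed |= MTIME_CHANGED
    if entry.ctime_ns != on_disk.ctime_ns:
        changed |= CTIME_CHANGED
    if entry.size != on_disk.size:
        changed |= DATA_CHANGED
    return changed


def must_compare_content(entry: StatData, on_disk: StatData, index_mtime_ns: int) -> bool:
    """Racy check: stat data match, but the index is not newer than the entry."""
    return stat_changes(entry, on_disk) == 0 and index_mtime_ns <= entry.mtime_ns


SECOND = 1_000_000_000
v3_time = 0x4AB204B5 * SECOND                # mtime/ctime recorded in the Demo.zip index
entry = StatData(v3_time, v3_time, 4096)     # size is a made-up value
on_disk = StatData(v3_time, v3_time, 4096)   # different content, identical stat data
index_written = v3_time + 5 * 365 * 24 * 3600 * SECOND  # index written years later

print(stat_changes(entry, on_disk))                         # 0
print(must_compare_content(entry, on_disk, index_written))  # False
print(must_compare_content(entry, on_disk, v3_time))        # True
```

With identical stat data and an index file that is newer than the entry's mtime, neither check fires, so the content is never read. Only when the index is not newer than the entry does the content comparison run.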

j66st commented 9 years ago

@YueLinHo

> The logic I realized from vanilla git code ...

You picture it very clearly in this block of pseudo-code!

Yes, this is also the logic I worked out; it took me quite some time before I understood the role of the cache entries and of the index file's own timestamp in relation to the working tree.

@kblees

> Is it possible that your 'dump file times' patch screwed things up?

Yes, it turned out that my patched version used a field that contains the archived mtime of the last version of the file (even during replay of older versions) to set the mtime of the reconstructed file. My intention was to have the working directory reflect the mtimes we are used to, and it did. I did not check the mtimes of the intermediary reconstructed versions; I did not care about them because Git does not store them anyway. I now understand that this caused the "racy" condition which confuses Git. Since I discovered that Git will change the mtimes in my working directory anyway with every checkout of a different branch, it makes no sense to preserve mtimes any longer. So I have now changed vss2git back so that the intermediary mtimes reflect the changeset time (also used as the commit date in Git). This is the safest bet to avoid any racy conditions, because vss2git's changesets are by definition distinct in time.
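The idea of stamping each reconstructed revision with its changeset time can be sketched in Python as follows. This is not vss2git's code (vss2git is a C# program); the helper `stamp_with_changeset_time` and the file contents are hypothetical, and the two times are the commit dates shown by git log in the demo transcript.

```python
import os
import tempfile
from datetime import datetime, timezone


def stamp_with_changeset_time(path, changeset_time):
    """Set atime and mtime of `path` to a timezone-aware changeset time."""
    seconds = changeset_time.timestamp()
    os.utime(path, (seconds, seconds))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dixilinkerr.h")
    revisions = [
        (b"revision 1\n", datetime(2007, 7, 6, 12, 27, 57, tzinfo=timezone.utc)),
        (b"revision 2\n", datetime(2007, 9, 11, 9, 57, 32, tzinfo=timezone.utc)),
    ]
    for content, when in revisions:
        with open(path, "wb") as f:
            f.write(content)
        # Each revision gets its own, distinct mtime before it is added.
        stamp_with_changeset_time(path, when)
        # A real migration would run `git add` and `git commit` at this point.
    final_mtime = int(os.stat(path).st_mtime)
```

Because every revision carries a different mtime, the stat data changes between successive `git add` calls and the new content is picked up.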

vss2git is a complicated program, basically it imitates the full retrieval logic of VSS. I did not study its data structures in full depth. My major goal to patch vss2git was to better handle Shared and Branched files in the VSS repo (note these words have a different meaning in Git) because otherwise most of the history would not be transferred. And the second goal was to keep the original mtimes in the commit message, because some team members are not yet ready to change their workflow and want to see the last real modification time.

To conclude:

First I was stunned to see file changes to be silently ignored by Git. And to see the problem being waved aside by the community.
Since my files had the proper mtimes, I did not expect vss2git was misbehaving, but now I finally understand the logic of the cache and the racy problem, I know that I unconsciously caused a racy condition.
Now I better understand what's going on, I don't think it is likely for such thing to happen in daily use. The nanosecond fix even further reduces the likelihood for certain filesystems. So I think indeed it is not worth the time to further fix it. Although it might make sense to mention it briefly in the manual that mtimes/filesize may have unexpected effects on Git's caching. The pseudologic of @YueLinHo 's post is a concise and clear explanation. The racy-git.txt is a good, detailed document for insiders, but new Git users seeing the symptoms won't immediately associate the problems they run into with the word "racy" and find this document.
I still think that some extension to Git that would preserve mtimes would be welcomed by many Windows users.

Everyone who contributed, thanks a lot for your help and patience!

YueLinHo commented 9 years ago

I learned what I wanted and enjoyed this conversation with you top guys, so thanks to all of you here. ^o^d 善哉 (well done)!

webmaster33 commented 9 years ago

I have a similar case, which may belong here. I wrote a search & replace perl script, which recursively searches files and replaces text in them. After replace, it restores original modification time (mtime) of file.

Interestingly, git status doesn't show the replaced text as a change if the mtime is the same as the original.

Is there a way to force git status to show changes, even if the file dates are the same?

I tried to set these in the core section:

    trustctime = false
    checkStat = minimal

Unfortunately the change is still not detected :( It seems there isn't a way to force a fallback to checking file contents and completely ignore the file modification date :(

The following workarounds make the changed files appear as changed again:

a) touch -m --date=01/01/1980 .git/index
   So it is a touch, but only a single one, instead of touching all the files in the work dir.

b) git read-tree HEAD
   Also works well.

But these are just workarounds, not real permanent solutions.
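For systems without GNU touch, workaround (a) can be approximated in Python. This is only a sketch of the same trick: the function name `backdate_mtime` is made up, the demo runs on a stand-in file rather than a real .git/index, and the date is taken as UTC whereas touch interprets it as local time. Going by the pseudocode earlier in this thread, an index file older than its entries makes the racy check fall back to comparing content.

```python
import os
import tempfile

JAN_1_1980_UTC = 315532800  # 1980-01-01T00:00:00Z in seconds since the Unix epoch


def backdate_mtime(path, epoch_seconds=JAN_1_1980_UTC):
    """Set only the mtime of `path`, keeping its atime (like `touch -m`)."""
    atime = os.stat(path).st_atime
    os.utime(path, (atime, epoch_seconds))


with tempfile.TemporaryDirectory() as tmp:
    stand_in = os.path.join(tmp, "index")  # stand-in for .git/index
    open(stand_in, "wb").close()
    backdate_mtime(stand_in)
    new_mtime = int(os.stat(stand_in).st_mtime)
    print(new_mtime)
```

In a real repository the call would be `backdate_mtime(".git/index")`. Like the touch command it imitates, this is a workaround that relies on how the racy check happens to behave, not a supported switch.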

dscho commented 9 years ago

> I wrote a search & replace perl script, which recursively searches files and replaces text in them. After replace, it restores original modification time (mtime) of file.

By this "restoring" of the original modification time you broke the contract: the mtime should reflect the time of the latest change. You replaced something, i.e. changed the file contents. Git expects the mtime to be adjusted in that case. By painstakingly faking it back to its original value you essentially told Git: don't worry, this file has not changed since you last saw it.

There is nothing Git can do to outguess you when you go out of your way to break the most fundamental promise of the mtime value.

PhilipOakley commented 9 years ago

@webmaster33 The issue, as I see it, is that currently Git does not expect this behaviour, and as such has no special mechanism for a forced add of a filename, with an identical mtime (etc) stat but different content, to the staging area, such that when the next commit is made, the new sha1 hash, and content, is used.

I would not expect that Git would ever want to try and detect a content change, without mtime (etc) stat change, and try to show it as 'changed', i.e. unstaged. I'd hope it could allow a forced add (but someone [you?] would have to code it) to allow these special cases where you/your code already knows that the file has changed and the update can be forced so that the 'unstaged' indication would never show itself. (note there is a separation / split of concerns, so the suggested solution changed)

The main use case here appears to be to shift data from one version control system to another, and retain a coarse mtime value held by the old system (probably as a text field, as git does not itself record it) when recreating revisions in the git system, and what is wanted is a "this is what it is, write it, blinkers-on" approach to copying the data across.

[As I type this I realise there may be some low level blob writing plumbing action that I can't remember the name of that does this (e.g. git-hash-object etc.), the manual is quite a Full manual, so worth a read]

YueLinHo commented 9 years ago

> [As I type this I realise there may be some low level blob writing plumbing action that I can't remember the name of that does this (e.g. git-hash-object etc.), the manual is quite a Full manual, so worth a read]

My memo:

    git hash-object -w
    git update-index --add --cacheinfo
    git write-tree
    git commit-tree

From: Chapter 10, Pro Git 2 Book
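As a companion to the memo, here is a Python sketch of how the object id printed by git hash-object is derived for a blob in a SHA-1 repository: the SHA-1 of a short header followed by the raw content. It depends on the content only, never on mtime or other stat data. The function name `git_blob_id` is made up; it computes the id but does not write anything to the object database the way `-w` does.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a blob with this content."""
    header = b"blob %d\0" % len(content)  # object type, content length, NUL byte
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b""))                # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_id(b"test content\n"))  # d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

An id obtained this way (or from `git hash-object -w`) is what `git update-index --add --cacheinfo` records for a path, which is why the plumbing route does not depend on the file's timestamps.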

PhilipOakley commented 9 years ago

>> [As I type this I realise there may be some low level blob writing plumbing action that I can't remember the name of that does this (e.g. git-hash-object etc.), the manual is quite a Full manual, so worth a read]
>
> My memo: git hash-object -w, git update-index --add --cacheinfo, git write-tree, git commit-tree
>
> From: Chapter 10 of Pro Git 2 Book

Thanks, that's useful.

Let's hope it helps the OP with the macro/script for the transfer from VSS... yeah.

Philip

msysgit / git, issue #312