j66st closed this issue 9 years ago
Yes, but the CopyFile Windows system call preserves file attributes,
Which is not true, CopyFile always sets the archive attribute.
including mtime. Clearly documented.
The MSDN documentation doesn't mention file times at all (apart from community additions), so it may change at any time.
The current behaviour of copying just mtime (not ctime/atime) is completely broken, as it creates files that are last modified before they were created, which makes no sense at all.
AFAICT, the behaviour is also file system specific (e.g. last accessed time behaves differently on FAT and NTFS if NtfsDisableLastAccessUpdate=1 in the registry).
I am worried that this problem might occur in daily use of Git in a Windows context, where copying files from flash drives, network drives, ZIP archives always preserves/restores the original mtime.
Even if CopyFile preserves mtime, you still need a file of same size and mtime but different content. Which IMO is very unlikely to happen in a normal development workflow. E.g. if you copy a file to the network, edit it there and copy it back, the mtime will have changed.
Apart from vss2git, the only currently known way to reproduce the problem requires a special Unix tool (touch -t), for which there isn't even an equivalent on Windows. If there were a use case that triggered the problem (without resetting mtime on purpose), I would agree that it is worth fixing... otherwise it's more of a theoretical exercise.
Am I, humble Git noob, asking too much? If I ask "git add myfile" isn't it clear what I want? Is it unlikely that the file has been modified? Is there ANY excuse (apart from a hash collision) for silently NOT updating my file if it is different?
You're forgetting that most git commands support wildcards / pathspecs. So there is good reason to check stat data first, otherwise git add . would suffer the same performance penalties as git status.
So, dear humble Git noob, if you took such great care to produce a file with same size, mtime and ctime, yet different content, is it too much to ask to just touch myfile && git add myfile???
This behaviour is a bigger problem on windows as we have only time stamps with second precision
The one second granularity is due to a former POSIX limitation of the stat structure (changed in 2013 to include nanoseconds). NTFS time stamp resolution is 100ns.
Tracking file times with nanosecond precision would make the "same mtime by chance" case even more unlikely than it already is. It won't fix the "reset mtime on purpose" case, though.
and don't include the inode comparison check.
Tracking the inode number just helps detecting criss-cross renames. AFAICT it won't affect any of the problems discussed here.
That being said, implementing nanosecond precision is reasonably trivial and won't hurt performance, so I gave it a shot.
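To make the granularity point concrete, here is a minimal Python sketch (not Git's actual code). It assumes the usual FILETIME convention of 100 ns ticks counted from 1601-01-01 UTC; the helper name and the sample tick values are made up for illustration. Two writes that land in the same second look identical at one-second precision but differ at NTFS resolution.

```python
# Hypothetical helper, not part of Git: convert a Windows FILETIME
# (100 ns ticks since 1601-01-01 UTC) to nanoseconds since the Unix epoch.
EPOCH_DIFF_TICKS = 116_444_736_000_000_000  # ticks from 1601-01-01 to 1970-01-01
TICKS_PER_SECOND = 10_000_000


def filetime_to_unix_ns(ticks: int) -> int:
    return (ticks - EPOCH_DIFF_TICKS) * 100


# Two made-up FILETIME values, 0.25 s apart, inside the same second.
first = EPOCH_DIFF_TICKS + 1_253_180_597 * TICKS_PER_SECOND + 1_000_000
second = first + 2_500_000

first_ns = filetime_to_unix_ns(first)
second_ns = filetime_to_unix_ns(second)

# Truncating to whole seconds hides the difference between the two writes.
same_at_second_precision = first_ns // 10**9 == second_ns // 10**9
same_at_ns_precision = first_ns == second_ns
print(same_at_second_precision, same_at_ns_precision)  # True False
```

As noted above, finer precision only makes an accidental match less likely; a deliberately restored mtime still matches at any resolution.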
@kblees
The MSDN documentation doesn't mention file times at all (apart from community additions), so it may change at any time.
The current behaviour of copying just mtime (not ctime/atime) is completely broken, as it creates files that are last modified before they were created, which makes no sense at all.
I know, I agree it has always been a mess. But it has been this way since the DOS days, I don't expect MS to change it. Ctimes under Windows are hardly used by anyone, I guess. It would be useful to separately keep content modification timestamp and last write timestamps. Fact is that many Windows developers have a workflow where mtime is used as an always visible content modification time. Moving to Git will change that because Git (from their POV) messes up the mtimes. Major pain was, after the vss2git misbehavior, to find out which files were lost, because we couldn't use mtime anymore to identify if the working copy of a file was the most recent. The problem typically occurred with icon bitmaps, where only a few pixels were edited, so a simple diff is useless.
You're forgetting that most git commands support wildcards / pathspecs. So there is good reason to check stat data first, otherwise git add . would suffer the same performance penalties as git status.
I know. But if I ask to "git add ." I will accept some delay because then I am asking for inspecting every file.
if you took such great care to produce a file with same size, mtime and ctime, yet different content, is it too much to ask to just touch myfile && git add myfile???
I'm still not convinced that in my vss2git case there ever existed two files with same mtime and different content. I only would expect such thing to happen in a "racy" case with mtimes that are current, but then the index in my sample would contain a fresh timestamp. My VSS repository contained 3 versions of the file, with clearly distinct mtimes in the past, so different from the current time. So I still can't understand the content of the Git index in my Demo.zip sample. To find out, I will have to do a vss2git run from the debugger, break before every "git add" and take snapshots of the working directory and git repo. I'm too busy right now, but I will try that soon.
I agree, doing a "touch myfile && git add myfile" would be OK.
My VSS repository contained 3 versions of the file, with clearly distinct mtimes in the past, so different from the current time. So I still can't understand the content of the Git index in my Demo.zip sample.
1. Assume: (1) vss2git checks out the file with the current time, and (2) vss2git converts 2 revisions within 1 second. Then it should be caught by the final racy-git check (that is, index file mtime <= file mtime). But in your Demo.zip the file mtime is 2009-09-17. Obviously, not the current time.
2. Assume vss2git checks out the file with its modification time or check-in time, and, as you say, the mtimes are clearly distinct and in the past. Then it should be caught by the check for a different mtime between the file and the related index entry data. (But in your Demo.zip the mtimes are the same.)
Something is wrong... :-/ @kblees Am I missing something? :)
@j66st a few questions: What kind of file mtime does vss2git set on checkout? current time? modification? or ... ? Is the Demo.zip exactly the situation you got? How did you make that Demo.zip?
@YueLinHo : Here is the export section of the vss2git log file of the session where I built the Demo.zip from. Demo.zip simply contains the resulting directory. You can see every step, all Git commands and their response have been logged.
Initializing Git repository
Executing: C:\Program Files (x86)\Git\cmd\git.exe init
>Initialized empty Git repository in c:/proj/try/git/clonefault/DixiLink2/.git/
Replaying changeset 1 from 08/23/2010 22:25:07
*# c:\proj\try\git\clonefault\DixiLink2: Create SuniLink
Replaying changeset 2 from 07/06/2007 14:27:57
------------------------------------------------------------
Preprocessing shared file mappings
*+ $SuniLink(RKDAAAAA): Share dixilinkerr.h(ZJCAAAAA)
------------------------------------------------------------
*#File Create dixilinkerr.h(ZJCAAAAA)@1 by Joost at 2007-07-06T14:27:57
c:\proj\try\git\clonefault\DixiLink2\dixilinkerr.h: Create revision 1
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -- dixilinkerr.h
Committing changeset 2 from 07/06/2007 14:27:57
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -A
Generating temp file for comment: @VSS 01-06-2007 01:54:53 [Create] dixilinkerr.h
Executing: C:\Program Files (x86)\Git\cmd\git.exe commit -F C:\Users\joostadm\AppData\Local\Temp\tmpAD3.tmp
>[master (root-commit) de36435] @VSS 01-06-2007 01:54:53 [Create] dixilinkerr.h
> 1 file changed, 139 insertions(+)
> create mode 100644 dixilinkerr.h
Replaying changeset 3 from 09/11/2007 11:57:32
c:\proj\try\git\clonefault\DixiLink2\dixilinkerr.h: Edit revision 2
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -- dixilinkerr.h
Committing changeset 3 from 09/11/2007 11:57:32
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -A
Generating temp file for comment: @VSS 11-09-2007 11:08:35 [Edit] dixilinkerr.h
Executing: C:\Program Files (x86)\Git\cmd\git.exe commit -F C:\Users\joostadm\AppData\Local\Temp\tmpA32C.tmp
>[master cf13dbb] @VSS 11-09-2007 11:08:35 [Edit] dixilinkerr.h
> 1 file changed, 1 deletion(-)
Replaying changeset 4 from 09/17/2009 11:43:39
c:\proj\try\git\clonefault\DixiLink2\dixilinkerr.h: Edit revision 3
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -- dixilinkerr.h
Committing changeset 4 from 09/17/2009 11:43:39
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -A
Generating temp file for comment: @VSS 17-09-2009 11:43:17 [Edit] dixilinkerr.h
Executing: C:\Program Files (x86)\Git\cmd\git.exe commit -F C:\Users\joostadm\AppData\Local\Temp\tmpF207.tmp
>On branch master
>nothing to commit, working directory clean
Replaying changeset 5 from 08/23/2010 22:25:07
*# c:\proj\try\git\clonefault\DixiLink2: Share dixilinkerr.h
Replaying changeset 6 from 11/24/2011 13:28:48
Creating tag 2011-11-24
Generating temp file for comment [...]
Executing: C:\Program Files (x86)\Git\cmd\git.exe tag -F C:\Users\joostadm\AppData\Local\Temp\tmpEDF.tmp -- 2011-11-24
------------------------------------------------------------
Git export complete in 00:35:45
Replay time: 00:00:00
Git time: 00:00:01
The actual timestamps are not logged. I will do a new run to find out.
In your .git/index file, both ctime and mtime are 0x4ab204b5 = 2009-09-17T09:43:17 (see YueLinHo's screenshot above), which seems to be the mtime of V3.
From the vss2git sources (https://code.google.com/p/vss2git/source/browse/Vss2Git/GitExporter.cs#689), at least the ctime should have been that of V1 (2007-06-01... ~= 0x465f604d).
This also means that when git saw the content of V2, the file already had mtime/ctime of V3, so something is seriously messing up your file times here.
The timestamps you dump to the commit message seem to be correct, where do you get these from (file system or vss2git classes)? Is it possible that your 'dump file times' patch screwed things up?
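For anyone who wants to check the hex values quoted above, here is a small Python sketch. It assumes the index stores these fields as whole seconds since the Unix epoch; the function name is made up for this example.

```python
from datetime import datetime, timezone


def index_time_to_utc(hex_seconds: str) -> str:
    """Decode a hex seconds-since-epoch value into an ISO-like UTC string."""
    seconds = int(hex_seconds, 16)  # int() accepts the "0x" prefix with base 16
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


v3_utc = index_time_to_utc("0x4ab204b5")
v1_utc = index_time_to_utc("0x465f604d")
print(v3_utc)  # 2009-09-17T09:43:17
print(v1_utc)  # 2007-05-31T23:54:53
```

The second value is in UTC; it would read 2007-06-01 01:54:53 on a machine two hours ahead of UTC, which would be consistent with the "@VSS 01-06-2007 01:54:53" commit message in the log (the machine's time zone is an assumption here).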
@kblees said:
Tracking the inode number just helps detecting criss-cross renames. AFAICT it won't affect any of the problems discussed here.
Hmm yes. Thanks for the nanoseconds patch btw.
The logic I realized from vanilla git code:
for each entry data in index and file stat in working tree
{
    if (is mtime different)
        changed |= MTIME_CHANGED;
    if (is ctime different)
        changed |= CTIME_CHANGED;
    if (is mtime nanosec different)
        changed |= MTIME_CHANGED;
    if (is ctime nanosec different)
        changed |= CTIME_CHANGED;
    if (is uid gid different)
        changed |= OWNER_CHANGED;
    if (is ino different)
        changed |= INODE_CHANGED;
    if (is dev different)
        changed |= INODE_CHANGED;
    if (is size different)
        changed |= DATA_CHANGED;
    // the 1st check for the racy-git problem
    if (entry data file size is zero)
        if (!is_empty_blob_sha1(entry data sha1))
            changed |= DATA_CHANGED;
    // the 2nd check for the racy-git problem
    if (!changed && index_file_mtime <= entry_data_mtime)
        changed |= ce_modified_check_fs(); // this DOES check the file content
}
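The pseudo-code above can be tried out with a reduced Python model. This is only a sketch under assumptions: the StatData class, the is_modified function, the file size and the index mtime are all invented for illustration, only mtime, ctime and size are modelled, and the zero-size (1st racy) check is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StatData:
    """Hypothetical, reduced stand-in for the stat fields cached per index entry."""
    mtime: int  # seconds since the epoch
    ctime: int  # seconds since the epoch
    size: int   # bytes


def is_modified(cached: StatData, on_disk: StatData, index_mtime: int,
                content_differs: Callable[[], bool]) -> bool:
    # Any stat difference marks the file as changed without reading it.
    if cached != on_disk:
        return True
    # Racy case: the entry is not older than the index file itself,
    # so the stat data cannot be trusted and the content is compared.
    if index_mtime <= cached.mtime:
        return content_differs()
    # Otherwise the stat data is trusted and the content is never read.
    return False


v3_time = 0x4AB204B5  # mtime/ctime reported for the Demo.zip index entry
cached = StatData(mtime=v3_time, ctime=v3_time, size=4000)   # size is made up
on_disk = StatData(mtime=v3_time, ctime=v3_time, size=4000)  # new content, same stat

# Index written long after the (old) file mtime: the change is missed.
missed = is_modified(cached, on_disk, index_mtime=v3_time + 10**8,
                     content_differs=lambda: True)
# Index written in the same second as the file: the racy check reads the content.
caught = is_modified(cached, on_disk, index_mtime=v3_time,
                     content_differs=lambda: True)
print(missed, caught)  # False True
```

In this model a file whose stat data matches an old index entry is never re-read, which is the situation described for mtimes set far in the past.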
@YueLinHo
The logic I realized from vanilla git code ...
You picture it very clearly in this block of pseudo-code!
Yes, this is also the logic I found. It took me quite some time before I understood the role of the cache entries and of the timestamp of the index file itself in relation to the working tree.
@kblees
Is it possible that your 'dump file times' patch screwed things up?
Yes, it turned out that my patched version used a field that contains the archived mtime of the last version of the file (even during replay of older versions) to set the mtime of the reconstructed file. My intention was to have the working directory reflect the mtimes we are used to. And it did. I did not check the mtimes of the intermediary reconstructed versions because I did not care about them, as they are not stored by Git anyway. I now understand that this caused the "racy" condition which confuses Git. Since I discovered that Git will mess up the mtimes in my working directory anyway with every checkout of a different branch, it makes no sense to preserve mtimes any longer. So I have now changed vss2git back so that the intermediary mtimes reflect the changeset time (also used as the commit date in Git). This is the safest bet to avoid any racy conditions, because vss2git's changesets are by definition distinct in time.
vss2git is a complicated program, basically it imitates the full retrieval logic of VSS. I did not study its data structures in full depth. My major goal to patch vss2git was to better handle Shared and Branched files in the VSS repo (note these words have a different meaning in Git) because otherwise most of the history would not be transferred. And the second goal was to keep the original mtimes in the commit message, because some team members are not yet ready to change their workflow and want to see the last real modification time.
To conclude:
Everyone who contributed, thanks a lot for your help and patience!
I learned what I wanted and enjoyed this conversation with you top guys, so thanks to all of you here. ^o^d Excellent!
I have a similar case, which may belong here. I wrote a search & replace perl script, which recursively searches files and replaces text in them. After replace, it restores original modification time (mtime) of file.
Interesting, that git status doesn't show replaced changes, if the mtime is same as original.
Is there a way to force git status to show changes, even if the file dates are the same?
I tried setting, in the core section: trustctime = false and checkStat = minimal. Unfortunately the change is still not detected :( It seems there isn't a way to force a fallback to content checking and completely ignore the file modification date :(
The following solutions make the changed files appear as changed again:
a) touch -m --date=01/01/1980 .git/index
So it is a touch, but only a single one, instead of touching all the files in the work dir.
b) git read-tree HEAD
Also works well.
But these solutions are just workarounds, not real permanent solutions.
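The effect of such a search & replace script can be reproduced with a short Python sketch. This is not the Perl script from the comment; the function name, file name and replacement strings are invented, and the replacement is chosen to keep the file size the same.

```python
import os
import tempfile


def replace_keeping_mtime(path: str, old: str, new: str) -> None:
    """Rewrite the file in place, then restore its original atime/mtime."""
    before = os.stat(path)
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace(old, new))
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "myfile.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("colour\n")
    before = os.stat(path)
    replace_keeping_mtime(path, "colour", "colors")  # same length on purpose
    after = os.stat(path)
    with open(path, encoding="utf-8") as f:
        content_changed = f.read() != "colour\n"
    # Size and mtime are identical before and after the edit.
    stat_unchanged = (before.st_size, before.st_mtime_ns) == (after.st_size, after.st_mtime_ns)

print(content_changed, stat_unchanged)
```

With size and mtime both unchanged, a stat-only comparison has nothing left to notice; the content differs, but only reading the file would reveal it.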
I wrote a search & replace perl script, which recursively searches files and replaces text in them. After replace, it restores original modification time (mtime) of file.
By this "restoring" of the original modification time you broke the contract: the mtime should reflect the time of the latest change. You replaced something, i.e. changed the file contents. Git expects the mtime to be adjusted in that case. By painstakingly faking it back to its original value you essentially told Git: don't worry, this file has not changed since you last saw it.
There is nothing Git can do to outguess you when you go out of your way to break the most fundamental promise of the mtime value.
@webmaster33 The issue, as I see it, is that currently Git does not expect this behaviour, and as such has no special mechanism for a forced add of a filename, with an identical mtime (etc) stat but different content, to the staging area, such that when the next commit is made, the new sha1 hash, and content, is used.
I would not expect that Git would ever want to try and detect a content change, without mtime (etc) stat change, and try to show it as 'changed', i.e. unstaged. I'd hope it could allow a forced add (but someone [you?] would have to code it) to allow these special cases where you/your code already knows that the file has changed and the update can be forced so that the 'unstaged' indication would never show itself. (note there is a separation / split of concerns, so the suggested solution changed)
The main use case here appears to be to shift data from one version control system to another, and retain a coarse mtime value held by the old system (probably as a text field, as git does not itself record it) when recreating revisions in the git system, and what is wanted is a "this is what it is, write it, blinkers-on" approach to copying the data across.
[As I type this I realise there may be some low-level blob-writing plumbing action, whose name I can't remember, that does this (e.g. git-hash-object etc.). The manual is quite a full manual, so worth a read]
My memo:
git hash-object -w
git update-index --add --cacheinfo
git write-tree
git commit-tree
From: Chapter 10 of Pro Git 2 Book
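As a small illustration of the first command in the memo, this Python sketch computes the same object id that git hash-object reports for a blob in a SHA-1 repository. It only computes the hash; unlike the -w option it does not write anything to the object database, and the function name and sample contents are made up.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Hash content the way Git names a blob: header 'blob <size>' + NUL, then the bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_sha1(b"")
print(empty_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Same size, different bytes: the ids differ even though the stat data could match.
ids_differ = git_blob_sha1(b"colour\n") != git_blob_sha1(b"colors\n")
print(ids_differ)  # True
```

Because the id depends only on the content, hashing the file and passing the id to git update-index --cacheinfo never consults the file's mtime.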
Thanks, that's useful.
Let's hope it helps the OP with the macro/script for the transfer from VSS... yeah.
Philip
I found a strange problem in msysGit, I am wondering if it's a bug. I already discussed it here: https://groups.google.com/forum/#!topic/msysgit/6XLoSPH26kc You can download my Demo.zip attachment there. It seems not to happen in Git for Linux or OS-X. The problem occurs in msysGit 1.9.0 and 1.9.5 (running x86 version on a Windows 7 x64 system).
Essentially, the problem seems to be that Git for Windows assumes that two files are the same when both the timestamp and the file size match. Apparently the file contents are not inspected and the hash is not recalculated.
As mentioned I created a minimal demo package to easily prove the issue. Simply download the zip file, put it in a clean directory and run the enclosed script from bash. Below is a transcript of what happens when I run the script in my situation.
The issue is a real show-stopper in my automated migration from Visual SourceSafe to Git (in a test transferring 20,000 source files the problem caused loss of roughly 0.1% of the files from the history), so I hope for a quick solution.