Mismatched EOL content to svn:eol-style attribute detection and in-conversion fix

ned14 commented 11 years ago

Note: ignore my deletion of git_push.cmake, this was to stop the script trying to upload the generated repos.

Note that conversion performance is significantly impaired by this improvement. I have cached precompiled regexs converted from the globs parsed from the gitattributes, but in the end we're adding ~90 regexs per file converted and it is not swift. I suppose one could fold all svneol=native attributes into a single regex which orrs the comparisons, but I am unconvinced it would be significantly faster.

Our next problem is that early Boost commits did not set svn:eol-style despite adding text which could contain any variant of CRLF or LF, including both EOL types in the same file. I rewrote much of the original C++ to use C's memchr() in order to try and improve on performance here, but ultimately we end up having to run the EOL converting pass for all text without a svn:eol-style specified just in case. Again, this is not swift as it involves having to append chunks into a realloc()'d block as git's fast-import needs to know the exact size of the file being added before being added, and one cannot precalculate this.

I added some warnings printing which files are incorrect EOL repaired. These include all the files submitted as faulty in the original bug report, plus many others we didn't know about before. They are listed at https://gist.github.com/ned14/6420324 for reference. Note that later commits may have fixed up bad EOL in earlier commits, either way we fixup during commit replay so all old commits should be now git correct.

Let me know if you have any questions.

Niall

dabrahams commented 11 years ago

One question and one request: Q1: do you know the source of the slowdown, or do you just have to guess? R1: Please rewrite your branch without the deletion of git_push.cmake so I can just merge the pull request directly And, Thank You!

ned14 commented 11 years ago

R1: Done. I would check the git repos which are output to make sure they are sane using diff or something. I believe they will be, but much of the fast import implementation logic has been changed with this patch. Better be safe than sorry.

Q1: If you disable the regex matching, performance seems to return, but then without regex matching it will never EOL convert either. I have the general suspicion that libstdc++'s regex is generally significantly slower than other STLs e.g. Dinkumware's or libc++. I tried swapping in Boost's regex, but didn't see a huge difference.

You could refactor the conversion to work in batches which lets the regex code stay in L1. Or much easier you could use OpenMP to parallelise writing as git fast-import is thread safe I believe.

dabrahams commented 11 years ago

R1: That's not what I asked for. It's OK; I can take care of it myself

Q1: But surely this isn't bound by the cost of regex matching. It's almost inconceivable to me that even taking an extra (read) pass through the text of most files causes much of a slowdown. I guess I'll have a quick look through your changes.

ned14 commented 11 years ago

R1: Git history preserves when SHAs match and no garbage collect has happened, so no history is lost as you can see at https://github.com/ned14/Boost2Git/commits/bad_eol_fix/git_push.cmake. I guess it does gain two commits from me, fair enough.

Q1: Do bear in mind my benchmarks are entirely based on Niall's amount of web browsing doable while it does a conversion i.e. somewhat subjective.

dabrahams commented 11 years ago

Your code doesn't work for me; see https://gist.github.com/dabrahams/6433187
I fixed a lot of things in your code that would prevent me from releasing it as is. Please consider doing further work atop https://github.com/ryppl/Boost2Git/tree/ned14-bad_eol_fix. It should look very familiar.

ned14 commented 11 years ago

On item 2, no problem, though my debug printing may well be very useful for diagnosing item 1. I will port any fixes to a branch based on your changes.

On item 1, I am now feeling very glad I added that sanity detection of ascii 0 in supposed text. The problem is that this does not occur on my svnsynced clone of the boost subversion repository, so I will not be able to repeat the problem.

Can I suggest this: can you tar up your copy of the boost subversion repo and put it somewhere like Google Drive so I can grab a copy? svnsync is known to not make exact copies, so I fear it may be that.

dabrahams commented 11 years ago

on Wed Sep 04 2013, Niall Douglas wrote:

On item 2, no problem, though my debug printing may well be very useful for diagnosing item 1.

If the diagnostics are generally useful, you should use the program's logging facilities (search for uses of "Log::").

I will port any fixes to a branch based on your changes.

Thanks.

On item 1, I am now feeling very glad I added that sanity detection of ascii 0 in supposed text. The problem is that this does not occur on my svnsynced clone of the boost subversion repository, so I will not be able to repeat the problem.

Can I suggest this: can you tar up your copy of the boost subversion repo and put it somewhere like Google Drive so I can grab a copy? svnsync is known to not make exact copies, so I fear it may be that.

OK, working on it, using 7z

dabrahams commented 11 years ago

Uploading 842M to Google Drive is gonna take a few minutes…

purpleKarrot commented 11 years ago

svnsync is known to not make exact copies, so I fear it may be that.

Can you elaborate on that? Doesn't the jenkins build use a mirror of the Boost repository that was created with svnsync?

ned14 commented 11 years ago

@dabrahams Thanks for the image. I'll set it downloading here, but as today is my first, uninterrupted day in over a week, I really want to get moving on Boost.AFIO today. We have a deadline soon! I'm also half way through a substantial refactor to bypass Win32 and use the NT kernel directly, and rebooting into Linux to look into Boost2Git would lose me state. Once the current refactor is done to the point I can commit something compiling I'll return to fixing Boost2Git. Hope this is okay.

@purpleKarrot I read that somewhere which I can't refind now. It said that the only way to make an absolutely perfect copy was svnadmin dump (or svnrdump from 1.7). I don't think any differences are material, but certainly svnsync effectively replays commits and therefore applies the commit logic from new subversion rather than the older subversion which did the commit back in history.

ned14 commented 11 years ago

Good news: your boost-trunk dump replicates the same error you had here. I'll get to work so.

ned14 commented 11 years ago

Ok, the problem is indeed XML committed as UTF-16. If you examine https://gist.github.com/ned14/6467645 where I added a print of every character in the offending file, but with buffer[x]<32 ? '?', it's extremely obviously UTF-16.

git is well known to not like text being in any format but UTF-8, and if you really must store non-UTF-8 text in git, one does so as a binary blob.

How would you like me to proceed? An automatic downconversion of UTF-16 to UTF-8 for all text looks sensible.

Niall

ghost commented 11 years ago

I think we need to keep the bytes intact. They weren't breaking anything before we tried to fix line endings, so I seriously doubt Git is choking on otherwise-correct data. Is your code unconditionally treating everything as UTF8? You could detect invalid UTF8 and simply leave the bytes alone in that case.

Sent from my moss-covered three-handled family gradunza

On Sep 6, 2013, at 11:14 AM, Niall Douglas notifications@github.com wrote:

Ok, the problem is indeed XML committed as UTF-16. If you examine https://gist.github.com/ned14/6467645 where I added a print of every character in the offending file, but with buffer[x]<32 ? '?', it's extremely obviously UTF-16.

git is well known to not like text being in any format but UTF-8, and if you really must store non-UTF-8 text in git, one does so as a binary blob.

How would you like me to proceed? An automatic downconversion of UTF-16 to UTF-8 for all text looks sensible.

Niall

— Reply to this email directly or view it on GitHub.

ned14 commented 11 years ago

My code doesn't understand text, it simply removes all ASCII 13s from anything which .gitattributes says ought to be text. As that had obvious potential to silently corrupt data, one of the sanity checks I added was for UTF-16 which correctly tripped.

You're right that no one has yet complained about bad EOLs in UTF-16. I suspect it only affects early Boost commits anyway, but be aware that probably 99% of UTF-16 text was generated on Windows and therefore will be CRLF. As git detects binaries via scanning the first 8Kb for zeros, it ought to ignore .gitattributes and treat UTF-16 text as binary.

I'll change my sanity check for zeros response to disable EOL changes. That should skip processing UTF-16 text. I'll be in touch soon.

Niall

ryppl / Boost2Git

Mismatched EOL content to svn:eol-style attribute detection and in-conversion fix #41