File paths in UTF-8

robotroll commented 8 years ago

Hi Roger,

I've been using this tool with great success to convert our old cvsnt repository to git. The sole problem at the moment is files with special characters, e.g. öüä, in their file names. These are not handled correctly and end up in the git repository with broken names. I saw the TODO comments in ExportWriter.scala about introducing UTF-8 encoding for the paths. I tried to implement it myself, without success. Is there any chance you could look into it?

Regards, Robin

rvaughn commented 8 years ago

Sure, I will take a look at that this week.

robotroll commented 8 years ago

Furthermore, I think I have pinpointed my problem to the deltas function in the FileParser. It does not handle the conversion correctly: it produces broken strings for the file names it extracts from the ,v file. Maybe there is a problem in your BufferedRandomAccessFile.scala. I just don't understand the byte magic that is happening there.

rvaughn commented 8 years ago

Hi Robin,

That's probably correct. FileParser deals in raw bytes and doesn't recognize extended character sets like UTF-8 by itself, so the filename strings are probably getting miscoded. FWIW, BufferedRandomAccessFile is simply a replacement file reader. I couldn't use Java's character streams because they might unintentionally interpret byte data as Unicode sequences, and Java's byte stream classes were just too slow, so I wrote my own.

robotroll commented 8 years ago

Hi Roger,

Thanks for the clarification. I started digging again and think the problem is in the conversion from bytes to string. There are several functions doing this in the FileParser class. They all take one byte and convert it using the .toChar function. This is not correct when a UTF-8 character needs to be converted, since any non-ASCII character occupies more than one byte. Therefore a character like ö, which is two bytes in UTF-8, gets converted into two separate symbols. I'll try to implement something to wrap this conversion and make it distinguish between ASCII and UTF-8 byte sequences.
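To illustrate the symptom, here is a minimal Scala sketch. It is not the FileParser code; `ToCharDemo`, `perByte` and `wholeArray` are hypothetical names, and the sketch assumes the input bytes are UTF-8. Which wrong characters come out of the per-byte path depends on how each byte is widened, but the count is wrong either way.

```scala
import java.nio.charset.StandardCharsets

object ToCharDemo {
  // Per-byte conversion, as described in the thread: every byte
  // becomes its own Char, so a multi-byte UTF-8 sequence is split up.
  def perByte(bytes: Array[Byte]): String =
    bytes.map(_.toChar).mkString

  // Decoding the whole array at once keeps multi-byte sequences intact.
  // UTF-8 is an assumption about the input here.
  def wholeArray(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    // "\u00f6" is ö; it encodes to two bytes in UTF-8.
    val raw = "\u00f6.txt".getBytes(StandardCharsets.UTF_8)
    println(perByte(raw).length)    // one Char per byte
    println(wholeArray(raw).length) // one Char per character
  }
}
```

The per-byte path yields one symbol per byte, which matches the broken file names seen in the converted repository.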

Until then, merry Christmas and a happy new year to you and your family. Robin

rvaughn commented 8 years ago

The per-byte handling is intentional and necessary in most cases. However, toChar will indeed screw up multi-byte sequences, you are correct about that. The thing is, there are very few cases where it matters. Most of the CVS archive data is guaranteed to be base ASCII, so it doesn't matter for those. Filenames were originally supposed to be base ASCII as well, but I think time and the CVSNT team kind of bent those rules.

The proper way to handle it is to capture the bytes in a byte array instead of a string, and then convert the array to a string when complete. The problem with this approach is that we don't always know the input encoding, and have to tell Java explicitly which one to use. We can simply assume UTF-8 for your case though.
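A rough sketch of that approach, for illustration only: collect the raw bytes first and decode once at the end. `ByteCapture` and `readToken` are hypothetical names, not functions from this repository. The `@` terminator and the UTF-8 default are assumptions, and the sketch ignores details such as escaped delimiters inside the value.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.{Charset, StandardCharsets}

object ByteCapture {
  // Hypothetical helper: consume bytes up to (and including) the
  // terminator, buffering them undecoded, then decode in a single step.
  def readToken(input: Iterator[Byte],
                terminator: Byte,
                charset: Charset = StandardCharsets.UTF_8): String = {
    val buf = new ByteArrayOutputStream()
    var done = false
    while (!done && input.hasNext) {
      val b = input.next()
      if (b == terminator) done = true
      else buf.write(b.toInt) // write(int) stores only the low 8 bits
    }
    // The charset must be chosen explicitly; UTF-8 is assumed by default.
    new String(buf.toByteArray, charset)
  }
}
```

Taking the charset as a parameter keeps the UTF-8 assumption in one place, so a repository known to use a different encoding could pass, for example, `StandardCharsets.ISO_8859_1` instead.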

See the implementation for comment(), especially the rawLines() and decode() functions. These do the proper handling for UTF-8 comments. The fix in this case is to rewrite the string() and value() functions similarly, so I'll do that. The other uses of toChar should be safe.

robotroll commented 8 years ago

Working like a charm. Thanks!

rvaughn / cvs-fast-export

File paths in UTF-8 #7