wchristian closed 10 years ago
Info from #32:
This appears to become a problem (particularly with diffs and merges) when people are using the GitHub editor...?
It's a problem with anything. If you load the file as UTF-8 in any software, the non-ASCII Latin-1 characters get combined with the following ASCII characters to form invalid UTF-8 sequences; if you load it as Latin-1, the multi-byte UTF-8 characters get split up into pairs of unrelated Latin-1 characters.
It simply becomes more evident when you try to save the resulting mess into the file again.
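To make the two failure modes concrete, here is a small Python sketch. The sample word and all variable names are made up for illustration and are not taken from the repository; how the damage is displayed also varies between programs, so treat this as one example of the effect rather than what every editor does.

```python
# Illustrative only: the sample word and names are assumptions.
word = "für"
latin1_bytes = word.encode("latin-1")  # 3 bytes, ü is the single byte 0xFC
utf8_bytes = word.encode("utf-8")      # 4 bytes, ü is the pair 0xC3 0xBC

# UTF-8 bytes read as Latin-1: every byte is valid Latin-1, so decoding
# never fails, but the two-byte ü turns into two unrelated characters.
misread_as_latin1 = utf8_bytes.decode("latin-1")

# Latin-1 bytes read as UTF-8: the lone 0xFC is not valid UTF-8,
# so a strict decoder refuses the input.
try:
    latin1_bytes.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

# A lenient decoder substitutes U+FFFD instead, which destroys the
# original byte if the result is saved back to the file.
misread_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(repr(misread_as_latin1), strict_utf8_ok, repr(misread_as_utf8))
```

Either misreading is harmless while the file is only being viewed; the damage becomes permanent once the misdecoded text is written back out.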
Thanks, @wchristian!
This appears to be fixed by @bk2204's 6a83e35.
Some of your source data is encoded in UTF-8, some in Latin-1 (most noticeable with German umlauts); however, those emails don't seem to have headers to indicate either type. This leads to the csv/md files being a mix of both encodings.
The generation script needs to analyze the input data and make a best-effort guess at what encoding it is in.
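One common way to make such a guess, sketched here in Python, is to try strict UTF-8 first and fall back to Latin-1 only when that fails. This works reasonably well because non-ASCII Latin-1 text is rarely valid UTF-8 by accident. The function name `decode_best_effort` is hypothetical, and this is not necessarily how the actual generation script works or how the fix was implemented.

```python
from typing import Tuple


def decode_best_effort(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for one message body.

    Hypothetical helper: assumes the input is either UTF-8 or Latin-1,
    as described in this issue, and nothing else.
    """
    try:
        # Strict decoding rejects any byte sequence that is not valid UTF-8.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    for sample in ("Grüße".encode("utf-8"), "Grüße".encode("latin-1")):
        text, used = decode_best_effort(sample)
        # Always write the result back out in one encoding (UTF-8 here).
        print(used, text.encode("utf-8"))
```

Decoding each input separately and then writing everything out as UTF-8 would keep the generated csv/md files in a single encoding. Pure ASCII input decodes identically either way, so it is simply reported as UTF-8.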