Open SonOfLilit opened 8 years ago
Thanks for reporting this @SonOfLilit. For daff.py, a hack to make this work is to edit it by hand, replacing codecs.open(path,"r","utf-8") with codecs.open(path,"r","iso-8859-1"). With that change, I see a diff of:
@@,a,b
→, à,á→â
You may need to change more if you want the diff itself to be produced in the same encoding rather than utf-8.
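For anyone trying the hand edit above, here is a minimal sketch of the same idea with the encoding pulled out into an argument rather than hard-coded. This is not daff's actual code: the helper name `read_text` and its `encoding` parameter are made up for illustration; only the `codecs.open` call comes from the comment above.

```python
import codecs


def read_text(path, encoding="utf-8"):
    """Read a whole file as text, decoding it with the given encoding.

    Hypothetical helper, not part of daff: it only shows the
    codecs.open call from the hand edit with a configurable encoding.
    """
    with codecs.open(path, "r", encoding) as f:
        return f.read()
```

Calling `read_text("table.csv", "iso-8859-1")` (file name assumed) would then correspond to the hand-edited version, while the default keeps the utf-8 behaviour.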
How ideally should this work? A parameter specifying encoding? An attempt at autodetection?
param should be best, can't rely on what the file says as you can have latin1 in a utf8 file 👎
I guess you could use auto-detection as a default, but will need something to be able to specify when things are crazy.
Ideally there should be a cmd parameter because some poor people need to use utf16, which can't be made sense of without very special treatment.
But more importantly, default behavior should be to work on raw, undecoded bytes. As long as you never try to split cell contents (e.g. you must output "[abc->aBc]" and not "a[b->B]c", which might split a character in the middle in utf8), every other encoding I'm aware of would work just fine, including utf8, DOS codepages, ISO codepages and Windows codepages (I must admit I have no idea how pre-Unicode Chinese/Japanese codepages work, but they would probably be fine too).
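To make the raw-bytes suggestion concrete, here is a rough sketch of comparing rows without decoding them. It assumes a naive split on an ASCII separator with no CSV quoting support, and the function name `changed_cells` is invented for this example; it is not how daff works internally.

```python
def changed_cells(old_row: bytes, new_row: bytes, sep: bytes = b","):
    """Compare two rows cell by cell without decoding them.

    Assumptions: naive split on the separator (no quoting support),
    and both rows have the same number of cells (zip stops at the
    shorter row). Cells are only compared and reported whole, so a
    multi-byte character is never cut in half.
    """
    old_cells = old_row.split(sep)
    new_cells = new_row.split(sep)
    return [
        (index, old, new)
        for index, (old, new) in enumerate(zip(old_cells, new_cells))
        if old != new
    ]


# The same rows in two encodings: the changed cell is reported whole each time.
for enc in ("iso-8859-1", "utf-8"):
    print(enc, changed_cells("à,á".encode(enc), "à,â".encode(enc)))
```

The output bytes stay in whatever encoding the input used, so nothing needs to know what that encoding is, as long as the separator byte cannot appear inside a character.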
Ok, sounds like a parameter is important since there'll always be those who need it.
I'm not sure I can completely avoid touching cell contents. There are options for whitespace-insensitive and case-insensitive diffs, for example. These obviously get wacky in the general case, but people want them for the common special case of plain old ASCII. Would auto-detection via delegation to e.g. chardet [1] in Python be adequate, do you think, @SonOfLilit?
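A sketch of how "parameter first, auto-detection as the default" could fit together, assuming chardet is installed (it is a third-party package, and the code falls back to plain guesses if it is missing). The function name `decode_table` and the fallback order are assumptions for illustration, not anything daff does.

```python
from typing import Optional


def decode_table(raw: bytes, encoding: Optional[str] = None) -> str:
    """Decode file contents, preferring an explicitly given encoding.

    Hypothetical helper: an explicit encoding always wins; otherwise
    chardet's guess is tried (if chardet is importable), then utf-8,
    then iso-8859-1, which accepts any byte sequence.
    """
    if encoding is not None:
        return raw.decode(encoding)

    try:
        import chardet  # third-party; optional in this sketch
    except ImportError:
        chardet = None

    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]  # may be None
        if guess:
            try:
                return raw.decode(guess)
            except (UnicodeDecodeError, LookupError):
                pass  # wrong or unknown guess: fall through

    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")
```

Detection is a statistical guess and can be wrong on short or mixed files, which is why the explicit argument takes priority here.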
As long as you're only touching characters that are ASCII (commas, double quotes, tabs, spaces) you should be fine with all the encodings I listed as not needing a parameter - the reason they don't is that they only differ in the non-ASCII code points.
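As a rough illustration of the point above, Python's `bytes.lower()` and `bytes.strip()` only act on ASCII letters and ASCII whitespace, so case- and whitespace-insensitive comparison can be sketched on undecoded bytes. The name `normalize_cell` and its flags are hypothetical, and "whitespace-insensitive" is taken here to mean ignoring leading and trailing whitespace only. This relies on bytes below 0x80 always meaning ASCII, which holds for utf8 and the single-byte codepages listed; some multi-byte legacy encodings reuse ASCII-range bytes inside characters, and this would not be safe there.

```python
def normalize_cell(cell: bytes, ignore_case: bool = False,
                   ignore_whitespace: bool = False) -> bytes:
    """Normalize a raw cell for comparison, touching only ASCII bytes.

    Hypothetical helper: bytes.strip() removes ASCII whitespace only
    and bytes.lower() lowercases ASCII letters only, so bytes >= 0x80
    pass through unchanged.
    """
    if ignore_whitespace:
        cell = cell.strip()
    if ignore_case:
        cell = cell.lower()
    return cell


def cells_equal(a: bytes, b: bytes, **options) -> bool:
    """Compare two raw cells after the same normalization."""
    return normalize_cell(a, **options) == normalize_cell(b, **options)
```

Non-ASCII letters such as "À" and "à" would still compare as different under `ignore_case`, which matches the "plain old ascii" special case mentioned earlier in the thread.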
[1] https://github.com/chardet/chardet
On Windows, tried both with cmd and a git bash shell:
csv_windows-1255.zip
Of course, the reason I care is that Excel works notoriously badly with utf8 csvs, so my git repository is full of csvs in other encodings, and I can't convert them as part of git diff...
P.S. does anyone here know why git would accept my .gitattributes entry for *.tsv but would silently ignore the identical entry for *.csv?