squell / id3

ID3 mass tagger
https://squell.github.io/id3

id3 0.80 does not correctly handle non-English characters. #6

Open WonderRat opened 8 years ago

WonderRat commented 8 years ago

Windows XP SP3

The Russian text in the ID3v1 tag is encoded in CP1251, but id3 shows nonsense (I expect output in CP866, the Russian OEM code page):

>id3 -q "%t\n%a\n%l\n%c" russian1.mp3
AAAAAA?CEEEEIIII
?NOOOOO?OUUUUY??
aaaaaa?ceeeeiiii
?nooooo?ouuuuy?y??

I suspect the problem is in charconv.cpp, in "template<> conv<>::data conv::decode(const char* s, size_t len)".
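For illustration, here is a minimal sketch of what this output suggests. It assumes the ID3v1 bytes are CP1251 and are being decoded as ISO-8859-1; the sample bytes are made up for the example and are not taken from russian1.mp3.

```python
# Bytes 0xC0..0xCF are the letters А..П in CP1251,
# but accented Latin letters in ISO-8859-1.
raw = bytes(range(0xC0, 0xD0))

as_latin1 = raw.decode("iso-8859-1")  # what a Latin-1 decoder produces
as_cp1251 = raw.decode("cp1251")      # what the tag author intended

print(as_latin1)  # ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
print(as_cp1251)  # АБВГДЕЖЗИЙКЛМНОП
```

Dropping the accents from the first line gives roughly the AAAAAA?CEEEEIIII pattern shown above, which would be consistent with a Latin-1 decode followed by a fallback to plain ASCII.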


Strings from ID3v2 (Russian text in Unicode) are printed in the wrong code page:

E:\>id3 -q "%t\n%a\n%l\n%c" russian2.mp3
└┴┬├─┼╞╟╚╔╩╦╠═╬╧
╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
рстуфхцчшщъыьэюя
ЁёЄєЇїЎў°∙·√№¤■ и╕

This is CP1251 output displayed as CP866.
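The mix-up can be reproduced outside id3. A small sketch, assuming the bytes written to the console are CP1251 and the console interprets them as CP866:

```python
expected = "АБВГДЕЖЗИЙКЛМНОП"

# Encode as CP1251 (what appears to be written), then read the same
# bytes back as CP866 (what the console assumes).
shown = expected.encode("cp1251").decode("cp866")

print(shown)  # └┴┬├─┼╞╟╚╔╩╦╠═╬╧
```

The result matches the first line of the russian2.mp3 output above.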

If I change the console code page to 1251 and recode the output from CP1251 to CP866, the text is correct:

>chcp 1251
>id3 -q "%t\n%a\n%l\n%c" russian2.mp3 | iconv -f CP1251 -t CP866

http://i.imgur.com/6Fe7LO2.png

АБВГДЕЖЗИЙКЛМНОП
РСТУФХЦЧШЩЪЫЬЭЮЯ
абвгдежзийклмноп
рстуфхцчшщъыьэюяЁё
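The iconv step in that pipeline can be sketched as follows. `recode` is a hypothetical helper for illustration only, not part of id3:

```python
def recode(data: bytes, src: str = "cp1251", dst: str = "cp866") -> bytes:
    """Re-encode bytes from src to dst, replacing anything unmappable."""
    text = data.decode(src, errors="replace")
    return text.encode(dst, errors="replace")

ansi_bytes = "АБВ".encode("cp1251")  # bytes 0xC0 0xC1 0xC2
oem_bytes = recode(ansi_bytes)       # bytes 0x80 0x81 0x82
print(oem_bytes.decode("cp866"))     # АБВ
```

Both code pages cover the Russian alphabet, so for plain Russian text the conversion loses nothing.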

samples.zip

russian2_1251.txt, russian2_correct_866.txt - redirected output

squell commented 8 years ago

Have you tried switching your terminal to a Truetype font (e.g. Lucida Console, or Consolas)? That should correct the output in the ID3v2 case.

WonderRat commented 8 years ago

Information contained in ID3v1 is treated by id3 as encoded in ISO-8859-1 (a subset of cp1252).

ISO-8859-1 and cp1252 don't have Cyrillic letters, so all Russian strings in ID3v1 are written in cp1251 (yes, these are old MP3s, but they still turn up). Why not treat ID3v1 as ANSI? English users will not suffer from that: their ANSI code page will be 1252. WinAPI has the alias CP_ACP (0x0) for this; the real code page depends on the locale settings. https://msdn.microsoft.com/en-us/library/dd374130%28v=vs.85%29.aspx My players and tag editors treat these tags as ANSI.
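A rough sketch of the proposal, not of how id3 works: decode the ID3v1 text fields with a configurable encoding instead of a fixed ISO-8859-1. `read_id3v1` is a hypothetical function, and `locale.getpreferredencoding` is used here only as a stand-in for the ANSI code page (CP_ACP).

```python
import locale

def read_id3v1(tag: bytes, encoding=None):
    """Decode the text fields of a 128-byte ID3v1 tag.

    `encoding` defaults to the locale's preferred encoding, an
    assumed stand-in for the Windows ANSI code page.
    """
    if len(tag) != 128 or tag[:3] != b"TAG":
        raise ValueError("not an ID3v1 tag")
    if encoding is None:
        encoding = locale.getpreferredencoding(False)

    def text(start, length):
        # Fields are fixed-width, padded with NULs or spaces.
        field = tag[start:start + length]
        field = field.split(b"\x00", 1)[0].rstrip(b" ")
        return field.decode(encoding, errors="replace")

    return {
        "title": text(3, 30),
        "artist": text(33, 30),
        "album": text(63, 30),
        "year": text(93, 4),
        "comment": text(97, 30),
    }

# Synthetic tag whose title is CP1251-encoded (not from samples.zip).
title = "Привет".encode("cp1251").ljust(30, b"\x00")
tag = b"TAG" + title + b"\x00" * 94 + b"\xff"
print(read_id3v1(tag, encoding="cp1251")["title"])  # Привет
```

On a Russian Windows system the default would resolve to cp1251 and on a Western one to cp1252, which is the behaviour described above. On Windows, Python's "mbcs" codec refers to the ANSI code page directly.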

"Have you tried switching your terminal to a Truetype font"

It works, but I like my raster font (a modified 8x16, not the one in the screenshot). I don't like Lucida Console or Consolas as a console font, and I don't need to display all Unicode symbols in the console. I thought Windows console programs should use the OEM code page in the first place (because it is the default), like DOS programs. Maybe a recoding option on the command line?

squell commented 8 years ago
  1. I do recognize that a switch to make id3 bug-compatible with other software's ID3v1 handling might be useful, so I'll consider adding this; but probably only for reading/converting tags.
  2. I think the default nowadays is to write a "Unicode" application, which I'm working on (as soon as I have time available again), but those again require a Truetype-font.

In the end, I want id3 to work on Windows as it does on Linux/BSD:

C:\> id3 file.mp3
File: file.mp3
Metadata: ID3v2.3
Title: Something from Japan
Artist: 日本語

Until then, using the ANSI codepage makes more sense to me: if I redirect the output of id3 to a file, I expect to be able to read it using notepad. Commandline arguments are encoded in the ANSI codepage, as is the filesystem, etc. The OEM codepage to me is a relic from the Win3.x/Win9x days (which relied on DOS for its console); AFAICT it is only really necessary if you use the console full-screen.

So, I am going to finish the Unicode-build first; then we'll see how that functions in a console with a non-Truetype font. But supporting that is really low on my priority list.