Open rastaman111 opened 1 week ago
Could you share a file that exhibits the problem?
Thank you for the test case. Was the ID3 tag in the file generated using LAME also?
The problem seems to be one of text encoding. ID3v1 tags nominally use the ISO 8859-1 charset, although sometimes the machine's local encoding is used instead, such as Windows-1251, which appears to be the correct encoding for this particular ID3v1 tag. ID3v2 text frames declare their encoding explicitly: ISO 8859-1 or UTF-16 in ID3v2.3, with UTF-16BE and UTF-8 added in ID3v2.4. It seems the ID3v2 tag in this file is not encoded using any of those but rather a different character set, most likely Windows-1251, the same as the ID3v1 tag.
Take the "word" Þðèé (should be Юрий) from the TPE1 frame. The octets have the following hex values and map to these characters in ISO 8859-1 and Windows-1251:
ISO 8859-1 | Hex value | Windows-1251 |
---|---|---|
Þ | 0xDE | Ю |
ð | 0xF0 | р |
è | 0xE8 | и |
é | 0xE9 | й |
The octet values interpreted as ISO 8859-1 give Þðèé, while interpreted as Windows-1251 they give Юрий. They are not a valid UTF-8 sequence at all. So it seems that the text in both the ID3v1 and ID3v2 tags in this file is incorrectly encoded.
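The difference is easy to reproduce. Here is a minimal Python sketch that decodes the same four octets with each charset; the variable names are illustrative only and nothing here comes from TagLib or SFBAudioEngine.

```python
# The four octets at the start of the artist text.
octets = bytes([0xDE, 0xF0, 0xE8, 0xE9])

# ISO 8859-1 maps every byte straight to U+0000..U+00FF.
as_latin1 = octets.decode("latin-1")
# Windows-1251 maps 0xC0..0xFF to the Cyrillic letters.
as_cp1251 = octets.decode("cp1251")

print(as_latin1)  # Þðèé
print(as_cp1251)  # Юрий

# The same octets are not valid UTF-8: 0xDE opens a two-byte sequence,
# but 0xF0 is not a continuation byte.
try:
    octets.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False
```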
It's strange that Apple's native player recognizes the text without any trouble, and so does Google Translate.
I'll try to search for similar files and let you know the result
That is interesting. I will take a closer look at the file's tag to make sure it is being handled correctly. I've heard of charset detection for ID3v1 tags but for ID3v2 I don't think there should be any guessing involved.
Apple Music says it's version 3
I also ran it through several libraries and they all say that it is version 3
It is an ID3v2.3 tag. The TPE1 frame for example contains the following bytes:
Field | Hex Bytes | Meaning |
---|---|---|
Frame ID | 54 50 45 31 | TPE1 |
Size | 00 00 00 0A | 10 |
Flags | 00 00 | |
Text Encoding | 00 | ISO 8859-1 |
Information | DE F0 E8 E9 20 CB EE E7 E0 | |
It's possible that Music runs text reported as ISO 8859-1 through a character detection library. Based on the ID3v2 tag itself, TagLib (the metadata library used by SFBAudioEngine) is interpreting the data correctly.
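To make the frame layout concrete, here is a rough Python sketch that splits the bytes from the table above into header, encoding byte, and text. It assumes the ID3v2.3 layout (4-byte ID, 4-byte big-endian size, 2 flag bytes); `parse_text_frame` is a hypothetical helper written for this illustration, not a TagLib function.

```python
import struct

# Bytes of the TPE1 frame as listed in the table above.
frame = bytes.fromhex("54504531 0000000A 0000 00 DEF0E8E920CBEEE7E0")

def parse_text_frame(data: bytes):
    """Split an ID3v2.3 text frame into (frame_id, encoding_byte, raw_text)."""
    # 10-byte header: 4-byte ID, 4-byte big-endian size, 2 bytes of flags.
    frame_id, size, _flags = struct.unpack(">4sIH", data[:10])
    body = data[10:10 + size]
    # The first body byte is the text encoding marker; the rest is the text.
    return frame_id.decode("ascii"), body[0], body[1:]

frame_id, encoding_byte, raw_text = parse_text_frame(frame)

declared = raw_text.decode("latin-1")  # what the tag claims (encoding byte 0x00)
likely = raw_text.decode("cp1251")     # what the tagging software probably wrote

print(frame_id, encoding_byte)  # TPE1 0
print(declared)                 # Þðèé Ëîçà
print(likely)                   # Юрий Лоза
```

A reader that honors the encoding byte has no choice but to produce the first string, which is why the tag itself, rather than the parsing, looks like the source of the problem.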
It shouldn't be terribly hard to wrap uchardet to add the option for character set detection for ID3v1 or ID3v2 tags using ISO 8859-1, but I haven't investigated what it would entail.
I have the following question. How can I understand from the text what encoding it has and show the user the appropriate text?
Algorithms for character set detection are something I know little to nothing about. Perhaps an educated guess is made based on a frequency analysis of octets in the input?
For the file that you shared it should be possible to feed the C strings from the metadata to uchardet or a similar library and see what it comes back with, and then use iconv to convert to UTF-8.
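As a rough illustration of that detect-then-convert flow, here is a Python sketch using only the standard library. The heuristic is a deliberately crude stand-in and is not how uchardet works: it assumes the only candidates are UTF-8, Windows-1251, and ISO 8859-1, and the 0.5 threshold is an arbitrary assumption. `guess_encoding` and `to_utf8` are hypothetical names.

```python
def guess_encoding(raw: bytes) -> str:
    """Very rough guess between UTF-8, Windows-1251 and ISO 8859-1."""
    try:
        raw.decode("utf-8")
        return "utf-8"  # valid UTF-8 (including plain ASCII) is taken at face value
    except UnicodeDecodeError:
        pass
    high = sum(1 for b in raw if b >= 0x80)
    ascii_letters = sum(1 for b in raw if b < 0x80 and chr(b).isalpha())
    # Russian text in Windows-1251 is almost entirely high bytes, while
    # Western European ISO 8859-1 text is mostly ASCII letters with the
    # occasional accented character.
    if high / (high + ascii_letters) > 0.5:
        return "cp1251"
    return "latin-1"

def to_utf8(raw: bytes) -> bytes:
    """Decode with the guessed charset and re-encode as UTF-8 (the iconv step)."""
    return raw.decode(guess_encoding(raw)).encode("utf-8")

artist = bytes.fromhex("DEF0E8E920CBEEE7E0")
print(guess_encoding(artist))           # cp1251
print(to_utf8(artist).decode("utf-8"))  # Юрий Лоза
print(guess_encoding(b"caf\xe9"))       # latin-1
```

A real detector weighs far more evidence than this, so short strings and mixed-language text will still be misjudged by a heuristic of this kind.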
This problem is not limited to this file. I found a large number of such files. The native Files and Music applications display the data as expected, but using the standard API does not lead to the desired result :(
Hello. My file was encoded with LAME3.93. How can I get the metadata through your library without the garbled characters ("hieroglyphs")?
I get the following data: "Ñòî ÷àñîâ" and "Þðèé Ëîçà"
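For strings that have already come back in this form, one possible repair is to reverse the wrong decoding step. Here is a minimal Python sketch, assuming the text was originally Windows-1251 and was read as ISO 8859-1; `repair_cp1251_mojibake` is a hypothetical helper, not part of any library mentioned in this thread.

```python
def repair_cp1251_mojibake(text: str) -> str:
    """Undo a Windows-1251 string that was decoded as ISO 8859-1."""
    try:
        # Recover the original bytes, then decode them with the right charset.
        return text.encode("latin-1").decode("cp1251")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text has characters outside ISO 8859-1, or bytes that are
        # undefined in Windows-1251: leave it untouched.
        return text

title = repair_cp1251_mojibake("Ñòî ÷àñîâ")
artist = repair_cp1251_mojibake("Þðèé Ëîçà")
print(title)   # Сто часов
print(artist)  # Юрий Лоза
```

Applied blindly this would also mangle genuine ISO 8859-1 text (for example, "café" would lose its accented letter), so it only makes sense once some detection step has indicated that the text is really Windows-1251.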