Garbled textual metadata

sbooth / SFBAudioEngine

A powerhouse of audio functionality for macOS and iOS

https://sbooth.github.io/SFBAudioEngine/

MIT License


Garbled textual metadata #365

Open rastaman111 opened 1 week ago

rastaman111 commented 1 week ago

Hello. My file was encoded with LAME 3.93. How can I read its metadata through your library without getting hieroglyphs?

I get the following data: "Ñòî ÷àñîâ" and "Þðèé Ëîçà"

sbooth commented 1 week ago

Could you share a file that exhibits the problem?

rastaman111 commented 1 week ago

Юрий Лоза-Сто часов.mp3.zip

sbooth commented 6 days ago

Thank you for the test case. Was the ID3 tag in the file generated using LAME also?

The problem seems to be one of text encoding. ID3v1 tags nominally use the ISO 8859-1 charset, although sometimes the machine's local encoding is used instead, such as Windows-1251, which appears to be the correct encoding for this particular ID3v1 tag. ID3v2 text frames declare their encoding explicitly: ISO 8859-1 or a Unicode encoding (UTF-16, or UTF-8 in ID3v2.4). It seems the ID3v2 tag in this file is not encoded the way it declares but rather in a different character set, most likely Windows-1251, the same as the ID3v1 tag.
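To make the ID3v1 side concrete, here is a minimal sketch of decoding the fixed 128-byte ID3v1 layout with a caller-chosen charset. The function name `read_id3v1` and the synthetic `sample` tag are hypothetical and built only for illustration; this is not how SFBAudioEngine or TagLib read the tag.

```python
import struct

def read_id3v1(tag: bytes, encoding: str = "latin-1") -> dict:
    """Decode title/artist/album from a 128-byte ID3v1 tag using the given charset."""
    if len(tag) != 128 or tag[:3] != b"TAG":
        raise ValueError("not an ID3v1 tag")
    # Fixed layout: "TAG", title(30), artist(30), album(30), year(4), comment(30), genre(1)
    _, title, artist, album, _year, _comment, _genre = struct.unpack(
        "<3s30s30s30s4s30sB", tag
    )

    def text(field: bytes) -> str:
        # Fields are padded with NULs (or spaces); cut at the first NUL.
        return field.split(b"\x00", 1)[0].decode(encoding).rstrip()

    return {"title": text(title), "artist": text(artist), "album": text(album)}

# Synthetic tag written the way a Windows-1251 machine would have written it.
sample = (
    b"TAG"
    + "Сто часов".encode("cp1251").ljust(30, b"\x00")
    + "Юрий Лоза".encode("cp1251").ljust(30, b"\x00")
    + bytes(30)      # album (empty)
    + b"    "        # year
    + bytes(30)      # comment
    + bytes([255])   # genre
)

print(read_id3v1(sample))            # default ISO 8859-1 reading: garbled
print(read_id3v1(sample, "cp1251"))  # Windows-1251 reading: Cyrillic
```

The tag itself carries no charset marker, so the `encoding` argument has to come from the caller or from some detection step.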

Take the "word" Þðèé (should be Юрий) from the TPE1 frame. Interpreted as ISO 8859-1 and as Windows-1251, its octets give the following characters:

ISO 8859-1	Hex value	Windows-1251
Þ	0xDE	Ю
ð	0xF0	р
è	0xE8	и
é	0xE9	й

The octet values interpreted using ISO 8859-1 give Þðèé, while interpreted using Windows-1251 they give Юрий. So it seems that the text in both the ID3v1 and ID3v2 tags in this file is incorrectly encoded.
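The same comparison can be reproduced with Python's built-in codecs. This is only an illustrative sketch; the helper name `repair` is hypothetical, and it assumes the garbled string came from reading Windows-1251 bytes as single-byte ISO 8859-1 text.

```python
# The four octets from the table above.
raw = bytes([0xDE, 0xF0, 0xE8, 0xE9])

as_latin1 = raw.decode("latin-1")  # single-byte ISO 8859-1 reading
as_cp1251 = raw.decode("cp1251")   # Windows-1251 reading

def repair(garbled: str) -> str:
    """Undo the mis-decoding: recover the original octets, then read them as Windows-1251."""
    return garbled.encode("latin-1").decode("cp1251")

print(as_latin1, as_cp1251)
print(repair("Ñòî ÷àñîâ"), "/", repair("Þðèé Ëîçà"))
```

Because ISO 8859-1 maps every byte to exactly one character, the round trip through `encode("latin-1")` recovers the original octets without loss, which is why the repair works on already-decoded strings.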

rastaman111 commented 6 days ago

That's strange, because Apple's native player recognizes the text without trouble, and so does Google Translate.

I'll try to search for similar files and let you know the result

sbooth commented 6 days ago

That is interesting. I will take a closer look at the file's tag to make sure it is being handled correctly. I've heard of charset detection for ID3v1 tags but for ID3v2 I don't think there should be any guessing involved.

rastaman111 commented 6 days ago

Apple Music says it's version 3

I also ran it through several libraries and they all say that it is version 3

sbooth commented 6 days ago

It is an ID3v2.3 tag. The TPE1 frame for example contains the following bytes:

Field	Hex Bytes	Meaning
Frame ID	`54 50 45 31`	`TPE1`
Size	`00 00 00 0A`	10
Flags	`00 00`
Text Encoding	`00`	ISO 8859-1
Information	`DE F0 E8 E9 20 CB EE E7 E0`

It's possible that Music runs text reported as ISO 8859-1 through a character detection library. Based on the ID3v2 tag itself, TagLib (the metadata library used by SFBAudioEngine) is interpreting the data correctly.
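For reference, the frame bytes listed above can be decoded by hand. The sketch below assumes the ID3v2.3 frame layout (4-byte ID, 4-byte big-endian size, 2 flag bytes, then one text-encoding byte followed by the text); `parse_text_frame` is a hypothetical helper, not a TagLib or SFBAudioEngine function.

```python
import struct

# Text-encoding byte values defined for ID3v2.3 text frames.
ENCODINGS = {0: "latin-1", 1: "utf-16"}

def parse_text_frame(frame: bytes):
    """Split an ID3v2.3 text frame into (frame ID, size, declared codec, raw text bytes)."""
    frame_id, size, _flags = struct.unpack(">4sIH", frame[:10])
    body = frame[10:10 + size]
    declared = ENCODINGS[body[0]]  # first body byte is the encoding marker
    return frame_id.decode("ascii"), size, declared, body[1:]

# The TPE1 frame from the table above.
frame = bytes.fromhex("54504531 0000000A 0000 00 DEF0E8E920CBEEE7E0")
frame_id, size, declared, payload = parse_text_frame(frame)

print(frame_id, size, declared)
print(payload.decode(declared))   # what the tag says the text is
print(payload.decode("cp1251"))   # what the writer apparently meant
```

Decoding with the declared codec yields the garbled text, so a reader that trusts the encoding byte is behaving as the tag instructs.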

It shouldn't be terribly hard to wrap uchardet to add the option for character set detection for ID3v1 or ID3v2 tags using ISO 8859-1, but I haven't investigated what it would entail.

rastaman111 commented 3 days ago

I have the following question. How can I determine from the text which encoding it uses, so that I can show the user the correct text?

sbooth commented 3 days ago

Algorithms for character set detection are something I know little to nothing about. Perhaps an educated guess is made based on a frequency analysis of octets in the input?

For the file that you shared it should be possible to feed the C strings from the metadata to uchardet or a similar library and see what it comes back with, and then use iconv to convert to UTF-8.
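As a rough illustration of the detect-then-convert idea, the sketch below uses a deliberately naive heuristic in place of uchardet: it accepts valid UTF-8 first, then prefers Windows-1251 when nearly all resulting letters are Cyrillic, and otherwise falls back to ISO 8859-1. The function names, the candidate list, and the 0.8 threshold are all assumptions chosen for this example; a real detector weighs many more encodings and languages.

```python
def cyrillic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the basic Cyrillic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0400" <= c <= "\u04FF" for c in letters) / len(letters)

def guess_and_decode(raw: bytes, threshold: float = 0.8) -> tuple:
    """Return (codec name, decoded text) using a crude, Cyrillic-only heuristic."""
    try:
        # Legacy single-byte text containing high bytes is rarely valid UTF-8.
        return "utf-8", raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    candidate = raw.decode("cp1251", errors="replace")
    if cyrillic_ratio(candidate) >= threshold:
        return "cp1251", candidate
    return "latin-1", raw.decode("latin-1")

# Octets of the TPE1 text from the shared file.
codec, text = guess_and_decode(bytes.fromhex("DEF0E8E920CBEEE7E0"))
utf8_bytes = text.encode("utf-8")  # the step iconv would perform in C
print(codec, text, utf8_bytes)
```

A heuristic like this will misfire on short or mixed-language strings, which is one reason to make detection an opt-in step rather than the default behavior.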

rastaman111 commented 2 days ago

This problem is not limited to this file. I found a large number of such files. In the native Files and Music applications the data is displayed as expected, but using the standard API does not give the desired result :(