larsbs / id3v2lib

id3v2lib is a library written in C to read and edit id3 tags from mp3 files.
BSD 2-Clause "Simplified" License
128 stars 44 forks source link

Unicode string termination issues #54

Open daleonov opened 6 months ago

daleonov commented 6 months ago

When I read the tags that have unicode text in them, in most cases there's some garbage before string termination character.

Example:

Byte sequence I get:
ff fe 32 00 30 00 31 00 35 00 31 0c f7 7f 00 00

"Corrected" byte order before I decode it from UTF-16:
fe ff 00 32 00 30 00 31 00 35 0c 31 7f f7 00 00

What I get:
2015ఱ翷

So, "fe ff" is a BOM, "00 32 00 30 00 31 00 35" is "2015". Then there are four garbage bytes "0c 31 7f f7" followed by string termination. Those garbage bytes are different every time I run the app, so looks like RAM garbage rather than something existing in the mp3 file. Also, those files' metadata displays correctly in other apps.

Here's what different "size" values say.

frame->header->size: 17
textFrame->data->size: 16
ID3v2_strlen: 14
ID3v2_strlent: 16

Correct length, if we count BOM and termination bytes should be 8 + 2 for BOM + 2 for terminations, equals 12. So all those values above are off.

Here's a bit of code I'm doing:

// I'm iterating through all the tags.
// At this point of code I'm sure I'm looking at Text frame, so I just cast it below.

juce::String value;  // I'm using this type externally, don't worry too much about it
ID3v2_TextFrame* textFrame = (ID3v2_TextFrame*)frame;
const char encoding = textFrame->data->encoding;

/*
I also try frame->header->size and textFrame->size.
All three of them are different values, but the actual readable section of the text is shorter that them.
*/
const int textLengthBytes = ID3v2_strlent (textFrame->data->text);

if (encoding == ID3v2_ENCODING_ISO)
{
    // ISO encoding, keeping it as is
    value = juce::String (textFrame->data->text, textLengthBytes);
}
else if (encoding == ID3v2_ENCODING_UNICODE)
{
    // UNICODE - this is where the trouble is

    DBG ("frame->header->size: " << frame->header->size);
    DBG ("textFrame->size: " << textFrame->size);

    // I'm printing those bytes here to see what's under the hood
    DBG ("====");
    for (int i = 0; i < textLengthBytes; ++i)
        DBG (static_cast<unsigned char> (textFrame->data->text[i]));
    DBG ("====");

    // This but makes sure bytes are in correct order and interprets them as UTF16
    // I've used other means of converting, same result, so I don't think the issue is here
    const auto* textAsUnicode = reinterpret_cast<juce::CharPointer_UTF16::CharType*> (textFrame->data->text);
    juce::CharPointer_UTF16 textAsUnicodeCharPointer (textAsUnicode);
    value = juce::String (textAsUnicodeCharPointer);
}

Also worth mentioning that I build the project using VS v143, but I've compiled your library using LLVM (clang) since it has Variable length arrays that aren't supported by VS. Either there's some incompatibility between binaries, or I'm missing something else there.

If you think my compiler setup is to blame, I can volunteer to fix #48 and replace those VLA's by malloc's, so I can build everything via VS toolset.

daleonov commented 6 months ago

Update: apparently VLA's weren't the problem. Fixed both issues in my recent PR. Tested via reading a bunch of mp3 files with unicode tags. Haven't tested anything that does writing/removing though.