n10v / id3v2

🎵 ID3 decoding and encoding library for Go
https://pkg.go.dev/github.com/bogem/id3v2/v2
MIT License

UTF-16 Encoding Issue #86

Open · hut8 opened 9 months ago

hut8 commented 9 months ago

Thanks for a great library! I appreciate your hard work on it.

As in my last issue (which was a distinct encoding problem), I'm running into more trouble with encoding.

Note that I'm calling SetVersion prior to SetDefaultEncoding because of #85.

tags.SetVersion(3)                           // id3v2.3.0
tags.SetDefaultEncoding(id3v2.EncodingUTF16) // UTF-8 would be cool, but 2.3 doesn't support it
tags.SetTitle("Мир! Вашему! Дому! (K-DEF Remix)")
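
For context, a self-contained reproduction looks roughly like this (the file path is just a placeholder):

package main

import (
    "log"

    "github.com/bogem/id3v2/v2"
)

func main() {
    // "myfile.mp3" is a placeholder path.
    tags, err := id3v2.Open("myfile.mp3", id3v2.Options{Parse: true})
    if err != nil {
        log.Fatal(err)
    }
    defer tags.Close()

    tags.SetVersion(3)                           // id3v2.3.0
    tags.SetDefaultEncoding(id3v2.EncodingUTF16) // UTF-8 would be cool, but 2.3 doesn't support it
    tags.SetTitle("Мир! Вашему! Дому! (K-DEF Remix)")

    if err := tags.Save(); err != nil {
        log.Fatal(err)
    }
}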

When I run exiftool -v3 -l myfile.mp3, it shows this for TIT2:

  |     009c: 01 fe ff 04 1c 04 38 04 40 00 21 00 20 04 12 04 [......8.@.!. ...]
  |     00ac: 30 04 48 04 35 04 3c 04 43 00 21 00 20 04 14 04 [0.H.5.<.C.!. ...]
  |     00bc: 3e 04 3c 04 43 00 21 00 20 00 28 00 4b 00 2d 00 [>.<.C.!. .(.K.-.]
  |     00cc: 44 00 45 00 46 00 20 00 52 00 65 00 6d 00 69 00 [D.E.F. .R.e.m.i.]
  |     00dc: 78 00 29 00 00 00                               [x.)...]

I see that the first byte, 0x01, indicates that we're dealing with UCS-2 with a BOM; let's disregard that byte. Then the BOM says it's big-endian (fine). I edited out the 0x01 encoding byte in a text editor and pasted the rest into an interactive Python session:

>>> hexstr="""
... fe ff 04 1c 04 38 04 40 00 21 00 20 04 12 04
... 30 04 48 04 35 04 3c 04 43 00 21 00 20 04 14 04
... 3e 04 3c 04 43 00 21 00 20 00 28 00 4b 00 2d 00
... 44 00 45 00 46 00 20 00 52 00 65 00 6d 00 69 00
... 78 00 29 00 00 00
... """
>>> raw = bytes([int(x, 16) for x in hexstr.replace("\n"," ").split(" ") if x != ""])
>>> raw.decode("utf16")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-16-be' codec can't decode byte 0x00 in position 68: truncated data
>>> len(raw)
69
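
Just to sanity-check the expected length in Go (a sketch using golang.org/x/text/encoding/unicode, not the library's actual code path): a BMP-only string encoded as UTF-16 always has an even number of bytes, so BOM + 32 characters + the 00 00 terminator should be 68 bytes, not the 69 in the dump above:

package main

import (
    "fmt"

    "golang.org/x/text/encoding/unicode"
)

func main() {
    title := "Мир! Вашему! Дому! (K-DEF Remix)"

    // Big-endian UTF-16 with a BOM, which is what the exiftool dump shows.
    enc := unicode.UTF16(unicode.BigEndian, unicode.UseBOM).NewEncoder()
    encoded, err := enc.Bytes([]byte(title))
    if err != nil {
        panic(err)
    }

    // BOM (2 bytes) + 32 characters * 2 bytes = 66, plus the 00 00 terminator = 68.
    body := append(encoded, 0x00, 0x00)
    fmt.Println(len(body), len(body)%2 == 0) // 68 true -- but the dump above is 69 bytes
}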

Kinda hard to get to the standard at the moment (it's been like this for a while):

[screenshot: id3.org error page]

Here's a copy: https://web.archive.org/web/20190207033339/https://id3.org/id3v2.3.0#ID3v2_frame_overview

At least in 2.3.0, it does say that you're supposed to have a null terminator (so 00 00 for UTF-16/UCS-2). And you can't have a UCS-2 string with an odd number of bytes, since every code unit is two bytes. So it seems like you're appending one extra null byte, which makes the length no longer a multiple of 2, and that makes it not legit UCS-2 or UTF-16.

I'm pasting a screenshot of the id3edit output rather than copying the text because I think the colors are cool:

[screenshot: id3edit output reporting the TIT2 frame as invalid]

So id3edit thinks it's bad too. If I just remove the one extra null byte, Python (and other tools) seems happy with it:

>>> hexstr = """
... fe ff 04 1c 04 38 04 40 00 21 00 20 04 12 04
... 30 04 48 04 35 04 3c 04 43 00 21 00 20 04 14 04
... 3e 04 3c 04 43 00 21 00 20 00 28 00 4b 00 2d 00
... 44 00 45 00 46 00 20 00 52 00 65 00 6d 00 69 00
... 78 00 29 00 00
... """
>>> raw = bytes([int(x, 16) for x in hexstr.replace("\n"," ").split(" ") if x != ""])
>>> raw.decode("utf16")
'Мир! Вашему! Дому! (K-DEF Remix)\x00'

I think this has something to do with other PRs that were meant to fix things for other encodings, maybe for v2.4...

hut8 commented 9 months ago

This looks relevant: https://github.com/n10v/id3v2/blob/34286c4b196c8cbf785f3402fd6f127fcb696714/encoding.go#L162-L164

Also, oddly, in your code,

// bom is used in UTF-16 encoded Unicode with BOM.
// See https://en.wikipedia.org/wiki/Byte_order_mark.
var bom = []byte{0xFF, 0xFE}

This is the little-endian BOM, not the big-endian one. But your program seems to write a big-endian BOM (and big-endian text).
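
For comparison, a quick sketch (using golang.org/x/text/encoding/unicode, the same package family the library encodes with, judging by the NewEncoder call below) of what each BOM looks like on the wire:

package main

import (
    "fmt"

    "golang.org/x/text/encoding/unicode"
)

func main() {
    be := unicode.UTF16(unicode.BigEndian, unicode.UseBOM).NewEncoder()
    le := unicode.UTF16(unicode.LittleEndian, unicode.UseBOM).NewEncoder()

    bigEndian, _ := be.Bytes([]byte("A"))
    littleEndian, _ := le.Bytes([]byte("A"))

    fmt.Printf("big-endian:    % X\n", bigEndian)    // FE FF 00 41 <- what the file actually contains
    fmt.Printf("little-endian: % X\n", littleEndian) // FF FE 41 00 <- what the bom constant above describes
}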

hut8 commented 9 months ago

Alright I think I got it:

https://github.com/n10v/id3v2/blob/34286c4b196c8cbf785f3402fd6f127fcb696714/text_frame.go#L24-L33

When I call tf.WriteTo(), it calls bw.EncodeAndWriteText. Follow along: it will add a single 0 byte.

func (bw *bufWriter) EncodeAndWriteText(src string, to Encoding) {
    if bw.err != nil {
        return
    }

    bw.err = encodeWriteText(bw, src, to)
}

encodeWriteText does this:

// encodeWriteText encodes src from UTF-8 to "to" encoding and writes to bw.
func encodeWriteText(bw *bufWriter, src string, to Encoding) error {
    if to.Equals(EncodingUTF8) {
        bw.WriteString(src)
        return nil
    }

    toXEncoding := resolveXEncoding(nil, to)
    encoded, err := toXEncoding.NewEncoder().String(src)
    if err != nil {
        return err
    }

    bw.WriteString(encoded)

    // Here we go! 💣
    if to.Equals(EncodingUTF16) && !bytes.HasSuffix([]byte(encoded), []byte{0}) {
        bw.WriteByte(0)
    }

    return nil
}

So at this point, before encodeWriteText is called, the termination characters haven't been added yet. Therefore that if condition, !bytes.HasSuffix([]byte(encoded), []byte{0}), is true. But really that entire if statement is wrong: in UTF-16 LE, any string ending in an ASCII character already ends with a 0x00 byte (the high byte of the last code unit), so the check only sometimes fires, and when it does fire it appends a lone byte that breaks the two-byte alignment. I think it should be removed entirely (see the sketch at the end of this comment).

After encodeWriteText is called, we now have a single extra 0 at the end, which gives us an odd number of bytes (invalid UCS-2/UTF-16). Then back in TextFrame.WriteTo():

func (tf TextFrame) WriteTo(w io.Writer) (int64, error) {
    return useBufWriter(w, func(bw *bufWriter) {
        bw.WriteByte(tf.Encoding.Key)
        bw.EncodeAndWriteText(tf.Text, tf.Encoding) // <- this added a single 0

        // https://github.com/bogem/id3v2/pull/52
        // https://github.com/bogem/id3v2/pull/33
        bw.Write(tf.Encoding.TerminationBytes) // <- now we have two more
    })
}
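
To double-check, here's a little sketch of what those three writes produce for this title, using plain x/text and a bytes.Buffer rather than the library's bufWriter:

package main

import (
    "bytes"
    "fmt"

    "golang.org/x/text/encoding/unicode"
)

func main() {
    var buf bytes.Buffer

    enc := unicode.UTF16(unicode.BigEndian, unicode.UseBOM).NewEncoder()
    encoded, err := enc.Bytes([]byte("Мир! Вашему! Дому! (K-DEF Remix)"))
    if err != nil {
        panic(err)
    }

    buf.WriteByte(0x01)     // frame encoding byte: 0x01 = UTF-16 with BOM
    buf.Write(encoded)      // bw.EncodeAndWriteText: BOM + big-endian text...
    buf.WriteByte(0x00)     // ...plus the stray byte appended by encodeWriteText
    buf.Write([]byte{0, 0}) // bw.Write(tf.Encoding.TerminationBytes)

    b := buf.Bytes()
    fmt.Println(len(b))               // 70 bytes, same as the exiftool dump above
    fmt.Printf("% X\n", b[len(b)-6:]) // 78 00 29 00 00 00 -- same tail as the dump
}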

There you go: three null bytes at the end, where the spec calls for exactly two. So there's a bug with UTF-16, and that's consistent with the tools I'm using to read it.

[screenshot]
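
For reference, here's roughly what encodeWriteText would look like with that if block dropped. Just a sketch: the 00 00 terminator still comes from the bw.Write(tf.Encoding.TerminationBytes) call in TextFrame.WriteTo.

// encodeWriteText encodes src from UTF-8 to "to" encoding and writes to bw.
// Sketch only: the trailing-zero special case is removed, and termination is
// left entirely to the caller (which already writes Encoding.TerminationBytes).
func encodeWriteText(bw *bufWriter, src string, to Encoding) error {
    if to.Equals(EncodingUTF8) {
        bw.WriteString(src)
        return nil
    }

    toXEncoding := resolveXEncoding(nil, to)
    encoded, err := toXEncoding.NewEncoder().String(src)
    if err != nil {
        return err
    }

    bw.WriteString(encoded)
    return nil
}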