hut8 opened this issue 9 months ago
This looks relevant: https://github.com/n10v/id3v2/blob/34286c4b196c8cbf785f3402fd6f127fcb696714/encoding.go#L162-L164
Also, oddly, in your code:

// bom is used in UTF-16 encoded Unicode with BOM.
// See https://en.wikipedia.org/wiki/Byte_order_mark.
var bom = []byte{0xFF, 0xFE}

This is the little-endian BOM, not the big-endian one. But your program seems to write a big-endian BOM (and big-endian text).
Alright I think I got it:
https://github.com/n10v/id3v2/blob/34286c4b196c8cbf785f3402fd6f127fcb696714/text_frame.go#L24-L33
When I call tf.WriteTo(), it calls bw.EncodeAndWriteText. Follow along: it will add a single 0.
func (bw *bufWriter) EncodeAndWriteText(src string, to Encoding) {
	if bw.err != nil {
		return
	}
	bw.err = encodeWriteText(bw, src, to)
}
encodeWriteText does this:
// encodeWriteText encodes src from UTF-8 to "to" encoding and writes to bw.
func encodeWriteText(bw *bufWriter, src string, to Encoding) error {
	if to.Equals(EncodingUTF8) {
		bw.WriteString(src)
		return nil
	}

	toXEncoding := resolveXEncoding(nil, to)
	encoded, err := toXEncoding.NewEncoder().String(src)
	if err != nil {
		return err
	}

	bw.WriteString(encoded)

	// Here we go! 💣
	if to.Equals(EncodingUTF16) && !bytes.HasSuffix([]byte(encoded), []byte{0}) {
		bw.WriteByte(0)
	}

	return nil
}
So at this point, before encodeWriteText is called, the termination characters have not yet been added, so that if clause, !bytes.HasSuffix([]byte(encoded), []byte{0}), is true. But really the entire if statement is wrong: any ASCII string encoded as UTF-16LE already ends with a single 0 byte (the zero high byte of its last character), so the suffix check says nothing about whether a terminator is present. I think it should be removed entirely.
After encodeWriteText is called, we now have a single 0 at the end, which gives us an odd number of bytes (invalid UCS-2/UTF-16). Then back to TextFrame.WriteTo():
func (tf TextFrame) WriteTo(w io.Writer) (int64, error) {
	return useBufWriter(w, func(bw *bufWriter) {
		bw.WriteByte(tf.Encoding.Key)
		bw.EncodeAndWriteText(tf.Text, tf.Encoding) // <- this added a single 0
		// https://github.com/bogem/id3v2/pull/52
		// https://github.com/bogem/id3v2/pull/33
		bw.Write(tf.Encoding.TerminationBytes) // <- now we have two more
	})
}
There you go, three null bytes at the end. So there's a bug with UTF-16, and that's consistent with the tools I'm using to read it.
Thanks for a great library! I appreciate your hard work on it.
Like in my last issue (which is a distinct encoding issue), I'm running into more trouble with encoding.
Note that I'm calling SetVersion prior to SetDefaultEncoding because of #85
When I run exiftool -v3 -l myfile.mp3, it indicates this for TIT2:
I see that the first 0x01 indicates that we're dealing with UCS-2 with a BOM. Let's disregard that. Then the BOM says it's big endian (fine). I edited out the 0x01 tag in a text editor then pasted it into an interactive python session:
Kinda hard to get to the standard at the moment (it's been like this for a while):
Here's a copy: https://web.archive.org/web/20190207033339/https://id3.org/id3v2.3.0#ID3v2_frame_overview
At least in 2.3.0, it does say that you're supposed to have a null terminator (so 00 00 for UTF-16/UCS-2). I don't think you can have a UCS-2 string with an odd number of bytes. So it seems like you're appending an additional null character, the length then isn't a multiple of 2, and that makes it not legit UCS-2 or UTF-16. I'm pasting a screenshot rather than copying the text of the output of id3edit because I think the colors are cool:
So that thinks it's bad too. If I just remove the one extra null byte, python seems happy with it (and other tools):
I think this has something to do with other PRs that were meant to fix things for other encodings maybe in 2.4...
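For what it's worth, a minimal validity check along the lines the 2.3.0 text implies might look like this (my own sketch, not part of id3v2: an even byte count, and a terminated string ending in 00 00):

```go
package main

import (
	"bytes"
	"fmt"
)

// validUTF16Payload is a hypothetical check: a terminated UTF-16/UCS-2
// payload must have an even byte length and end with the 00 00 terminator.
func validUTF16Payload(b []byte) bool {
	return len(b)%2 == 0 && bytes.HasSuffix(b, []byte{0x00, 0x00})
}

func main() {
	good := []byte{0xFE, 0xFF, 0x00, 0x41, 0x00, 0x00}      // BOM + "A" + 00 00
	bad := []byte{0xFE, 0xFF, 0x00, 0x41, 0x00, 0x00, 0x00} // extra 0: odd length
	fmt.Println(validUTF16Payload(good), validUTF16Payload(bad)) // true false
}
```

The `bad` case is exactly the three-trailing-null shape described above, and removing the one extra null byte makes it pass.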