Write TerminationBytes for id3.org spec compliance

ericmurmur commented 6 years ago

According to the spec http://id3.org/id3v2.4.0-structure, TerminationBytes has to be in the text frame. Some ID3 reader like Windows file explorer media file properties cannot handle such unicode text frames without proper TerminationBytes.

n10v commented 6 years ago

Hi! Thank you for your contribution. Can you please quote a text, where it's written? I can't find it. The only information I found is here:

     <Header for 'Text information frame', ID: "T000" - "TZZZ",
     excluding "TXXX" described in 4.2.6.>
     Text encoding                $xx
     Information                  <text string(s) according to encoding>

And there is nothing written about termination bytes.

n10v commented 6 years ago

And also this (http://id3.org/id3v2.4.0-frames):

All text information frames supports multiple strings, 
stored as a null separated list, where null is 
reperesented by the termination code for the character encoding.

ericmurmur commented 6 years ago

There are some argues from google search about this. However,

1. from 4 in http://id3.org/id3v2.4.0-structure, it states text string in frame with encoding byte prefixed and terminated with its corresponding termination bytes. It also show an empty string example.
1. From 4.2 in http://id3.org/id3v2.4.0-frames, it says multiple strings are stored as null separated list where null is represented by the termination code. It does not clearly state the status of the final termination bytes. ( so shall follow 1.?)
1. iTunes generate tags with termination bytes.
1. Windows File Explorer interprets wrong information if no such termination bytes in Unicode strings
1. A well known free mp3tag utility https://www.mp3tag.de/en/ since year 2003 generates tag with termination bytes

4.   ID3v2 frame overview
... ...

Frames that allow different types of text encoding contains a text
encoding description byte. Possible encodings:

$00   ISO-8859-1 [ISO-8859-1]. Terminated with $00.
$01   UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All
strings in the same frame SHALL have the same byteorder.
Terminated with $00 00.
$02   UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM.
Terminated with $00 00.
$03   UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with $00.

Strings dependent on encoding are represented in frame descriptions
as <text string according to encoding>, or <full text string
according to encoding> if newlines are allowed. Any empty strings of
type $01 which are NULL-terminated may have the Unicode BOM followed
by a Unicode NULL ($FF FE 00 00 or $FE FF 00 00).

From 4.2 in http://id3.org/id3v2.4.0-frames

All text information frames supports multiple strings, 
stored as a null separated list, where null is 
represented by **the termination code** for the character encoding.

n10v commented 6 years ago

"it states text string in frame with encoding byte prefixed and terminated with its corresponding termination bytes"

It doesn't state it actually. It just shows the termination bytes for corresponding encodings, but nothing says that every encoded text should be terminated.

"It also show an empty string example"

It only shows an example for Unicode-16 with BOM and only which are NULL-terminated:

Any empty strings of type $01 which are NULL-terminated 
may have the Unicode BOM followed by a Unicode NULL ($FF FE 00 00 or $FE FF 00 00).

2.

From 4.2 in http://id3.org/id3v2.4.0-frames,
it says multiple strings are stored as null separated list where null 
is represented by the termination code. 
It does not clearly state the status of the final termination bytes.

Yes, there is nothing said about using one string in text frame, so if you use only one string in text frame with no termination, you don't violate the standard.

4 - I would like to add the writing the termination bytes for text frames, just for compatibility with Windows File Explorer. But please open the new issue for it with detailed description.

3, 5 - We should not rely on how other programs work, but on standards.

n10v commented 6 years ago

It's really strange situation about the termination bytes of text frames, but just take a look on other frames like USLT or APIC at http://id3.org/id3v2.4.0-frames. You can see the explicit indication about the termination bytes at the end:

... Description is a short description of the picture, represented as a terminated text string ...

     <Header for 'Attached picture', ID: "APIC">
     Text encoding      $xx
     MIME type          <text string> $00
     Picture type       $xx
     Description        <text string according to encoding> $00 (00) <----
     Picture data       <binary data>

... The 'Content descriptor' is a terminated string ...

     <Header for 'Unsynchronised lyrics/text transcription', ID: "USLT">
     Text encoding        $xx
     Language             $xx xx xx
     Content descriptor   <text string according to encoding> $00 (00) <----
     Lyrics/text          <full text string according to encoding>

In docs about text frames I don't see it:

     <Header for 'Text information frame', ID: "T000" - "TZZZ",
     excluding "TXXX" described in 4.2.6.>
     Text encoding                $xx
     Information                  <text string(s) according to encoding> <----

So I don't think WFE works according to standards.

ericmurmur commented 6 years ago

It's really not clear in spec. Hence compatibility will be good.

In addition, in sec 4, frame overview from http://id3.org/id3v2.4.0-structure, it lists

Terminated with $00

or

Terminated with $00 00

for possible encodings of text.

Doesn't it mean to prefix encoding byte and terminate the text with termination bytes?

Frames that allow different types of text encoding contains a text encoding description byte. Possible encodings:
$00   ISO-8859-1 [ISO-8859-1]. Terminated with $00.
$01   UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All
strings in the same frame SHALL have the same byteorder.
Terminated with $00 00.
$02   UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM.
Terminated with $00 00.
$03   UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with $00.

n10v commented 6 years ago

Doesn't it mean to prefix encoding byte and terminate the text with termination bytes? I think, it doesn't mean so. It just means that if it should be terminated, it terminated with corresponding termination bytes.

n10v commented 6 years ago

@ericmurmur Please open a new issue with detailed description that WFE doesn't correctly read text frames, which are set with id3v2 library.

ericmurmur commented 6 years ago

It just means that if it should be terminated, it terminated with corresponding termination bytes.

Yes. So it means termination bytes shall be there for text string, doesn't it? And since Golang does not include termination bytes for string like c/c++, it shall be added, isn't it? In c/c++ library, termination bytes are already in their string buffer.

n10v commented 6 years ago

Yes. So it means termination bytes shall be there for text string, doesn't it? 
And since Golang does not include termination bytes for string like c/c++, 
it shall be added, isn't it? In c/c++ library, termination bytes are already in their string buffer.

I don't think so. There is no explicit information in standard that I must terminate strings in every text frame.

ericmurmur commented 6 years ago

Is the encoding byte added? Yes, right? In the standard, it lists all 4 encoding bytes. In every case, it has stated [Terminated with $00] or [Terminated with $00 00]. It is NOT optional. So at least, it shall be interpreted as for those started with encoding byte, it shall be terminated with $00 or $00 00. Or where do you suggest to implement the sentence [Terminated with $00] or [Terminated with $00 00]? In addition, according to 4.2 in http://id3.org/id3v2.4.0-frames,

Information <text string(s) according to encoding>

Doesn't it mean to follow the encoding descriptions in http://id3.org/id3v2.4.0-structure, which has stated [[Terminated with $00] or [Terminated with $00 00] in all encoding cases?

n10v commented 4 years ago

Hi @ericmurmur! I'm very sorry for so late answer, I had a lot of to do.

I think, you were right in this discussion. Many sites and apps terminate text frames with $00 (00), although it's not clearly stated in spec. If you'll update tests, I would like to merge this PR :)

Just replace following tests: parse_test.go

// https://github.com/bogem/id3v2/issues/13.
// https://github.com/bogem/id3v2/commit/3845103da5b1698289b82a90f5d2559b770bd996
func TestParseV3UnsafeSize(t *testing.T) {
    t.Parallel()

    buf := new(bytes.Buffer)
    title := strings.Repeat("A", 254)

    tag := NewEmptyTag()
    tag.SetVersion(3)
    tag.SetTitle(title)
    if _, err := tag.WriteTo(buf); err != nil {
        t.Fatalf("Error while writing tag: %v", err)
    }

    titleFrameHeader := buf.Bytes()[tagHeaderSize : tagHeaderSize+frameHeaderSize]
    expected := []byte{0, 0, 1, 0}
    got := titleFrameHeader[4:8]
    if !bytes.Equal(got, expected) {
        t.Fatalf("Expected %v, got %v", expected, got)
    }

    parsedTag, err := ParseReader(buf, Options{Parse: true})
    if err != nil {
        t.Fatalf("Error while parsing tag: %v", err)
    }
    if parsedTag.Title() != title {
        t.Fatalf("Titles are not equal: len(parsedTag.Title()) == %v, len(title) == %v", len(parsedTag.Title()), len(title))
    }
}

tag_test.go

    framesSize    = 211948

n10v commented 4 years ago

And also please push new test.mp3 :)

n10v commented 4 years ago

Closing in favour of #52

n10v / id3v2

Write TerminationBytes for id3.org spec compliance #33