Open krisfed opened 1 year ago
I looked at the resulting MP3 file in a hex editor, and it seems like libsndfile writes MP3 metadata as ID3v1. Is that right? From my limited understanding, ID3v1 either doesn't really support UTF-8 or there is no way to indicate which encoding is used in a ID3v1 tag... Please do correct me if this is wrong.
Is there any way to write ID3v2 (preferably ID3v2.4, which seem to explicitly support UTF-8) to MP3 files with libsndfile? Would appreciate any info!
Hi @krisfed . AFAIK libsndfile writes and reads ID3v1 and ID3v2, no ID3v4 yet:
@arthurt ?
Thank you @evpobr ! Can writing of the ID3v2 tags be achieved through sf_set_string
? I see lame_get_id3v2_tag
is used to determine if it is a ID3v2 header, but I am not sure how to control that from the libsndfile's public interface... Would love an example or some pointers!
I did a bit more digging, and I understand that there are some conventions about which encoding is used by which ID3 tag version. It is too bad that https://id3.org/ seems to be down at the moment, but this is what I was able to gather (do correct me if something below is off):
ID3v1 - either doesn't support UTF-8 at all, or you can use any encoding you like, but don't have any way to signal to decoder how to interpret it (no "encoding byte"). The assumed encoding might be the (ASCII-based) ISO-8859-1 (?)
Refs:
ID3v2.2/ID3v2.3 - added Unicode support (16-bit unicode 2.0, UCS-2) in addition to ISO-8859-1. Need to specify which is used by the encoding byte (0x00 for ISO-8859-1 and 0x01 for UTF-16 encoded Unicode). Unicode strings must begin with the Unicode BOM ($FF FE or $FE FF) to identify the byte order.
Refs:
ID3v2.4 - Textual data to be encoded in UTF-8 rather than UTF-16. Added two more encoding byte values: 0x02 – UTF-16BE encoded Unicode without BOM 0x03 – UTF-8 encoded Unicode
Refs:
The above might explain why reading UTF-8 data from ID3v1 does not always work in all software (if it expects/assumes ISO-8859-1). It would be nice if libsndfile supported ID3v2.4 (?), which should explicitly support UTF-8. Using ID3v2.2/ID3v2.3 might be the next best option (?), but it still doesn't explicitly support UTF-8. Should UTF-16 be used instead for writing ID3v2.2/ID3v2.3 tags?
I am still unclear on how to make libsndfile use ID3v2.2/ID3v2.3 instead of ID3v1. Should the string passed into sf_set_string
contain the encoding byte or something else that would indicate it is for writing into a ID3v2.2/ID3v2.3 tag? I see a comment here https://askubuntu.com/questions/622132/how-to-write-album-artist-id3-tag-using-lame that the command-line LAME tool at least at some point used the size of the data to determine whether to use ID3v2 (i.e. use ID3v2 if data won't fit into ID3v1) if specific '--add-id3v2' or '--id3v2-only' options were not used. Does libsndfile use a similar kind of logic?
Would appreciate any clarifications, corrections, or thoughts. Thanks!
Libsndfile supports both ID3 v1 and ID3 v2, using both libmp3lame and libmpg123 to do the actual work.
https://github.com/libsndfile/libsndfile/blob/master/src/mpeg_decode.c#L462
When reading a file, v2 tags take precedence over v1 tags. libmpg123 does the right thing, and all strings it returns are UTF-8, regardless of the actual encoding of the tag, so all strings returned by sf_get_string()
will also be UTF-8.
For encode, libsndfile works basically like lame --add-id3v2
and adds both a ID3v2 header and ID3v1 trailer.
However, your questions caused me to investigate, and currently there is a bug with the text encoding when setting ID3 tags. The current state of the code is that the values passed to sf_set_string()
are passed to lame as latin1 (aka ISO-8859-1, aka CP-1252). If they happen to be UTF-8, they will be written into the file incorrectly as "latin1" for both ID3v1 and ID3v2. So that's a bug.
In general, the encoding of the text passed through sf_get_string()
/sf_set_string()
isn't really specified. For sane formats like Ogg and Flac, it's UTF-8 and we can relax. For WAV and WAV like, literally nobody knows. It isn't in the spec, some software uses latin1, others use UTF-8. All we know is that it isn't wide characters. I guess libsndfile is doing the only thing it can, and just reporting the stream of bytes that it sees.
Thank you for the info, @arthurt ! Really appreciate your help with this.
To clarify, in our workflow we do not use libsndfile for reading, only for writing MP3 files.
Sounds like the bug with writing UTF-8 metadata as latin1 could explain why we are seeing other software (like Apple Music and Windows Media Player) not reading MP3 metadata correctly from files written by libsndfile.
It is useful to know that libsndfile is supposed to write both a ID3v2 header and ID3v1 trailer. The odd thing is that I don't see ID3v2 tags in the MP3 files written by libsndfile, only the ID3v1 tags. That's why I was asking how to make libsndfile use ID3v2 tags. From my (limited) understanding, ID3v1 tags begin with "TAG" and are at the end of the file. The ID3v2 usually occur at the start of the file and should start with "ID3".
This is what I am seeing when I look at the MP3 file written by libsndfile in a hex editor.
End of the file seem to contain ID3v1 tag as expected:
But start of the file does not seem to have a ID3v2 as far as I can tell:
Compare that with the start of a file where metadata was written by Apple Music, where I can see ID3v2 tag starting with "ID3":
Similarly when I look at those two files with Mp3tag (https://www.mp3tag.de/](https://www.mp3tag.de/ ), it shows the tag version is ID3v1 for file written with libsndfile, and ID3v2.2 for file with metadata written by Apple Music.
Would love to hear your thoughts on this! Thanks again for all the help.
Turns out I was wrong on that point. I wrote the code, but it was a long time ago now :laughing:
libsndfile does not work like lame --add-id3v2
, and liblame will only output an id3v2 header in circumstances where tag field sizes are exceeded,or non-latin1 text is added.
If you exceed an ID3v1 field length, you will see an ID3v2 header, which is latin1 encoded:
$ head -c 256 roygbiv-convert.mp3 | hexdump -C
00000000 49 44 33 03 00 00 00 00 01 10 54 53 53 45 00 00 |ID3.......TSSE..|
00000010 00 2f 00 00 00 4c 41 4d 45 20 36 34 62 69 74 73 |./...LAME 64bits|
00000020 20 76 65 72 73 69 6f 6e 20 33 2e 31 30 30 20 28 | version 3.100 (|
00000030 68 74 74 70 3a 2f 2f 6c 61 6d 65 2e 73 66 2e 6e |http://lame.sf.n|
00000040 65 74 29 54 49 54 32 00 00 00 08 00 00 00 52 6f |et)TIT2.......Ro|
00000050 79 67 62 69 76 54 50 45 31 00 00 00 11 00 00 00 |ygbivTPE1.......|
00000060 42 6f 61 72 64 73 20 6f 66 20 43 61 6e 61 64 61 |Boards of Canada|
00000070 54 41 4c 42 00 00 00 20 00 00 00 4d 75 73 69 63 |TALB... ...Music|
00000080 20 48 61 73 20 74 68 65 20 52 69 67 68 74 20 74 | Has the Right t|
00000090 6f 20 43 68 69 6c 64 72 65 6e ff fb 90 64 00 00 |o Children...d..|
000000a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 58 69 |..............Xi|
000000c0 6e 67 00 00 00 0f 00 00 16 aa 00 30 95 21 00 03 |ng.........0.!..|
000000d0 06 08 0c 0f 11 15 17 1a 1d 1f 23 26 29 2c 2e 32 |..........#&),.2|
000000e0 35 37 3a 3e 40 43 45 49 4c 4e 52 54 56 59 5b 5e |57:>@CEILNRTVY[^|
000000f0 61 63 66 68 6b 6e 70 73 75 78 7a 7d 7f 82 84 87 |acfhknpsuxz}....|
$ tail -c 128 roygbiv-convert.mp3 | hexdump -C
00000000 54 41 47 52 6f 79 67 62 69 76 00 00 00 00 00 00 |TAGRoygbiv......|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 42 6f 61 72 64 73 20 6f 66 20 43 61 6e 61 64 |.Boards of Canad|
00000030 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 4d |a..............M|
00000040 75 73 69 63 20 48 61 73 20 74 68 65 20 52 69 67 |usic Has the Rig|
00000050 68 74 20 74 6f 20 43 68 69 6c 64 72 65 00 00 00 |ht to Childre...|
00000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff |................|
00000080
$ mpg123-id3dump roygbiv-convert.mp3
FILE: roygbiv-convert.mp3
==== ID3v1 ====
Title: Roygbiv
Artist: Boards of Canada
Album: Music Has the Right to Childre
Year:
Comment:
Genre: 255
==== ID3v2 ====
Title: Roygbiv
Artist: Boards of Canada
Album: Music Has the Right to Children
==== ID3v2 Raw frames ====
TSSE language()
LAME 64bits version 3.100 (http://lame.sf.net)
TIT2 language()
Roygbiv
TPE1 language()
Boards of Canada
TALB language()
Music Has the Right to Children
So, this leaves a question of what to do about the situation. The current solution is broken. Any text passed to sf_set_string()
will be interpreted as if it was latin-1 encoded.
The constraints on the solution are that libsndfile's string API is a lowest-common-denominator approach for multiple formats and doesn't expose all the features of ID3v2.
So it doesn't make sense to pull in a proper ID3 library like taglib, or id3tag or libid3tag (why is the MP3/ID3 software space so fragmented!?). If a user needs all the features of ID3, they should link and use such a library directly.
So a solution would be to just support a subset of id3 as exposed by sf_get/set_string()
, but then we have to deal with switching around text encodings. I would love to import iconv
and just be done with it, but that is a bit heavy-weight, and not supported by Windows.
Libmpg123 has some support for UTF-8 -> UTF-16, so we could leverage that and only write UTF-16 ID3v2 tags, never writing ID3v1 tags.
Thanks a lot for this explanation, @arthurt !
libsndfile does not work like
lame --add-id3v2
, and liblame will only output an id3v2 header in circumstances where tag field sizes are exceeded, or non-latin1 text is added.Any text passed to
sf_set_string()
will be interpreted as if it was latin-1 encoded.
So currently libsndfile doesn't write ID3v2 tags even when there is non-ASCII data present because of this issue of interpreting everything as latin1? Is this latin1 issue coming from libsndfile's or lame's side?
Libmpg123 has some support for UTF-8 -> UTF-16, so we could leverage that and only write UTF-16 ID3v2 tags, never writing ID3v1 tags.
Just my two cents... I think it would be unfortunate to use mpg123 for writing workflows when so far it was only used for reading. And going with UTF-16 is also a bit out of model if libsndfile tries to use UTF-8 everywhere else.. Is there any chance libsndfile could write ID3v2.4 UTF-8 tags (by specifying the 0x03 encoding byte)? I would think that might be the ideal option (as a default, without exposing any additional ID3 features), but I am not sure lame supports ID3v2.4...
Thanks again for all your work on this! <3
So currently libsndfile doesn't write ID3v2 tags even when there is non-ASCII data present because of this issue of interpreting everything as latin1? Is this latin1 issue coming from libsndfile's or lame's side?
It's a libsndfile bug. The libmp3lame api expects latin1 input arguments on the functions libsndfile is using. There are other functions that accept UTF-16, but only set ID3v2 fields.
Just my two cents... I think it would be unfortunate to use mpg123 for writing workflows when so far it was only used for reading.
I agree, I would like to keep the two separated. However, there already is some overlap of using both lame and mpg123 in the ID3 handling, as neither library is fully-featured.
And going with UTF-16 is also a bit out of model if libsndfile tries to use UTF-8 everywhere else..
What I meant by using UTF-16, is that UTF-16 could be used internally. The libsndfile user facing sf_set_string()
would still expect UTF-8 input, but that UTF-8 input would be converted internally to UTF-16 and then lame's UTF-16 ID3v2 functions could be used. This also means that the on-disk format of the ID3v2 header would use UTF-16. On decode, we rely on mpg123 to convert back to UTF-8.
The use of libmpg123 on encode in this case would be for it's ID3 string handling, which already has a UTF-8 -> UTF-16 conversion.
Another option would be to add a platform abstract UTF-8 -> UTF-16 internal function to libsndfile, and use mbstowc()
on *nix, MultiByteToWideChar()
on Windows.
Is there any chance libsndfile could write ID3v2.4 UTF-8 tags (by specifying the 0x03 encoding byte)? I would think that might be the ideal option (as a default, without exposing any additional ID3 features), but I am not sure lame supports ID3v2.4...
Lame doesn't support ID3v2.4 sadly. I'd prefer UTF-8-in-ID3, but until liblame supports ID3v2.4, and until there was confidence that most other software understood ID3v2.4, I don't think its a solution (yet.)
Thanks @arthurt !
What I meant by using UTF-16, is that UTF-16 could be used internally. The libsndfile user facing
sf_set_string()
would still expect UTF-8 input, but that UTF-8 input would be converted internally to UTF-16 and then lame's UTF-16 ID3v2 functions could be used. This also means that the on-disk format of the ID3v2 header would use UTF-16. On decode, we rely on mpg123 to convert back to UTF-8.
For us, the on-disk format of the metadata matters quite a bit (can't really consider it internal). We use libsndfile only for writing MP3 files and trying to avoid using mpg123... Seems like a good number of software does support ID3v2.4 UTF-8 already. I did a quick test by writing ID3v2.4 UTF-8 tags with https://www.mp3tag.de/, and it looked like they were read correctly by Apple Music, VLC, Windows Media Player... But yeah, I am sure there are still quite a few holdouts too. Doesn't look like there is a lot of active liblame development :( I added a feature request for ID3v2.4 on their sourceforge, just in case: https://sourceforge.net/p/lame/feature-requests/84/ If I had more time maybe that's something I could take a stab at...
Hi @arthurt
I am looking into developing a patch to LAME that would add the functionality of writing ID3v2.4 UTF-8 tags. I have something that seems to work, but I am still ironing out the details and could use libsndfile's perspective.
Currently, LAME has two functions to control writing ID3v2.3 tag, id3tag_add_v2
and id3tag_v2_only
. The first one, id3tag_add_v2
, means both ID3v2.3 and ID3v1 tags will be written, while the second, id3tag_v2_only
, means only ID3v2.3 tag will be written. I am thinking of a similar pair of functions for ID3v2.4 UTF-8, id3tag_add_v2_4_UTF8
(both ID3v1 and ID3v2.4 will be written) and id3tag_v2_4_UTF8_only
(only ID3v2.4 tag will be written). Using either of the functions would disable writing ID3v2.3 tags.
No encoding conversion will be done, so the tag text need to be provided as UTF-8, and id3tag_add_v2_4_UTF8
will end up writing UTF-8 into ID3v1 tags (not quite in-spec) in addition to writing ID3v2.4 UTF-8 (correctly indicating the 0x03 encoding byte).
Which of these two alternatives would libsndfile prefer to use?
id3tag_add_v2_4_UTF8
)id3tag_v2_4_UTF8_only
)Using only ID3v2.4 tag (Option 2) would be more in-spec (not dealing with UTF-8 data in ID3v1) and future-looking. Most software seems to support reading ID3v2.4 correctly, and it has been around for >20 years.
What do you think?
By spec, ID3v1 is limited to latin-1. So I think if there were to be a function which wrote UTF-8 tags, when writing the ID3v1 tag it should do a conv
to latin-1, which would do the right thing for characters in latin-1 (eg, é) but could end up just being a string of ????
for characters that aren't (eg, Japanese). Writing UTF-8 in ID3v1 is not the correct answer.
So, in answer to your question, I guess the latter.
Agreed, thanks @arthurt!
I was just thinking that writing UTF-8 to ID3v1 is the current (buggy) behavior, so maybe there was some argument about keeping it for backwards compatibility... But I agree that keeping it as only ID3v2.4 with UTF-8 is the better approach.
We will see how it goes with proposing changes to LAME.
Hi @arthurt and @evpobr ,
I have posted my patch to LAME to add the functionality of writing ID3v2.4 UTF-8 tags here: https://sourceforge.net/p/lame/patches/99/#aad8. As I was describing above, it adds a pair of functions for ID3v2.4 UTF-8, id3tag_add_v2_4_UTF8
(both ID3v1 and ID3v2.4 will be written) and id3tag_v2_4_UTF8_only
(only ID3v2.4 tag will be written).
For my own work, I made a corresponding change in libsndfile (basically just adding a id3tag_v2_4_UTF8_only (pmpeg->lamef) ;
call in mpeg_l3_encoder_write_id3tag()
in libsndfile/src/mpeg_l3_encode.c
). This makes libsndfile to only write ID3v2.4 tags with 0x03 encoding byte which indicates UTF-8 encoding. As libsndfile already always uses UTF-8, this should create correct UTF-8-encoded ID3v2.4 tags. Should I make a pull request with this change?
The tricky part is that I doubt LAME will integrate my patch any time soon (if at all) - there haven't been much development there in recent years. And this change in libsndfile only makes sense after the LAME patch is applied. So it probably wouldn't make sense to integrate this pull request into libsndfile. But maybe there is still some value to have this change publicly visible as an unaccepted pull request in case people what to make the changes to their copies of both LAME and libsndfile on their own?
Let me know what you think!
Hello libsndfile developers!
We write non-ASCII (UTF8-encoded) metadata in MP3 files using libsndfile, but we are observing issues with reading it back in other software. For example it shows up wrong in Windows Media Player and in Apple Music:
Expected text:
Wrong text:
We use libsndfile for writing other formats, and I don't think we had any similar issues with other formats. Here is an interesting example - we write UTF8-encoded metadata in MP3 and OGG in exactly the same way with libsndfile, using
sf_set_string
. VLC media player sees OGG's metadata correctly, but not MP3's:I would appreciate any pointers regarding this, specifically:
We are using libsndfile 1.1.0.
Thanks!