libsndfile / libsndfile

A C library for reading and writing sound files containing sampled audio data.
http://libsndfile.github.io/libsndfile/
GNU Lesser General Public License v2.1
1.45k stars 384 forks source link

MP3 non-ASCII metadata encoding issues? #915

Open krisfed opened 1 year ago

krisfed commented 1 year ago

Hello libsndfile developers!

We write non-ASCII (UTF8-encoded) metadata in MP3 files using libsndfile, but we are observing issues with reading it back in other software. For example it shows up wrong in Windows Media Player and in Apple Music:

Expected text: expectedData

Wrong text:

AppleMusic WindowsMedia

We use libsndfile for writing other formats, and I don't think we had any similar issues with other formats. Here is an interesting example - we write UTF8-encoded metadata in MP3 and OGG in exactly the same way with libsndfile, using sf_set_string. VLC media player sees OGG's metadata correctly, but not MP3's:

VLC_MP3_vs_OGG

I would appreciate any pointers regarding this, specifically:

  1. Are there known issues or limitations about non-ASCII MP3 metadata in libsndfile?
  2. Should MP3 metadata be written using something other than UTF8?
  3. Is it more likely to be an issue with reading software and assumptions it's making or with the way we are writing the metadata with libsndfile?

We are using libsndfile 1.1.0.

Thanks!

krisfed commented 1 year ago

I looked at the resulting MP3 file in a hex editor, and it seems like libsndfile writes MP3 metadata as ID3v1. Is that right? From my limited understanding, ID3v1 either doesn't really support UTF-8 or there is no way to indicate which encoding is used in a ID3v1 tag... Please do correct me if this is wrong.

Is there any way to write ID3v2 (preferably ID3v2.4, which seem to explicitly support UTF-8) to MP3 files with libsndfile? Would appreciate any info!

evpobr commented 1 year ago

Hi @krisfed . AFAIK libsndfile writes and reads ID3v1 and ID3v2, no ID3v4 yet:

https://github.com/libsndfile/libsndfile/blob/71f7bde180ade1869542d36483bc2e2c809c2189/src/mpeg_l3_encode.c#L126-L191

@arthurt ?

krisfed commented 1 year ago

Thank you @evpobr ! Can writing of the ID3v2 tags be achieved through sf_set_string? I see lame_get_id3v2_tag is used to determine if it is a ID3v2 header, but I am not sure how to control that from the libsndfile's public interface... Would love an example or some pointers!

krisfed commented 1 year ago

I did a bit more digging, and I understand that there are some conventions about which encoding is used by which ID3 tag version. It is too bad that https://id3.org/ seems to be down at the moment, but this is what I was able to gather (do correct me if something below is off):

The above might explain why reading UTF-8 data from ID3v1 does not always work in all software (if it expects/assumes ISO-8859-1). It would be nice if libsndfile supported ID3v2.4 (?), which should explicitly support UTF-8. Using ID3v2.2/ID3v2.3 might be the next best option (?), but it still doesn't explicitly support UTF-8. Should UTF-16 be used instead for writing ID3v2.2/ID3v2.3 tags?

I am still unclear on how to make libsndfile use ID3v2.2/ID3v2.3 instead of ID3v1. Should the string passed into sf_set_string contain the encoding byte or something else that would indicate it is for writing into a ID3v2.2/ID3v2.3 tag? I see a comment here https://askubuntu.com/questions/622132/how-to-write-album-artist-id3-tag-using-lame that the command-line LAME tool at least at some point used the size of the data to determine whether to use ID3v2 (i.e. use ID3v2 if data won't fit into ID3v1) if specific '--add-id3v2' or '--id3v2-only' options were not used. Does libsndfile use a similar kind of logic?

Would appreciate any clarifications, corrections, or thoughts. Thanks!

arthurt commented 1 year ago

Libsndfile supports both ID3 v1 and ID3 v2, using both libmp3lame and libmpg123 to do the actual work.

https://github.com/libsndfile/libsndfile/blob/master/src/mpeg_decode.c#L462

When reading a file, v2 tags take precedence over v1 tags. libmpg123 does the right thing, and all strings it returns are UTF-8, regardless of the actual encoding of the tag, so all strings returned by sf_get_string() will also be UTF-8.

For encode, libsndfile works basically like lame --add-id3v2 and adds both a ID3v2 header and ID3v1 trailer.

However, your questions caused me to investigate, and currently there is a bug with the text encoding when setting ID3 tags. The current state of the code is that the values passed to sf_set_string() are passed to lame as latin1 (aka ISO-8859-1, aka CP-1252). If they happen to be UTF-8, they will be written into the file incorrectly as "latin1" for both ID3v1 and ID3v2. So that's a bug.

In general, the encoding of the text passed through sf_get_string()/sf_set_string() isn't really specified. For sane formats like Ogg and Flac, it's UTF-8 and we can relax. For WAV and WAV like, literally nobody knows. It isn't in the spec, some software uses latin1, others use UTF-8. All we know is that it isn't wide characters. I guess libsndfile is doing the only thing it can, and just reporting the stream of bytes that it sees.

krisfed commented 1 year ago

Thank you for the info, @arthurt ! Really appreciate your help with this.

To clarify, in our workflow we do not use libsndfile for reading, only for writing MP3 files.

Sounds like the bug with writing UTF-8 metadata as latin1 could explain why we are seeing other software (like Apple Music and Windows Media Player) not reading MP3 metadata correctly from files written by libsndfile.

It is useful to know that libsndfile is supposed to write both a ID3v2 header and ID3v1 trailer. The odd thing is that I don't see ID3v2 tags in the MP3 files written by libsndfile, only the ID3v1 tags. That's why I was asking how to make libsndfile use ID3v2 tags. From my (limited) understanding, ID3v1 tags begin with "TAG" and are at the end of the file. The ID3v2 usually occur at the start of the file and should start with "ID3".

This is what I am seeing when I look at the MP3 file written by libsndfile in a hex editor.

End of the file seem to contain ID3v1 tag as expected:

endOfLibsndfileMP3

But start of the file does not seem to have a ID3v2 as far as I can tell:

startOfLibsndfileMP3

Compare that with the start of a file where metadata was written by Apple Music, where I can see ID3v2 tag starting with "ID3":

startOfAppleMP3

Similarly when I look at those two files with Mp3tag (https://www.mp3tag.de/](https://www.mp3tag.de/ ), it shows the tag version is ID3v1 for file written with libsndfile, and ID3v2.2 for file with metadata written by Apple Music.

Would love to hear your thoughts on this! Thanks again for all the help.

arthurt commented 1 year ago

Turns out I was wrong on that point. I wrote the code, but it was a long time ago now :laughing:

libsndfile does not work like lame --add-id3v2, and liblame will only output an id3v2 header in circumstances where tag field sizes are exceeded,or non-latin1 text is added.

If you exceed an ID3v1 field length, you will see an ID3v2 header, which is latin1 encoded:

$ head -c 256 roygbiv-convert.mp3 | hexdump -C
00000000  49 44 33 03 00 00 00 00  01 10 54 53 53 45 00 00  |ID3.......TSSE..|
00000010  00 2f 00 00 00 4c 41 4d  45 20 36 34 62 69 74 73  |./...LAME 64bits|
00000020  20 76 65 72 73 69 6f 6e  20 33 2e 31 30 30 20 28  | version 3.100 (|
00000030  68 74 74 70 3a 2f 2f 6c  61 6d 65 2e 73 66 2e 6e  |http://lame.sf.n|
00000040  65 74 29 54 49 54 32 00  00 00 08 00 00 00 52 6f  |et)TIT2.......Ro|
00000050  79 67 62 69 76 54 50 45  31 00 00 00 11 00 00 00  |ygbivTPE1.......|
00000060  42 6f 61 72 64 73 20 6f  66 20 43 61 6e 61 64 61  |Boards of Canada|
00000070  54 41 4c 42 00 00 00 20  00 00 00 4d 75 73 69 63  |TALB... ...Music|
00000080  20 48 61 73 20 74 68 65  20 52 69 67 68 74 20 74  | Has the Right t|
00000090  6f 20 43 68 69 6c 64 72  65 6e ff fb 90 64 00 00  |o Children...d..|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 58 69  |..............Xi|
000000c0  6e 67 00 00 00 0f 00 00  16 aa 00 30 95 21 00 03  |ng.........0.!..|
000000d0  06 08 0c 0f 11 15 17 1a  1d 1f 23 26 29 2c 2e 32  |..........#&),.2|
000000e0  35 37 3a 3e 40 43 45 49  4c 4e 52 54 56 59 5b 5e  |57:>@CEILNRTVY[^|
000000f0  61 63 66 68 6b 6e 70 73  75 78 7a 7d 7f 82 84 87  |acfhknpsuxz}....|

$ tail -c 128 roygbiv-convert.mp3 | hexdump -C
00000000  54 41 47 52 6f 79 67 62  69 76 00 00 00 00 00 00  |TAGRoygbiv......|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 42 6f 61 72 64 73 20  6f 66 20 43 61 6e 61 64  |.Boards of Canad|
00000030  61 00 00 00 00 00 00 00  00 00 00 00 00 00 00 4d  |a..............M|
00000040  75 73 69 63 20 48 61 73  20 74 68 65 20 52 69 67  |usic Has the Rig|
00000050  68 74 20 74 6f 20 43 68  69 6c 64 72 65 00 00 00  |ht to Childre...|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 ff  |................|
00000080

$ mpg123-id3dump roygbiv-convert.mp3 
FILE: roygbiv-convert.mp3

====      ID3v1       ====
Title: Roygbiv
Artist: Boards of Canada
Album: Music Has the Right to Childre
Year: 
Comment: 
Genre: 255
====      ID3v2       ====
Title: Roygbiv
Artist: Boards of Canada
Album: Music Has the Right to Children

==== ID3v2 Raw frames ====
TSSE language()
 LAME 64bits version 3.100 (http://lame.sf.net)
TIT2 language()
 Roygbiv
TPE1 language()
 Boards of Canada
TALB language()
 Music Has the Right to Children
arthurt commented 1 year ago

So, this leaves a question of what to do about the situation. The current solution is broken. Any text passed to sf_set_string() will be interpreted as if it was latin-1 encoded.

The constraints on the solution are that libsndfile's string API is a lowest-common-denominator approach for multiple formats and doesn't expose all the features of ID3v2.

So it doesn't make sense to pull in a proper ID3 library like taglib, or id3tag or libid3tag (why is the MP3/ID3 software space so fragmented!?). If a user needs all the features of ID3, they should link and use such a library directly.

So a solution would be to just support a subset of id3 as exposed by sf_get/set_string(), but then we have to deal with switching around text encodings. I would love to import iconv and just be done with it, but that is a bit heavy-weight, and not supported by Windows.

Libmpg123 has some support for UTF-8 -> UTF-16, so we could leverage that and only write UTF-16 ID3v2 tags, never writing ID3v1 tags.

krisfed commented 1 year ago

Thanks a lot for this explanation, @arthurt !

libsndfile does not work like lame --add-id3v2, and liblame will only output an id3v2 header in circumstances where tag field sizes are exceeded, or non-latin1 text is added.

Any text passed to sf_set_string() will be interpreted as if it was latin-1 encoded.

So currently libsndfile doesn't write ID3v2 tags even when there is non-ASCII data present because of this issue of interpreting everything as latin1? Is this latin1 issue coming from libsndfile's or lame's side?

Libmpg123 has some support for UTF-8 -> UTF-16, so we could leverage that and only write UTF-16 ID3v2 tags, never writing ID3v1 tags.

Just my two cents... I think it would be unfortunate to use mpg123 for writing workflows when so far it was only used for reading. And going with UTF-16 is also a bit out of model if libsndfile tries to use UTF-8 everywhere else.. Is there any chance libsndfile could write ID3v2.4 UTF-8 tags (by specifying the 0x03 encoding byte)? I would think that might be the ideal option (as a default, without exposing any additional ID3 features), but I am not sure lame supports ID3v2.4...

Thanks again for all your work on this! <3

arthurt commented 1 year ago

So currently libsndfile doesn't write ID3v2 tags even when there is non-ASCII data present because of this issue of interpreting everything as latin1? Is this latin1 issue coming from libsndfile's or lame's side?

It's a libsndfile bug. The libmp3lame api expects latin1 input arguments on the functions libsndfile is using. There are other functions that accept UTF-16, but only set ID3v2 fields.

Just my two cents... I think it would be unfortunate to use mpg123 for writing workflows when so far it was only used for reading.

I agree, I would like to keep the two separated. However, there already is some overlap of using both lame and mpg123 in the ID3 handling, as neither library is fully-featured.

And going with UTF-16 is also a bit out of model if libsndfile tries to use UTF-8 everywhere else..

What I meant by using UTF-16, is that UTF-16 could be used internally. The libsndfile user facing sf_set_string() would still expect UTF-8 input, but that UTF-8 input would be converted internally to UTF-16 and then lame's UTF-16 ID3v2 functions could be used. This also means that the on-disk format of the ID3v2 header would use UTF-16. On decode, we rely on mpg123 to convert back to UTF-8.

The use of libmpg123 on encode in this case would be for it's ID3 string handling, which already has a UTF-8 -> UTF-16 conversion.

Another option would be to add a platform abstract UTF-8 -> UTF-16 internal function to libsndfile, and use mbstowc() on *nix, MultiByteToWideChar() on Windows.

Is there any chance libsndfile could write ID3v2.4 UTF-8 tags (by specifying the 0x03 encoding byte)? I would think that might be the ideal option (as a default, without exposing any additional ID3 features), but I am not sure lame supports ID3v2.4...

Lame doesn't support ID3v2.4 sadly. I'd prefer UTF-8-in-ID3, but until liblame supports ID3v2.4, and until there was confidence that most other software understood ID3v2.4, I don't think its a solution (yet.)

krisfed commented 1 year ago

Thanks @arthurt !  

What I meant by using UTF-16, is that UTF-16 could be used internally. The libsndfile user facing sf_set_string() would still expect UTF-8 input, but that UTF-8 input would be converted internally to UTF-16 and then lame's UTF-16 ID3v2 functions could be used. This also means that the on-disk format of the ID3v2 header would use UTF-16. On decode, we rely on mpg123 to convert back to UTF-8.

For us, the on-disk format of the metadata matters quite a bit (can't really consider it internal). We use libsndfile only for writing MP3 files and trying to avoid using mpg123...   Seems like a good number of software does support ID3v2.4 UTF-8 already. I did a quick test by writing ID3v2.4 UTF-8 tags with https://www.mp3tag.de/, and it looked like they were read correctly by Apple Music, VLC, Windows Media Player... But yeah, I am sure there are still quite a few holdouts too.   Doesn't look like there is a lot of active liblame development :( I added a feature request for ID3v2.4 on their sourceforge, just in case: https://sourceforge.net/p/lame/feature-requests/84/   If I had more time maybe that's something I could take a stab at...

krisfed commented 1 year ago

Hi @arthurt

I am looking into developing a patch to LAME that would add the functionality of writing ID3v2.4 UTF-8 tags. I have something that seems to work, but I am still ironing out the details and could use libsndfile's perspective.

Currently, LAME has two functions to control writing ID3v2.3 tag, id3tag_add_v2 and id3tag_v2_only. The first one, id3tag_add_v2, means both ID3v2.3 and ID3v1 tags will be written, while the second, id3tag_v2_only, means only ID3v2.3 tag will be written. I am thinking of a similar pair of functions for ID3v2.4 UTF-8, id3tag_add_v2_4_UTF8 (both ID3v1 and ID3v2.4 will be written) and id3tag_v2_4_UTF8_only (only ID3v2.4 tag will be written). Using either of the functions would disable writing ID3v2.3 tags.

No encoding conversion will be done, so the tag text need to be provided as UTF-8, and id3tag_add_v2_4_UTF8 will end up writing UTF-8 into ID3v1 tags (not quite in-spec) in addition to writing ID3v2.4 UTF-8 (correctly indicating the 0x03 encoding byte).

Which of these two alternatives would libsndfile prefer to use?

  1. always writing both ID3v1 and ID3v2.4 tags with UTF-8 data in both (id3tag_add_v2_4_UTF8)
  2. always writing only the newer ID3v2.4 tag with UTF-8 data (id3tag_v2_4_UTF8_only)

Using only ID3v2.4 tag (Option 2) would be more in-spec (not dealing with UTF-8 data in ID3v1) and future-looking. Most software seems to support reading ID3v2.4 correctly, and it has been around for >20 years.

What do you think?

arthurt commented 1 year ago

By spec, ID3v1 is limited to latin-1. So I think if there were to be a function which wrote UTF-8 tags, when writing the ID3v1 tag it should do a conv to latin-1, which would do the right thing for characters in latin-1 (eg, é) but could end up just being a string of ???? for characters that aren't (eg, Japanese). Writing UTF-8 in ID3v1 is not the correct answer.

So, in answer to your question, I guess the latter.

krisfed commented 1 year ago

Agreed, thanks @arthurt!

I was just thinking that writing UTF-8 to ID3v1 is the current (buggy) behavior, so maybe there was some argument about keeping it for backwards compatibility... But I agree that keeping it as only ID3v2.4 with UTF-8 is the better approach.

We will see how it goes with proposing changes to LAME.

krisfed commented 1 year ago

Hi @arthurt and @evpobr ,

I have posted my patch to LAME to add the functionality of writing ID3v2.4 UTF-8 tags here: https://sourceforge.net/p/lame/patches/99/#aad8. As I was describing above, it adds a pair of functions for ID3v2.4 UTF-8, id3tag_add_v2_4_UTF8 (both ID3v1 and ID3v2.4 will be written) and id3tag_v2_4_UTF8_only (only ID3v2.4 tag will be written).

For my own work, I made a corresponding change in libsndfile (basically just adding a id3tag_v2_4_UTF8_only (pmpeg->lamef) ; call in mpeg_l3_encoder_write_id3tag() in libsndfile/src/mpeg_l3_encode.c). This makes libsndfile to only write ID3v2.4 tags with 0x03 encoding byte which indicates UTF-8 encoding. As libsndfile already always uses UTF-8, this should create correct UTF-8-encoded ID3v2.4 tags. Should I make a pull request with this change?

The tricky part is that I doubt LAME will integrate my patch any time soon (if at all) - there haven't been much development there in recent years. And this change in libsndfile only makes sense after the LAME patch is applied. So it probably wouldn't make sense to integrate this pull request into libsndfile. But maybe there is still some value to have this change publicly visible as an unaccepted pull request in case people what to make the changes to their copies of both LAME and libsndfile on their own?

Let me know what you think!