quodlibet / mutagen

Python module for handling audio metadata
https://mutagen.readthedocs.io
GNU General Public License v2.0

When saving ID3v1, tags are written in Latin-1 encoding -- is this correct? #354

Open otherjon opened 5 years ago

otherjon commented 5 years ago

When saving an ID3 tag, e.g. with the following code, the ID3v1 tag is encoded with Latin-1:

    from mutagen.id3 import ID3, TIT2, Encoding, ID3v1SaveOptions

    fn = "./Créature.mp3"
    # 'é' is code point U+00E9: b'\xe9' in Latin-1, b'\xc3\xa9' in UTF-8
    title = u'Créature'
    i3data = ID3()
    i3data.add(TIT2(encoding=Encoding.UTF8, text=title))
    # Save the ID3v2 tag and also create or update the ID3v1 tag
    i3data.save(fn, v1=ID3v1SaveOptions.CREATE)

This appears to be very intentional, as it is hard-coded in the _id3v1.MakeID3v1 function:

    for v2id, name in {"TIT2": "title", "TPE1": "artist",
                       "TALB": "album"}.items():
        if v2id in id3:
            text = id3[v2id].text[0].encode('latin1', 'replace')[:30]
    ...
    if "COMM" in id3:
        cmnt = id3["COMM"].text[0].encode('latin1', 'replace')[:28]

However, when ID3v1 tags are written in Latin-1, mutagen itself can't read them back in. Also, other ID3v1 tools such as mp3info write such characters as UTF-8 rather than Latin-1 (and fail to read Latin-1 tags that mutagen has written).

By manually editing my local copy of mutagen and replacing "latin1" with "utf8" in the lines above, I experimentally confirm that ID3v1 tags written in UTF-8 are readable both by mutagen and by mp3info. Is there a spec which requires Latin-1 encoding for ID3v1, or other strong reason to leave the encoding as-is? (If so, could you point me to a reference?)
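To make the difference concrete, here is a minimal Python 3 sketch of what changes when the codec in those two lines is swapped. `pack_v1_field` is a hypothetical helper written for this illustration, not a mutagen function, and padding the field with NUL bytes is an assumption about the fixed-width ID3v1 layout.

```python
# Hypothetical helper for illustration only; not part of mutagen.
def pack_v1_field(text, encoding, size=30):
    """Encode text, cut it to `size` bytes, and pad with NUL bytes (assumed layout)."""
    raw = text.encode(encoding, "replace")[:size]
    return raw.ljust(size, b"\x00")

title = "Créature"
latin1_field = pack_v1_field(title, "latin1")
utf8_field = pack_v1_field(title, "utf8")

print(latin1_field.rstrip(b"\x00"))  # b'Cr\xe9ature'
print(utf8_field.rstrip(b"\x00"))    # b'Cr\xc3\xa9ature'

# Characters outside Latin-1 become '?' under the 'replace' error handler.
print("日本語".encode("latin1", "replace"))  # b'???'

# A byte-level cut can split a multi-byte UTF-8 sequence in half.
long_title = "é" * 20                  # 40 bytes in UTF-8
cut = long_title.encode("utf8")[:29]   # ends with a lone lead byte
last = cut.decode("utf8", "replace")[-1]
print(ascii(last))                     # '\ufffd'
```

So switching to UTF-8 preserves more characters, but the existing byte slice would then need to respect character boundaries.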

otherjon commented 5 years ago

Correction to the above: mutagen can read back the ID3v1 tags it writes, but its output disagrees with that of mp3info (among others). Furthermore, regardless of the encoding set on the frame before writing (e.g. Encoding.UTF8 above), when mutagen reads the resulting file back in, the text frame always reports Encoding.LATIN1 (which is incorrect in this case, and presumably related to the issue I'm seeing). That is the only way I can create a failing test using only mutagen: add a frame with Encoding.UTF8, write the file, read it back in, and note that the frame has Encoding.LATIN1.

A file with ID3v1 tags set by mp3info, when read in by mutagen, shows the following:

>>> with open(fn) as f: tags, offset=find_id3v1(f)
>>> tags
{'TPE1': TPE1(encoding=<Encoding.LATIN1: 0>, text=[u'Beyonc\xc3\xa9'])}

Note the unicode text, which should be u'Beyonc\xe9'. (u'Beyonc\xe9' == 'Beyonc\xc3\xa9'.decode('utf8'), but of course utf8 != Encoding.LATIN1, which is the problem.)
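The same effect can be reproduced without mutagen or a file. This is a small Python 3 sketch of how UTF-8 bytes turn into the text shown above when they are decoded as Latin-1; the variable names are illustrative only.

```python
name = "Beyoncé"
utf8_bytes = name.encode("utf-8")        # b'Beyonc\xc3\xa9'

# Decoding as Latin-1 maps every byte to exactly one code point.
misread = utf8_bytes.decode("latin-1")
print(ascii(misread))                    # 'Beyonc\xc3\xa9'
print(len(name), len(misread))           # 7 8

# Reversing the wrong decode and applying the right one restores the name.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == name)                  # True
```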

lazka commented 5 years ago

We assume ID3v1 is Latin-1 because ID3v1 doesn't specify an encoding, and this way we at least get the data back out, since every byte string is valid Latin-1.
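As a quick illustration of that property, the Python 3 sketch below checks that all 256 byte values survive a Latin-1 decode and re-encode, while a UTF-8 decode of Latin-1 bytes can fail outright. It only uses the standard codecs and is not taken from mutagen.

```python
# Every possible byte value decodes under Latin-1 and round-trips unchanged.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
round_trip_ok = as_text.encode("latin-1") == all_bytes
print(round_trip_ok)  # True

# UTF-8 is stricter: a lone 0xE9 byte followed by ASCII is not a valid sequence.
utf8_failed = False
try:
    b"Cr\xe9ature".decode("utf-8")
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)  # True
```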

Another reason is that in Quod Libet we generally distrust Latin-1 frames and try to decode them differently, because in many cases they come from old software and are wrong. The same is true for ID3v1 fields, which allows us to handle both problems in one go.
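One way such a second-guessing step could look is sketched below. This is not Quod Libet's actual code: `redecode_suspect_latin1` and its `candidates` parameter are hypothetical names, and trying UTF-8 first is an assumption made for the example.

```python
# Hypothetical sketch; not Quod Libet's or mutagen's implementation.
def redecode_suspect_latin1(text, candidates=("utf-8",)):
    """Recover the raw bytes of a Latin-1 decoded string and retry other codecs."""
    try:
        raw = text.encode("latin-1")
    except UnicodeEncodeError:
        # Contains characters outside Latin-1, so it was not decoded as Latin-1.
        return text
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # No candidate fit; keep the Latin-1 interpretation.
    return text

print(redecode_suspect_latin1("Beyonc\xc3\xa9") == "Beyoncé")   # True
print(redecode_suspect_latin1("Créature") == "Créature")        # True
```

A heuristic like this can misfire on text that really was Latin-1 but happens to form a valid UTF-8 sequence, so it is a guess rather than a guarantee.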

lazka commented 5 years ago

I was wondering which encoding mp3info uses, and it turns out it just copies the bytes as-is, so the result depends on the locale encoding.
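To show why copying bytes as-is makes the stored tag locale-dependent, here is a short Python 3 sketch that encodes the same title under a few encodings a terminal might use. The list of encodings is an arbitrary choice for illustration, not something mp3info itself consults.

```python
import locale

title = "Créature"

# The same text yields different bytes depending on the encoding in effect.
encoded = {enc: title.encode(enc) for enc in ("utf-8", "latin-1", "cp437")}
for enc, raw in encoded.items():
    print(enc, raw)
# utf-8 b'Cr\xc3\xa9ature'
# latin-1 b'Cr\xe9ature'
# cp437 b'Cr\x82ature'

# The encoding Python reports for the current locale (varies by system).
print(locale.getpreferredencoding(False))
```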