quodlibet / mutagen

Python module for handling audio metadata
https://mutagen.readthedocs.io
GNU General Public License v2.0

mid3v2 crashes with "UnicodeEncodeError: surrogates not allowed" on files with accented characters in the filename #648

Open martinwguy opened 4 months ago

martinwguy commented 4 months ago

I was checking whether ISRC tags are present in a large audio collection using mid3v2 -l 00*/*3 | grep -a TSRC, but mid3v2 dies halfway through, saying

IDv2 tag info for 00-225167/mina - volami nel cuore.mp3
TIT2=Volami nel cuore
TPE1=MINA
TRCK=1
IDv2 tag info for Traceback (most recent call last):
  File "/usr/bin/mid3v2", line 33, in <module>
    sys.exit(load_entry_point('mutagen==1.46.0', 'console_scripts', 'mid3v2')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/mutagen/_tools/mid3v2.py", line 484, in entry_point
    return main(sys.argv)
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/mutagen/_tools/mid3v2.py", line 469, in main
    list_tags(args)
  File "/usr/lib/python3/dist-packages/mutagen/_tools/mid3v2.py", line 335, in list_tags
    print("IDv2 tag info for", filename)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc85' in position 13: surrogates not allowed

This isn't Mina's fault; the culprit is the next file, whose name is ANSI or CP437 encoded: "modà - la notte.mp3", where à is stored as the single byte 0x85. The same goes for other files whose names contain 0x8A for è, 0xB4 for é, 0x95 for ò, 0x97 for ù, 0xA2 for ó and so on.
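
The failure can be reproduced without the files themselves. The following is a minimal sketch, assuming Python decodes file names as UTF-8 with the surrogateescape error handler (its behaviour under a UTF-8 locale such as the one reported here). The byte string is reconstructed from the description above and omits the directory part, so the error position differs from the traceback.

```python
# File name as raw bytes on disk: 0x85 is "à" in CP437, but on its own
# it is not valid UTF-8 (a stray continuation byte).
raw_name = b"mod\x85 - la notte.mp3"

# Under a UTF-8 locale Python decodes file names with "surrogateescape",
# which maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
filename = raw_name.decode("utf-8", errors="surrogateescape")

# print() encodes with the strict handler by default, and strict UTF-8
# encoding rejects lone surrogates. This is the error in the traceback.
try:
    filename.encode("utf-8")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

# Decoding the same bytes as CP437 recovers the intended name.
intended_name = raw_name.decode("cp437")
```

Here filename contains '\udc85' in place of the accented letter, which matches the character named in the error message.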

This is on Debian GNU/Linux with LANG=en_GB.UTF-8.

lazka commented 3 months ago

Python is sadly still broken in this case. We could reopen stdout etc. with the surrogateescape error handler to work around it.
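
A rough sketch of that workaround, not mutagen's actual code: a text stream opened with errors="surrogateescape" writes each escaped surrogate back out as the original byte instead of raising. The write_name helper and the in-memory buffer are illustrative only. For the real stdout, sys.stdout.reconfigure (available since Python 3.7) changes the error handler in place.

```python
import io
import sys


def write_name(stream, filename):
    """Write a possibly surrogate-escaped file name followed by a newline."""
    stream.write("IDv2 tag info for " + filename + "\n")
    stream.flush()


# Stand-in for stdout so the raw output bytes can be inspected.
buffer = io.BytesIO()
stream = io.TextIOWrapper(
    buffer, encoding="utf-8", errors="surrogateescape", newline="\n"
)

# A file name as Python would see it under a UTF-8 locale.
filename = b"mod\x85 - la notte.mp3".decode("utf-8", errors="surrogateescape")
write_name(stream, filename)

# The escaped surrogate has been written back as the original byte 0x85.
output_bytes = buffer.getvalue()

# The equivalent for the real stdout (guarded, since a replaced stdout
# object may not support reconfigure):
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(errors="surrogateescape")
```

The output then contains the raw byte 0x85 rather than valid UTF-8, so anything reading it downstream has to tolerate that, as grep -a does in the original command.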

antlarr commented 3 months ago

I think this is neither mutagen's nor Python's fault. If the filename is encoded in CP437 rather than UTF-8, which is what Python expects according to your LANG setting, then I'd say the best fix is to re-encode the filenames correctly.

This can be done with: convmv -f cp437 -t utf-8 *. That will only show how the files would be renamed, without changing anything. Once you have checked that the encoding is right, you can run: convmv -f cp437 -t utf-8 --notest * to actually rename the files on disk.
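
Where convmv is not available, the same check-first-then-rename approach could be sketched in Python. This is an illustration only, assuming every name that is not valid UTF-8 is CP437; the function names and the dry_run parameter are made up for this example, and convmv itself performs more checks.

```python
import os


def reencoded_name(name, source_encoding="cp437"):
    """Return the UTF-8 form of a raw byte name, or None if already valid UTF-8."""
    try:
        name.decode("utf-8")
        return None  # already valid UTF-8, leave it alone
    except UnicodeDecodeError:
        return name.decode(source_encoding).encode("utf-8")


def rename_all(directory, source_encoding="cp437", dry_run=True):
    """List the planned renames; perform them only when dry_run is False."""
    base = os.fsencode(directory)
    planned = []
    # Passing bytes to os.listdir makes it return raw byte names.
    for old in sorted(os.listdir(base)):
        new = reencoded_name(old, source_encoding)
        if new is None:
            continue
        target = os.path.join(base, new)
        if os.path.exists(target):
            continue  # never overwrite an existing file
        planned.append((old, new))
        print(old, "->", new.decode("utf-8"))
        if not dry_run:
            os.rename(os.path.join(base, old), target)
    return planned
```

Calling rename_all(".") first and inspecting the printed names mirrors the convmv test run; rename_all(".", dry_run=False) corresponds to --notest.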