nschlia / ffmpegfs

FUSE-based transcoding filesystem with video support from many formats to FLAC, MP4, TS, WebM, OGG, MP3, HLS, and others.
https://nschlia.github.io/ffmpegfs/
GNU General Public License v3.0

CUE sheet UTF-8 encoding problem - broken track names with accented characters. #133

Closed. cybern0id closed this issue 2 years ago

cybern0id commented 2 years ago

Issue:

Certain UTF-8 encoded CUE sheets cause ffmpegfs to show broken track names for the transcoded files in the .flac.tracks/ directory.

Symptom:

For example, this CUE sheet:

~:$ file -i flac/Air/2012\ -\ Le\ Voyage\ dans\ la\ Lune/Air\ -\ Le\ Voyage\ dans\ la\ Lune.cue
flac/Air/2012 - Le Voyage dans la Lune/Air - Le Voyage dans la Lune.cue: text/plain; charset=utf-8

~:$ cat flac/Air/2012\ -\ Le\ Voyage\ dans\ la\ Lune/Air\ -\ Le\ Voyage\ dans\ la\ Lune.cue
REM DISCID 8A07580B
PERFORMER "Air"
TITLE "Le Voyage dans la Lune"
CATALOG 5099995563329
REM DATE 2012
REM DISCNUMBER 1
REM TOTALDISCS 1
REM COMMENT "CUERipper v2.2.1 Copyright (C) 2008-2022 Grigory Chudov"
FILE "Air - Le Voyage dans la Lune.flac" WAVE
  TRACK 01 AUDIO
    PERFORMER "Air"
    TITLE "Astronomic Club"
    INDEX 01 00:00:00
  TRACK 02 AUDIO
    PERFORMER "Air"
    TITLE "Seven Stars"
    INDEX 01 03:12:47
  TRACK 03 AUDIO
    PERFORMER "Air"
    TITLE "Retour sur Terre"
    INDEX 01 07:35:36
  TRACK 04 AUDIO
    PERFORMER "Air"
    TITLE "Parade"
    INDEX 01 08:08:15
  TRACK 05 AUDIO
    PERFORMER "Air"
    TITLE "Moon Fever"
    INDEX 01 10:40:71
  TRACK 06 AUDIO
    PERFORMER "Air"
    TITLE "Sonic Armada"
    INDEX 01 14:15:15
  TRACK 07 AUDIO
    PERFORMER "Air"
    TITLE "Who Am I Now?"
    INDEX 01 19:20:15
  TRACK 08 AUDIO
    PERFORMER "Air"
    TITLE "Décollage"
    INDEX 01 22:20:57
  TRACK 09 AUDIO
    PERFORMER "Air"
    TITLE "Cosmic Trip"
    INDEX 01 23:58:34
  TRACK 10 AUDIO
    PERFORMER "Air"
    TITLE "Homme Lune"
    INDEX 01 28:08:42
  TRACK 11 AUDIO
    PERFORMER "Air"
    TITLE "Lava"
    INDEX 01 28:26:69

~:$ sudo mount /mnt/ffmpegfs/
~:$ ls -lah /mnt/ffmpegfs/Air/2012\ -\ Le\ Voyage\ dans\ la\ Lune/Air\ -\ Le\ Voyage\ dans\ la\ Lune.flac.tracks/
total 725M
drw-r--r-- 2 user user 4.0K Apr 21 23:09  .
drwxr-xr-x 2 user user 4.0K Apr 21 23:24  ..
-rw-r--r-- 1 user user  75M Apr 21 23:09 '01. Air - Astronomic Club [03-12.626].webm'
-rw-r--r-- 1 user user 102M Apr 21 23:09 '02. Air - Seven Stars [04-22.853].webm'
-rw-r--r-- 1 user user  13M Apr 21 23:09 '03. Air - Retour sur Terre [00-32.720].webm'
-rw-r--r-- 1 user user  59M Apr 21 23:09 '04. Air - Parade [02-32.746].webm'
-rw-r--r-- 1 user user  83M Apr 21 23:09 '05. Air - Moon Fever [03-34.253].webm'
-rw-r--r-- 1 user user 118M Apr 21 23:09 '06. Air - Sonic Armada [05-05.000].webm'
-rw-r--r-- 1 user user  70M Apr 21 23:09 '07. Air - Who Am I Now? [03-00.560].webm'
-rw-r--r-- 1 user user  38M Apr 21 23:09 '08. Air - DĂŠcollage [01-37.693].webm'
-rw-r--r-- 1 user user  97M Apr 21 23:09 '09. Air - Cosmic Trip [04-10.106].webm'
-rw-r--r-- 1 user user 7.1M Apr 21 23:09 '10. Air - Homme Lune [00-18.360].webm'
-rw-r--r-- 1 user user  67M Apr 21 23:09 '11. Air - Lava [02-53.133].webm'

Other UTF-8 encoded CUE sheets that contain several accented characters, either within one track name or spread across multiple track names with one accented character each, do not suffer from this problem. For example:

~:$ file -i flac/Air/1998\ -\ Moon\ Safari/Air\ -\ Moon\ Safari.cue
flac/Air/1998 - Moon Safari/Air - Moon Safari.cue: text/plain; charset=utf-8

$ cat flac/Air/1998\ -\ Moon\ Safari/Air\ -\ Moon\ Safari.cue
REM DISCID 7B0A420A
PERFORMER "Air"
TITLE "Moon Safari"
CATALOG 0724384497828
REM DATE 1998
REM DISCNUMBER 1
REM TOTALDISCS 1
REM COMMENT "CUERipper v2.2.1 Copyright (C) 2008-2022 Grigory Chudov"
FILE "Air - Moon Safari.flac" WAVE
  TRACK 01 AUDIO
    PERFORMER "Air"
    TITLE "La Femme d’argent"
    INDEX 01 00:00:00
  TRACK 02 AUDIO
    PERFORMER "Air"
    TITLE "Sexy Boy"
    INDEX 01 07:11:22
  TRACK 03 AUDIO
    PERFORMER "Air"
    TITLE "All I Need"
    INDEX 01 12:09:57
  TRACK 04 AUDIO
    PERFORMER "Air"
    TITLE "Kelly, Watch the Stars!"
    INDEX 01 16:38:05
  TRACK 05 AUDIO
    PERFORMER "Air"
    TITLE "Talisman"
    INDEX 01 20:23:37
  TRACK 06 AUDIO
    PERFORMER "Air"
    TITLE "Remember"
    INDEX 01 24:40:15
  TRACK 07 AUDIO
    PERFORMER "Air"
    TITLE "You Make It Easy"
    INDEX 01 27:14:37
  TRACK 08 AUDIO
    PERFORMER "Air"
    TITLE "Ce matin-là"
    INDEX 01 31:16:02
  TRACK 09 AUDIO
    PERFORMER "Air"
    TITLE "New Star in the Sky (Chanson pour Solal)"
    INDEX 01 34:55:05
  TRACK 10 AUDIO
    PERFORMER "Air"
    TITLE "Le Voyage de Pénélope"
    INDEX 01 40:35:50

~:$ ls -lah /mnt/ffmpegfs/Air/1998\ -\ Moon\ Safari/Air\ -\ Moon\ Safari.flac.tracks/
total 1013M
drw-r--r-- 2 user user 4.0K Apr 21 16:31  .
drwxr-xr-x 2 user user 4.0K Apr 21 16:31  ..
-rw-r--r-- 1 user user 167M Apr 21 16:31 '01. Air - La Femme d’argent [07-11.293].webm'
-rw-r--r-- 1 user user 116M Apr 21 16:31 '02. Air - Sexy Boy [04-58.466].webm'
-rw-r--r-- 1 user user 104M Apr 21 16:31 '03. Air - All I Need [04-28.306].webm'
-rw-r--r-- 1 user user  87M Apr 21 16:31 '04. Air - Kelly, Watch the Stars! [03-45.426].webm'
-rw-r--r-- 1 user user  99M Apr 21 16:31 '05. Air - Talisman [04-16.706].webm'
-rw-r--r-- 1 user user  60M Apr 21 16:31 '06. Air - Remember [02-34.293].webm'
-rw-r--r-- 1 user user  94M Apr 21 16:31 '07. Air - You Make It Easy [04-01.533].webm'
-rw-r--r-- 1 user user  85M Apr 21 16:31 '08. Air - Ce matin-là [03-39.040].webm'
-rw-r--r-- 1 user user 132M Apr 21 16:31 '09. Air - New Star in the Sky (Chanson pour Solal) [05-40.600].webm'
-rw-r--r-- 1 user user  74M Apr 21 16:31 '10. Air - Le Voyage de Pénélope [03-10.866].webm'

Background:

My /etc/fstab:

/media/user/blkid/Music/flac   /mnt/ffmpegfs  fuse.ffmpegfs   allow_other,ro,desttype=webm+opus,cachepath=/media/user/blkid/ffmpegfs-cache,max_cache_size=16G,expiry_time=3d,logfile=/var/log/ffmpegfs.log,log_maxlevel=DEBUG,min_diskspace=5G  0       0

I compiled ffmpegfs from git main:

~:$ ffmpegfs --version
-------------------------------------------------------------------------------------------
Built with          : gcc 11.2.0 (linux-gnu)
configuration       : 

FFMPEGFS Version    : 2.10
FFmpeg Version      : 4.4.1-3+b2
Video CD Library    : enabled
FUSE library version: 2.9.9
fusermount3 version: 3.10.5
using FUSE kernel interface version 7.19

Kernel:

~:$ uname -a
Linux unibox 5.16.0-6-amd64 #1 SMP PREEMPT Debian 5.16.18-1 (2022-03-29) x86_64 GNU/Linux

I've been ripping CDs with CUERipper under Wine on Devuan Daedalus (which is based on Debian Testing/Bookworm) to single FLAC files with embedded and separate CUE sheets. Releases whose track names include accented (i.e. non-ASCII) characters such as é, ĉ, or à usually result in CUE sheet files that are not UTF-8 encoded. For example:

:$ file -i flac/Air/2012\ -\ Le\ Voyage\ dans\ la\ Lune/Air\ -\ Le\ Voyage\ dans\ la\ Lune.cue
flac/Air/2012 - Le Voyage dans la Lune/Air - Le Voyage dans la Lune.cue: text/plain; charset=iso-8859-1
:$ file -i cuebackups/Air\ -\ Moon\ Safari.cue
cuebackups/Air - Moon Safari.cue: text/plain; charset=unknown-8bit

I've determined that "unknown-8bit" encoding is actually WINDOWS-1252 as you might expect.

I want all my CUE sheets to be UTF-8 encoded and *nix compliant, so I converted them with iconv and then ran dos2unix (to convert CRLF line endings to LF).
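For readers who want to see what that conversion does to the bytes, here is a minimal Python sketch of the same two steps. It is not the iconv/dos2unix invocation used in this report; the function name `to_utf8_unix` and the sample title line are illustrative assumptions, and the source encoding must be known in advance.

```python
def to_utf8_unix(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode raw CUE sheet bytes as UTF-8 with LF line endings."""
    # Step 1: decode from the known source encoding (roughly what iconv does).
    text = raw.decode(source_encoding)
    # Step 2: turn CRLF line endings into LF (roughly what dos2unix does).
    text = text.replace("\r\n", "\n")
    return text.encode("utf-8")


# Hypothetical sample line, encoded the way a Windows ripper might write it.
original = 'TITLE "Ce matin-là"\r\n'.encode("cp1252")
converted = to_utf8_unix(original, "cp1252")
print(converted)
```

The result carries no BOM, which matches the plain UTF-8 files described in this report.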

Both single-album FLAC files in my examples have an embedded CUE sheet as well as a separate one in the same folder. For the buggy one, either removing the separate UTF-8 encoded CUE sheet from the album folder (so that ffmpegfs reads the embedded one) or using the original ISO-8859-1 encoded one avoids the track name bug.

Strangely, if I edit the buggy UTF-8 CUE sheet to either add another accented character like é to the bugged track name or add one in /certain/ other positions within a different track name, then remount ffmpegfs, the bug disappears. For example:

~:$ cat flac/Air/2012\ -\ Le\ Voyage\ dans\ la\ Lune/Air\ -\ Le\ Voyage\ dans\ la\ Lune.cue
REM DISCID 8A07580B
PERFORMER "Air"
TITLE "Le Voyage dans la Lune"
CATALOG 5099995563329
REM DATE 2012
REM DISCNUMBER 1
REM TOTALDISCS 1
REM COMMENT "CUERipper v2.2.1 Copyright (C) 2008-2022 Grigory Chudov"
FILE "Air - Le Voyage dans la Lune.flac" WAVE
  TRACK 01 AUDIO
    PERFORMER "Air"
    TITLE "Astronomic Club"
    INDEX 01 00:00:00
  TRACK 02 AUDIO
    PERFORMER "Air"
    TITLE "Seven Stars"
    INDEX 01 03:12:47
  TRACK 03 AUDIO
    PERFORMER "Air"
    TITLE "Retour sur Terre"
    INDEX 01 07:35:36
  TRACK 04 AUDIO
    PERFORMER "Air"
    TITLE "Parade"
    INDEX 01 08:08:15
  TRACK 05 AUDIO
    PERFORMER "Air"
    TITLE "Moon Fever"
    INDEX 01 10:40:71
  TRACK 06 AUDIO
    PERFORMER "Air"
    TITLE "Sonic Armada"
    INDEX 01 14:15:15
  TRACK 07 AUDIO
    PERFORMER "Air"
    TITLE "Who Am I Now?"
    INDEX 01 19:20:15
  TRACK 08 AUDIO
    PERFORMER "Air"
    TITLE "Décollageé"
    INDEX 01 22:20:57
  TRACK 09 AUDIO
    PERFORMER "Air"
    TITLE "Cosmic Trip"
    INDEX 01 23:58:34
  TRACK 10 AUDIO
    PERFORMER "Air"
    TITLE "Homme Lune"
    INDEX 01 28:08:42
  TRACK 11 AUDIO
    PERFORMER "Air"
    TITLE "Lava"
    INDEX 01 28:26:69

~:$ sudo umount /mnt/ffmpegfs
~:$ sudo mount /mnt/ffmpegfs
~:$ ls -lah /mnt/ffmpegfs/Air/2012\ -\ Le\ Voyage\ dans\ la\ Lune/Air\ -\ Le\ Voyage\ dans\ la\ Lune.flac.tracks/
total 725M
drw-r--r-- 2 user user 4.0K Apr 22 16:42  .
drwxr-xr-x 2 user user 4.0K Apr 22 16:42  ..
-rw-r--r-- 1 user user  75M Apr 22 16:42 '01. Air - Astronomic Club [03-12.626].webm'
-rw-r--r-- 1 user user 102M Apr 22 16:42 '02. Air - Seven Stars [04-22.853].webm'
-rw-r--r-- 1 user user  13M Apr 22 16:42 '03. Air - Retour sur Terre [00-32.720].webm'
-rw-r--r-- 1 user user  59M Apr 22 16:42 '04. Air - Parade [02-32.746].webm'
-rw-r--r-- 1 user user  83M Apr 22 16:42 '05. Air - Moon Fever [03-34.253].webm'
-rw-r--r-- 1 user user 118M Apr 22 16:42 '06. Air - Sonic Armada [05-05.000].webm'
-rw-r--r-- 1 user user  70M Apr 22 16:42 '07. Air - Who Am I Now? [03-00.560].webm'
-rw-r--r-- 1 user user  38M Apr 22 16:42 '08. Air - Décollageé [01-37.693].webm'
-rw-r--r-- 1 user user  97M Apr 22 16:42 '09. Air - Cosmic Trip [04-10.106].webm'
-rw-r--r-- 1 user user 7.1M Apr 22 16:42 '10. Air - Homme Lune [00-18.360].webm'
-rw-r--r-- 1 user user  67M Apr 22 16:42 '11. Air - Lava [02-53.133].webm'

My DEBUG logs do not have any information that makes me any wiser. Let me know if you'd like TRACE logs.

nschlia commented 2 years ago

That looks like the character set is being misdetected. Probably a single accented é is not sufficient. Could you provide me with the problem cue files? Logs are not required. Send me no audio files, of course, only the cuesheets. Normally artifacts like ĂŠ show up when a UTF-8 (2-byte) code is not properly converted. 8-bit files that are read with the wrong charset produce single characters only; for example, the German umlaut ö becomes a ÷ when taken as IBM850. Anyway, I'll have a look at it.
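Both effects can be reproduced in a few lines of Python. This is only an illustration of the byte-level mechanism, not ffmpegfs code; the helper name `misread` is made up, and ISO-8859-2 is chosen here as one 8-bit charset that turns the two UTF-8 bytes of é into the ĂŠ seen in the listing.

```python
def misread(text: str, actual: str, assumed: str) -> str:
    """Encode text in its real charset, then decode it with the wrong one."""
    return text.encode(actual).decode(assumed)


# A 2-byte UTF-8 sequence read as an 8-bit charset yields two characters.
print(misread("Décollage", "utf-8", "iso-8859-2"))  # DĂŠcollage

# An 8-bit character read as a different 8-bit charset yields one character.
print(misread("ö", "iso-8859-1", "cp850"))  # ÷
```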

cybern0id commented 2 years ago

Thanks for taking a look! I really <3 ffmpegfs :)

Here are the CUE sheets from my examples. I've attached both the originals and the iconv converted ones, appending .txt to the filenames to satisfy github. The "Le Voyage dans la Lune" is the problematic one but both have accented characters in their track lists.

Original ISO-8859-1 encoded CUE sheet: Air - Le Voyage dans la Lune.cue.txt

iconv UTF-8 converted CUE sheet with the problem: Air - Le Voyage dans la Lune.cue.txt

Original "unknown-8bit" / WINDOWS-1252 / CP-1252 encoded CUE sheet: Air - Moon Safari.cue.txt

iconv converted UTF-8 CUE sheet: Air - Moon Safari.cue.txt

Regards, (edited to correct second cue sheet original character encoding type)

nschlia commented 2 years ago

It is as I expected: the UTF-8 file is misdetected as ISO-8859-2. There is only a single non-ASCII character in the whole file (the é in "Décollage"), so the underlying library (libchardet) is misled. When I change the word to "Décolláge", it is correctly detected as UTF-8.
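To see how little evidence a detector has to work with in this file, the sketch below counts the non-ASCII bytes in the one affected line and checks which decodings succeed. It does not reproduce libchardet's heuristics; the helper names are assumptions for illustration only.

```python
def non_ascii_bytes(data: bytes) -> int:
    """Count bytes outside the 7-bit ASCII range."""
    return sum(1 for b in data if b >= 0x80)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


utf8_title = 'TITLE "Décollage"\n'.encode("utf-8")
latin1_title = 'TITLE "Décollage"\n'.encode("iso-8859-1")

print(non_ascii_bytes(utf8_title))   # only the two bytes of é carry a signal
print(is_valid_utf8(utf8_title))     # the converted file is valid UTF-8
print(is_valid_utf8(latin1_title))   # the original 8-bit file is not
# The same UTF-8 bytes also decode without error as ISO-8859-2,
# so both readings are formally possible and a guess is required.
print(utf8_title.decode("iso-8859-2"))
```

A strict UTF-8 validity check like `is_valid_utf8` is a different technique from statistical detection; it is shown here only to make the ambiguity visible.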

Sorry, there is no way to fix that on my side. When checking the character set, libchardet uses heuristics that may sometimes fail. This is the first time I have seen that, though.

But you may fix the problem by adding a BOM (Byte Order Mark) to your UTF-8 files. This will avoid misdetections. You can do that with sed like:

sed -i '1s/^\(\xef\xbb\xbf\)\?/\xef\xbb\xbf/' 'Air - Le Voyage dans la Lune.cue'

See https://stackoverflow.com/questions/1044595/how-can-i-re-add-a-unicode-byte-order-marker-in-linux/9815107#9815107

Just take care not to update the original files, though.
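As an alternative to the sed one-liner, the same BOM fix can be sketched in Python. The function names and the idea of writing to a separate destination path are assumptions made for this example (one reading of the advice to leave the originals untouched); only `codecs.BOM_UTF8` and `pathlib` come from the standard library.

```python
import codecs
from pathlib import Path


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM (EF BB BF) unless it is already present."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def write_bom_copy(src: str, dst: str) -> None:
    """Write a BOM-prefixed copy of src to dst, leaving src unchanged."""
    Path(dst).write_bytes(add_utf8_bom(Path(src).read_bytes()))


sample = add_utf8_bom('TITLE "Décollage"\n'.encode("utf-8"))
print(sample[:3] == codecs.BOM_UTF8)
```

Like the sed expression, `add_utf8_bom` is idempotent: running it on a file that already starts with a BOM returns the bytes unchanged.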

cybern0id commented 2 years ago

Thanks for looking into this and providing a solution! This works perfectly. I see other people have commented in the libchardet github issue tracker about mis-detection when there is only one UTF-8 accented or international character: https://github.com/Joungkyun/libchardet/issues/17

nschlia commented 2 years ago

You are welcome! Thanks for the hint; it looks like the problem is known. But the issue has been open since 2020 without progress. I hope there will be a fix one day.