jacquesh / foo_openlyrics

An open-source lyric display panel for foobar2000
MIT License
401 stars, 24 forks

bad decoding of 'windows-1252', 'iso-8859-1' #364

Open oevesque opened 1 month ago

oevesque commented 1 month ago

If the txt file is encoded as 'windows-1252' or 'iso-8859-1', the lyrics are displayed incorrectly: they show up as Chinese characters on a single line.

Steps to reproduce

  1. Try the txt file attached below: 13 - La Non-Demande En Mariage.txt

Expected behavior

The lyrics are shown as multiple lines of French text.

Versions

Debug logs

No errors in the debug logs.

Additional information

I wrote a small Python script to convert my lyrics to UTF-8, but some files are read-only and I can't convert them.

import os
import chardet

# Function to convert a file from ANSI to UTF-8
def convert_file(file_path):
    # Detect the current encoding of the file
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
        current_encoding = result['encoding']

    # If the encoding could not be detected (chardet returns None) or is not ANSI, skip the file
    if current_encoding is None or current_encoding.lower() not in ['windows-1252', 'iso-8859-1']:
        print(f"Skipping {file_path} (encoding: {current_encoding})")
        return

    # Read the file content in ANSI encoding
    with open(file_path, 'r', encoding=current_encoding) as f:
        content = f.read()

    # Write the content back to the file in UTF-8 encoding
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(content)

    print(f"Converted {file_path} from {current_encoding} to UTF-8")

# Get the current working directory
current_dir = os.getcwd()

# Loop through all files in the directory
for filename in os.listdir(current_dir):
    # Check if the file is a .txt file
    if filename.endswith('.txt'):
        file_path = os.path.join(current_dir, filename)
        convert_file(file_path)
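
For the read-only files, one possible workaround is to leave the originals alone and write UTF-8 copies into a separate directory. The following is a minimal sketch, not part of the original script: `convert_copy` and its parameters are hypothetical names, the source encoding is assumed to be known already, and it only helps if the player can be pointed at the directory holding the copies.

```python
from pathlib import Path


def convert_copy(src_path, out_dir, source_encoding="windows-1252"):
    """Write a UTF-8 copy of src_path into out_dir; the original is never modified."""
    src = Path(src_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Work on raw bytes so that line endings are preserved exactly as they are.
    text = src.read_bytes().decode(source_encoding)

    dst = out / src.name
    dst.write_bytes(text.encode("utf-8"))
    return dst
```

Because the source file is only ever opened for reading, its read-only attribute does not get in the way.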

jacquesh commented 1 month ago

When the input isn't UTF-8, our only option is to guess at the encoding by trying a whole bunch of them and going with whichever encoding first succeeds (which may or may not be the intended one). Based on your above sample, I suppose I could look into how Python's chardet.detect works and see if we can mimic that behaviour.

Related to #105: ideally we should allow the user to change the encoding, but then we'd need to either 1) save it, which presumably wouldn't work in your case anyway if files are read-only (although I'd venture that many things are likely to not work very well if the required files are read-only), or 2) keep an internal database of "what is the encoding for this lyric file?", which is more work but probably also "more correct". I've not yet tried your particular file, but my guess is that it happens to be valid with codepage 936/949/950 (which are Chinese and Korean, and are tried before 1252 because codepages get tried in roughly sequential order). We do already try the "system default codepage", so if your machine is set to French then I would expect it to try that first, but I also don't have any idea how it decides what the "system default codepage" is, so maybe not.
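
The "try encodings in order and take the first that succeeds" approach described above can be sketched in Python as follows. This is only an illustration, not the foo_openlyrics implementation: the candidate list, its order, and the name `guess_decode` are assumptions, and Python's codec names stand in for the Windows codepages.

```python
# Hypothetical candidate order; the list actually used by foo_openlyrics may differ.
CANDIDATES = ["utf-8", "cp936", "cp949", "cp950", "cp1252"]


def guess_decode(raw, candidates=CANDIDATES):
    """Return (encoding, text) for the first candidate that decodes raw without error."""
    for encoding in candidates:
        try:
            # Strict decoding raises UnicodeDecodeError on byte sequences
            # that are invalid in this encoding.
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the input")


# A trailing 0xE9 is an incomplete sequence for the multi-byte candidates,
# so only cp1252 accepts it.
print(guess_decode("café".encode("cp1252")))

# Here 0xE9 is followed by another byte, so an earlier multi-byte codepage
# can accept the pair and "win" before cp1252 is ever tried.
print(guess_decode("fiancée".encode("cp1252"))[0])
```

The second call shows the weakness of the approach: a successful decode only proves that the bytes are valid in that encoding, not that it is the encoding the file was written in.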

Honestly, the actual re-encoding itself isn't particularly complicated; the main reason I've not done it yet is that it's a bit of a pain to do the UI properly.

I should add though, that we do output debug logs about the decoding so if you enable debug logs in preferences, you should see some info about what encodings were tried and which one was ultimately used.

oevesque commented 1 month ago

Thanks for your reply. Output log with debug logging enabled:

INFO-OpenLyrics: Lookup local-file file://D:\Music\Mp3 Francais\Georges Brassens - Anthologie\13 - La Non-Demande En Mariage.txt for lyrics...
INFO-OpenLyrics: Successfully retrieved lyrics from file://D:\Music\Mp3 Francais\Georges Brassens - Anthologie\13 - La Non-Demande En Mariage.txt
INFO-OpenLyrics: Successfully looked-up lyrics from source: Local files
INFO-OpenLyrics: Parsing lyrics text...
INFO-OpenLyrics: Successfully converted 1713 bytes of UTF-16 into UTF-8
INFO-OpenLyrics: Parsing LRC lyric text...
INFO-OpenLyrics: Lyric loading complete
INFO-OpenLyrics: Skipping lyric save. Type: 1, Local: yes, Timestamped: no, Autosave: 1
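
The "converted 1713 bytes of UTF-16 into UTF-8" line suggests that the file's bytes were interpreted as UTF-16. Assuming that reading of the log is right, the sketch below shows why single-byte text misread this way tends to look like Chinese characters on one line: every pair of bytes becomes a single code unit, pairs of ASCII letters mostly land in the CJK ideograph range, and line-feed bytes get absorbed into pairs instead of producing line breaks. The function name `misread_as_utf16` is hypothetical and little-endian byte order is an assumption.

```python
def misread_as_utf16(text):
    """Encode text as windows-1252, then decode the bytes as if they were UTF-16-LE."""
    raw = text.encode("cp1252")
    if len(raw) % 2:
        raw = raw[:-1]  # UTF-16 needs an even number of bytes; drop the odd one
    # errors="replace" avoids an exception if a byte pair happens to form a lone surrogate.
    return raw.decode("utf-16-le", errors="replace")


garbled = misread_as_utf16("Ma mie, de grâce,\nne mettons pas\n")
print(garbled)
print("\n" in garbled)  # the line feeds are swallowed into byte pairs
```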