libsidplayfp / sidplayfp

Console SID/MUS player
GNU General Public License v2.0
34 stars 8 forks source link

stilview: handle character conversion #17

Closed drfiemost closed 1 year ago

drfiemost commented 1 year ago

See (#16)

mywave82 commented 1 year ago

Looks clean and simple atleast :-)

drfiemost commented 1 year ago

Yep, luckily the dirty part of the work was already in place. For some weird reason it doesn't work well on MinGW/Windows with the default codepage (CP850). As a workaround I have to switch to CP1252.

mywave82 commented 1 year ago

I have not looked at code in sidplayfp iconv() in details yet, but I can take a look at it later. Some quirks might be needed to handle characters that are not translatable and different implementations of iconv

mywave82 commented 1 year ago

codeConvert.cpp , codeConvert::convert()

outPtr/outLeft could potentionally overflow due to not ensuring space for termination.

const char* codeConvert::convert(const char* src)
{
#ifdef HAVE_ICONV
    if (cd == (iconv_t) -1)
        return src;

    char *srcPtr = const_cast<char*>(src);
    size_t srcLeft = strlen(src);
    char *outPtr = buffer;
    size_t outLeft = sizeof(buffer) - 1; //////////    sizeof makes the size automatically be correct, and -1 in order to guarantee that the terminator can fit

    while (srcLeft > 0)
    {
        size_t ret = iconv(cd, &srcPtr, &srcLeft, &outPtr, &outLeft);
        if (ret == (size_t) -1)
        {
            if (outLeft && srcLeft && (errno != E2BIG)) //////// if conversion stopped due to untranslatable character, skip it
            {
                    *(outPtr++) = '_';
                    outLeft--;
                    srcPtr++;
                    srcLeft--;
            } else {
                    break;
            }
        }
    }

    // flush
    iconv(cd, NULL, &srcLeft, &outPtr, &outLeft);

    // terminate buffer string
    *outPtr = 0;

    return buffer;
#else
    return src;
#endif
}
drfiemost commented 1 year ago

Interesting, will check what happens on MinGW. In the meantime if you want to submit a PR for the buffer size fix it would be nice.

mywave82 commented 1 year ago

Will try to make PR tomorrow. Sitting in wrong OS with a different project ATM

drfiemost commented 1 year ago

Fun thing is that the iconv program performs the conversion correctly while using the library I get a question mark in place of ü. 😕

mywave82 commented 1 year ago

First question 1) where are you testing? 2) what kind of question mark?

The diamond-question mark usually appears if the console is utf-8 and you generate text that that is for instance cp850 or cp437 and you hit the characters that supposed to be the utf-8 multibyte sequence

mywave82 commented 1 year ago

If I modify the loop as suggested earlier, I get this (expected) behaviour when using a UTF-8 terminal in gnome/linux:

$ LANG="en_US.ASCII" ./stilview -l=/home/stian/sid/C64Music -e=/MUSICIANS/H/Huelsbeck_Chris/Great_Giana_Sisters.sid
---- GLOBAL  COMMENT ----
COMMENT: H_lsbeck's own comments are denoted (CH).
         More of Chris H_lsbeck's C64 music can be found in
$ LANG="en_US.UTF-8" ./stilview -l=/home/stian/SORT/ocp-test/sid/C64Music -e=/MUSICIANS/H/Huelsbeck_Chris/Great_Giana_Sisters.sid
---- GLOBAL  COMMENT ----
COMMENT: Hülsbeck's own comments are denoted (CH).
         More of Chris Hülsbeck's C64 music can be found in
drfiemost commented 1 year ago

I'm testing on MSYS2/MinGW64 installed on Windows 10. I get a normal question mark with CP850, the default Windows codepage.

mywave82 commented 1 year ago

I do not have MSYS2/MinGW64 development environment on my windows system - I do however have TortoiseGIT and GIT for windows with the command line and bash available running under MinGW64, and it is configured like this:

$ uname
MINGW64_NT-10.0-19045
$ export|grep LC
declare -x LC_CTYPE="en_GB.UTF-8"

And that you are seeing question marks makes be believe that your terminal is actually trying to decode the text as UTF-8.

CP850 text will get question marks if attempted to be decoded as UTF-8: ü = 0x81 in CP850, this would be byte 2, 3 or 4 if decoded as UTF-8-sequence - invalid as first byte => question mark Ü = 0x9a in CP850, this would be byte 2, 3 or 4 if decoded as UTF-8-sequence - invalid as first byte => question mark

drfiemost commented 1 year ago
$ uname
MINGW64_NT-10.0-19044
$ env|grep LC
LC_CTYPE=it_IT.CP850

BTW the code uses GetConsoleOutputCP() to get the codepage on Windows which can be set/read with chcp.com

With UTF-8 I get:

$ chcp.com 65001
Active code page: 65001

$ MSYS2_ARG_CONV_EXCL='-e=' src/stilview  -t=3 -e=/MUSICIANS/V/Vaca_Ramiro/Turrican.sid
   NAME: The Deep
 AUTHOR: Chris Hülsbeck
COMMENT: Re-edit of /MUSICIANS/H/Huelsbeck_Chris/Compilation_III.sid, Tune #4.
mywave82 commented 1 year ago

LC_CTYPE and chcp must always match, and I expect chcp not to touch LC_CTYPE. I am not experience to know how well chcp.com codepage changes interact with console applications compiled under MSYS2, but I would expect it to work.

a) Do they match when you start the console/command line? I see that running chcp in cmd.exe returns the current active code page. b) Do you get the correct output if you ensure that they both match?

chcp
echo $LC_CTYPE

chcp 850
LC_CTYPE=it_IT.CP850 MSYS2_ARG_CONV_EXCL='-e=' src/stilview  -t=3 -e=/MUSICIANS/V/Vaca_Ramiro/Turrican.sid

chcp 65001
LC_CTYPE=it_IT.UTF-8 MSYS2_ARG_CONV_EXCL='-e=' src/stilview  -t=3 -e=/MUSICIANS/V/Vaca_Ramiro/Turrican.sid
mywave82 commented 1 year ago

I missed your comment about GetConsoleOutputCP() btw, should probably add a debug printf before iconv_open to verify the "encoding" variable is what you expect.

And verify that iconv_open is successfull! It might have failed, and that being your main problem

drfiemost commented 1 year ago

I think I've found it, it's the call to setlocale(LC_ALL, ""); that causes the problem. Will rebase and fix it later.

drfiemost commented 1 year ago

Looks good now, thanks for the help