Closed drfiemost closed 1 year ago
Looks clean and simple atleast :-)
Yep, luckily the dirty part of the work was already in place. For some weird reason it doesn't work well on MinGW/Windows with the default codepage (CP850). As a workaround I have to switch to CP1252.
I have not looked at code in sidplayfp iconv() in details yet, but I can take a look at it later. Some quirks might be needed to handle characters that are not translatable and different implementations of iconv
codeConvert.cpp , codeConvert::convert()
outPtr/outLeft could potentionally overflow due to not ensuring space for termination.
const char* codeConvert::convert(const char* src)
{
#ifdef HAVE_ICONV
if (cd == (iconv_t) -1)
return src;
char *srcPtr = const_cast<char*>(src);
size_t srcLeft = strlen(src);
char *outPtr = buffer;
size_t outLeft = sizeof(buffer) - 1; ////////// sizeof makes the size automatically be correct, and -1 in order to guarantee that the terminator can fit
while (srcLeft > 0)
{
size_t ret = iconv(cd, &srcPtr, &srcLeft, &outPtr, &outLeft);
if (ret == (size_t) -1)
{
if (outLeft && srcLeft && (errno != E2BIG)) //////// if conversion stopped due to untranslatable character, skip it
{
*(outPtr++) = '_';
outLeft--;
srcPtr++;
srcLeft--;
} else {
break;
}
}
}
// flush
iconv(cd, NULL, &srcLeft, &outPtr, &outLeft);
// terminate buffer string
*outPtr = 0;
return buffer;
#else
return src;
#endif
}
Interesting, will check what happens on MinGW. In the meantime if you want to submit a PR for the buffer size fix it would be nice.
Will try to make PR tomorrow. Sitting in wrong OS with a different project ATM
Fun thing is that the iconv program performs the conversion correctly while using the library I get a question mark in place of ü. 😕
First question 1) where are you testing? 2) what kind of question mark?
The diamond-question mark usually appears if the console is utf-8 and you generate text that that is for instance cp850 or cp437 and you hit the characters that supposed to be the utf-8 multibyte sequence
If I modify the loop as suggested earlier, I get this (expected) behaviour when using a UTF-8 terminal in gnome/linux:
$ LANG="en_US.ASCII" ./stilview -l=/home/stian/sid/C64Music -e=/MUSICIANS/H/Huelsbeck_Chris/Great_Giana_Sisters.sid
---- GLOBAL COMMENT ----
COMMENT: H_lsbeck's own comments are denoted (CH).
More of Chris H_lsbeck's C64 music can be found in
$ LANG="en_US.UTF-8" ./stilview -l=/home/stian/SORT/ocp-test/sid/C64Music -e=/MUSICIANS/H/Huelsbeck_Chris/Great_Giana_Sisters.sid
---- GLOBAL COMMENT ----
COMMENT: Hülsbeck's own comments are denoted (CH).
More of Chris Hülsbeck's C64 music can be found in
I'm testing on MSYS2/MinGW64 installed on Windows 10. I get a normal question mark with CP850, the default Windows codepage.
I do not have MSYS2/MinGW64 development environment on my windows system - I do however have TortoiseGIT and GIT for windows with the command line and bash available running under MinGW64, and it is configured like this:
$ uname
MINGW64_NT-10.0-19045
$ export|grep LC
declare -x LC_CTYPE="en_GB.UTF-8"
And that you are seeing question marks makes be believe that your terminal is actually trying to decode the text as UTF-8.
CP850 text will get question marks if attempted to be decoded as UTF-8: ü = 0x81 in CP850, this would be byte 2, 3 or 4 if decoded as UTF-8-sequence - invalid as first byte => question mark Ü = 0x9a in CP850, this would be byte 2, 3 or 4 if decoded as UTF-8-sequence - invalid as first byte => question mark
$ uname
MINGW64_NT-10.0-19044
$ env|grep LC
LC_CTYPE=it_IT.CP850
BTW the code uses GetConsoleOutputCP() to get the codepage on Windows which can be set/read with chcp.com
With UTF-8 I get:
$ chcp.com 65001
Active code page: 65001
$ MSYS2_ARG_CONV_EXCL='-e=' src/stilview -t=3 -e=/MUSICIANS/V/Vaca_Ramiro/Turrican.sid
NAME: The Deep
AUTHOR: Chris Hülsbeck
COMMENT: Re-edit of /MUSICIANS/H/Huelsbeck_Chris/Compilation_III.sid, Tune #4.
LC_CTYPE and chcp must always match, and I expect chcp not to touch LC_CTYPE. I am not experience to know how well chcp.com codepage changes interact with console applications compiled under MSYS2, but I would expect it to work.
a) Do they match when you start the console/command line? I see that running chcp in cmd.exe returns the current active code page. b) Do you get the correct output if you ensure that they both match?
chcp
echo $LC_CTYPE
chcp 850
LC_CTYPE=it_IT.CP850 MSYS2_ARG_CONV_EXCL='-e=' src/stilview -t=3 -e=/MUSICIANS/V/Vaca_Ramiro/Turrican.sid
chcp 65001
LC_CTYPE=it_IT.UTF-8 MSYS2_ARG_CONV_EXCL='-e=' src/stilview -t=3 -e=/MUSICIANS/V/Vaca_Ramiro/Turrican.sid
I missed your comment about GetConsoleOutputCP() btw, should probably add a debug printf before iconv_open to verify the "encoding" variable is what you expect.
And verify that iconv_open is successfull! It might have failed, and that being your main problem
I think I've found it, it's the call to setlocale(LC_ALL, "");
that causes the problem.
Will rebase and fix it later.
Looks good now, thanks for the help
See (#16)