Open wpoely86 opened 3 years ago
The language on this machine is set to en_US.UTF-8
@wpoely86 ncm-ncd probably doesn't respect that. see https://github.com/quattor/ncm-ncd/blob/master/src/main/scripts/ncm-ncd#L19
check the content of /proc/.../environ to see what is used.
see if changing it there "fixes" things. we might make it configurable via the config file if that is the case, but not sure what the impact would be of making it the default.
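For reference, a minimal Python sketch of the environ check suggested above. It assumes a Linux /proc filesystem; the helper names parse_environ and locale_vars are hypothetical, and the process id is left for the reader to supply. Note that /proc/<pid>/environ reflects the environment the process was started with, so variables set later from inside the program will not normally show up there.

```python
from pathlib import Path


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE entries of a /proc/<pid>/environ file."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" not in entry:
            continue  # skip the empty trailing entry or malformed data
        key, _, value = entry.partition(b"=")
        env[key.decode("utf-8", "replace")] = value.decode("utf-8", "replace")
    return env


def locale_vars(pid="self") -> dict:
    """Return only the locale-related variables of a process (Linux only)."""
    env = parse_environ(Path(f"/proc/{pid}/environ").read_bytes())
    return {k: v for k, v in env.items() if k == "LANG" or k.startswith("LC_")}


# Usage (hypothetical): print(locale_vars(1234)) for the pid of the running process,
# or print(locale_vars()) to inspect the current interpreter.
```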
I've added:
LANG => "en_US.UTF-8",
LC_ALL => "en_US.UTF-8",
but no change.
Adding use utf8; in ncm-ncd doesn't change a thing either.
I've checked in the configure of metaconfig and it definitely gets LANG="en_US.UTF-8"
what happens in debug=5? does it print unicode or not? it is also possible that it is ccm-fetch that (also?) ignores the lang, so the data in the db is already wrong; nothing that ncm-ncd could fix
Yes, it's already wrong in the xml under /var/lib/ccm. Trying to insert the locale there too.
Didn't help. If I run it with --debug 5, the first time it's printed is already wrong:
[DEBUG] AddPath: /software/components/metaconfig/services/.../contents/foo => 1937 => Bruxelles-Capitale\, R?gion de
OK, it's apache :( Even if I fetch it with curl, it's wrong.
OK, cancel that. It's not apache. The actual profile is wrong. It has the correct output on my laptop but not on the server.
We're getting there!
OK, I thought I had it: there was a LANG=C in the script used to build the profiles, but alas.
Now at least the xml file with the profile under /var/lib/ccm is correct.
Now I get:
Bruxelles-Capitale\, R<E9>gion de
After adding the LANG and LC_ALL to ncm-ncd it turns into:
ST=Bruxelles-Capitale\, Région de
but the actual program that needs to read the file now says:
non-UTF-8 character at line 10
Python also complains on the file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 415: invalid continuation byte
So somehow, it manages to write out something almost correct yet invalid UTF-8?
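A small Python sketch of why a file can look almost right and still be invalid UTF-8. It assumes the file holds latin1 bytes, which is only a guess at this point in the thread. In latin1 the character é is the single byte 0xE9. In UTF-8, 0xE9 announces a three-byte sequence, so the decoder expects continuation bytes after it and finds a plain "g" instead. The helper name is_valid_utf8 is hypothetical.

```python
text = "Région"

latin1_bytes = text.encode("latin-1")  # é stored as the single byte 0xE9
utf8_bytes = text.encode("utf-8")      # é stored as the two bytes 0xC3 0xA9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_bytes, is_valid_utf8(latin1_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
```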
what do you mean when you say 'it turns into': is that logging, is that from cat'ing the file?...
That's the weird part. If I put it through cat, it gives:
Bruxelles-Capitale\, R�gion de
(same as in debug output of ncm-ncd --co metaconfig).
But if I open the file in vim, I see the correct output. Doing :set fileencoding=utf-8 and a write fixes the issue. After that the file is valid UTF-8.
So whatever gets written out is valid in something, but not UTF-8 (also not UTF-16).
vim claims: fileencoding=latin1. But that should also be valid UTF-8?
OK, seems not. Running iconv -f latin1 -t utf-8 also fixes the issue on the file.
So despite the LANG the files are written in latin1.
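A rough Python equivalent of the iconv workaround, as a sketch only. latin1 and UTF-8 agree on the ASCII range alone, so any byte above 0x7F has to be transcoded. Because every byte sequence decodes as latin1, converting a file that is already UTF-8 would silently double-encode it, which is why this version checks first. The function name ensure_utf8 is hypothetical.

```python
def ensure_utf8(data: bytes) -> bytes:
    """Return data unchanged if it is valid UTF-8, else transcode it from latin1."""
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8 (includes pure ASCII)
    except UnicodeDecodeError:
        # Same transformation as: iconv -f latin1 -t utf-8
        return data.decode("latin-1").encode("utf-8")


fixed = ensure_utf8(b"R\xe9gion de")
print(fixed)
```

Applying the function a second time leaves the result untouched, so it is safe to run repeatedly over the same file contents.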
i'm quite sure CAF can't produce valid utf-8 atm. in the hashref on https://github.com/quattor/CAF/blob/master/src/main/perl/FileWriter.pm#L327 you need to add
binmode_layer => ':utf8',
(from https://metacpan.org/pod/File::AtomicWrite, look for binmode_layer)
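To illustrate the idea behind that option (this is not the CAF or File::AtomicWrite API, just a Python analogue under that assumption): the same string yields different bytes on disk depending on the encoding chosen when the file handle is opened, so the writer has to pick UTF-8 explicitly rather than rely on a default. The function atomic_write_text is hypothetical.

```python
import os
import tempfile


def atomic_write_text(path, text, encoding="utf-8"):
    """Write text to a temp file with an explicit encoding, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        # The encoding given here plays the role of the output layer.
        with os.fdopen(fd, "w", encoding=encoding, newline="") as fh:
            fh.write(text)
        os.replace(tmp, path)  # atomic rename within the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # do not leave a partial temp file behind
        raise
```

Calling atomic_write_text(path, "Région de\n") stores é as 0xC3 0xA9, while passing encoding="latin-1" stores it as the single byte 0xE9, which is the form the consuming program rejected.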
Yep, that fixes it! Thanks for the help.
OK, so what's the proper way to fix this? Can we add this in the general case? As it's now producing latin1, I think so?
ccm-fetch --force and rerun ncm-ncd --co meta...
ncm-ncd (what are the minimal variables?); but this would also pass it on to CAF to do the right thing
Apologies for the delay.
If I run without any changes:
/var/lib/ccm has the correct unicode string.
ncm-ncd --co metaconfig writes out the file in latin1 encoding. Doing iconv -f latin1 -t utf-8 gives me a valid unicode file that is accepted by the program.
Setting the LANG and LC_ALL variables to en_US.UTF-8 in ncm-ncd doesn't change a thing.
So I think the only thing needed is a way to set the binmode layer to UTF-8.
How does one deal with unicode in ncm-metaconfig?
I have something like:
In the machine profile I see:
but when it's written to the file I get:
Any hints how to fix this? @stdweird @jrha ?
escape() doesn't help.