unicode output with ncm-metaconfig

wpoely86 commented 3 years ago

How does one deal with unicode in ncm-metaconfig?

I have something like:

prefix '/software/components/metaconfig/services/{/etc/file.conf}';
'module' = 'tiny';
'contents/foo' = 'Bruxelles-Capitale\, Région de';

In the machine profile I see:

              "foo": "Bruxelles-Capitale\\, Région de",

but when it's written to the file I get:

foo=Bruxelles-Capitale\, R?gion de

Any hints how to fix this? @stdweird @jrha ? escape() doesn't help.

wpoely86 commented 3 years ago

The language on this machine is set to en_US.UTF-8

stdweird commented 3 years ago

@wpoely86 ncm-ncd probably doesn't respect that. See https://github.com/quattor/ncm-ncd/blob/master/src/main/scripts/ncm-ncd#L19 and check the content of /proc/.../environ to see what is actually used.

See if changing it there "fixes" things. If it does, we might make it configurable via the config file, but I'm not sure what the impact of making it the default would be.
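A minimal sketch of the environ check suggested above, in Python. The helper names (`parse_environ`, `locale_vars`, `read_locale_vars`) are hypothetical and not part of any Quattor tool. It relies only on the Linux convention that /proc/<pid>/environ holds NUL-separated KEY=VALUE entries.

```python
def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE blob found in /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry or b"=" not in entry:
            continue  # skip the trailing empty entry and malformed items
        key, _, value = entry.partition(b"=")
        env[key.decode("utf-8", "replace")] = value.decode("utf-8", "replace")
    return env


def locale_vars(env: dict) -> dict:
    """Keep only LANG and the LC_* variables."""
    return {k: v for k, v in env.items() if k == "LANG" or k.startswith("LC_")}


def read_locale_vars(pid="self") -> dict:
    """Hypothetical convenience wrapper; reading another user's process needs root."""
    with open(f"/proc/{pid}/environ", "rb") as handle:
        return locale_vars(parse_environ(handle.read()))
```

Calling `read_locale_vars(1234)` for the PID of a running ncm-ncd would show the locale variables it was started with. Note that this file reflects the environment at process start only; variables a process sets for itself afterwards do not show up there.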

wpoely86 commented 3 years ago

I've added:

    LANG => "en_US.UTF-8",
    LC_ALL => "en_US.UTF-8",

but no change.

wpoely86 commented 3 years ago

use utf8;

in ncm-ncd doesn't change a thing either.

wpoely86 commented 3 years ago

I've checked in the configure of metaconfig and it definitely gets LANG="en_US.UTF-8"

stdweird commented 3 years ago

What happens with debug=5? Does it print unicode or not? It is also possible that it is ccm-fetch that (also?) ignores the lang, so the data in the db is already wrong; nothing that ncm-ncd could fix.

wpoely86 commented 3 years ago

Yes, it's already wrong in the xml under /var/lib/ccm. Trying to insert the locale there too

wpoely86 commented 3 years ago

Didn't help. If I run it with --debug 5, the first time it's printed is already wrong:

[DEBUG] AddPath: /software/components/metaconfig/services/.../contents/foo => 1937 => Bruxelles-Capitale\, R?gion de

wpoely86 commented 3 years ago

OK, it's apache :( Even if I fetch it with curl, it's wrong.

wpoely86 commented 3 years ago

OK, cancel that. It's not apache. The actual profile is wrong. It has the correct output on my laptop but not on the server.

We're getting there!

wpoely86 commented 3 years ago

OK, I thought I had it: there was a LANG=C in the script used to build the profiles. But alas. Now at least the xml file with the profile under /var/lib/ccm is correct.

Now I get:

Bruxelles-Capitale\, R<E9>gion de

After adding the LANG and LC_ALL to ncm-ncd it turns into:

ST=Bruxelles-Capitale\, Région de

but the actual program that needs to read the file now says:

non-UTF-8 character at line 10

wpoely86 commented 3 years ago

Python also complains on the file:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 415: invalid continuation byte

So somehow, it manages to write out something almost correct yet invalid UTF-8?
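The error message is consistent with the file containing a Latin-1 encoded "é". In Latin-1 that character is the single byte 0xE9. In UTF-8 it is the two bytes 0xC3 0xA9, and a bare 0xE9 would have to start a three-byte sequence. A small Python sketch (the helper name `first_invalid_utf8_byte` is made up for illustration) reproduces the same complaint on a short string:

```python
def first_invalid_utf8_byte(data: bytes):
    """Return (offset, byte value) of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


text = "Région"
latin1_bytes = text.encode("latin-1")  # b'R\xe9gion'
utf8_bytes = text.encode("utf-8")      # b'R\xc3\xa9gion'

print(first_invalid_utf8_byte(latin1_bytes))  # (1, 233), i.e. byte 0xe9 at offset 1
print(first_invalid_utf8_byte(utf8_bytes))    # None
```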

stdweird commented 3 years ago

what do you mean when you say 'it turns into': is that logging, is that from cat'ing the file?...

wpoely86 commented 3 years ago

That's the weird part. If I put it through cat, it gives:

Bruxelles-Capitale\, R�gion de

(same as in debug output of ncm-ncd --co metaconfig).

But if I open the file in vim, I see the correct output. Doing :set fileencoding=utf-8 and a write fixes the issue. After that the file is valid UTF-8.

So whatever gets written out is valid in something, but not UTF-8 (also not UTF-16).

wpoely86 commented 3 years ago

vim claims: fileencoding=latin1. But that should also be valid UTF-8?

wpoely86 commented 3 years ago

OK, seems not. Running iconv -f latin1 -t utf-8 also fixes the issue on the file.

So despite the LANG the files are written in latin1.
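As a stopgap, the iconv step can be sketched in Python. This is only an illustration of the workaround, not a proposed fix. The function names are hypothetical, and unlike iconv (which writes to stdout) `convert_file` rewrites the file in place. Data that already decodes as UTF-8 is left untouched so that running it twice does not double-encode.

```python
from pathlib import Path


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8; return valid UTF-8 input unchanged."""
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8 (pure ASCII included)
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this decode cannot fail.
        return data.decode("latin-1").encode("utf-8")


def convert_file(path: str) -> bool:
    """Rewrite the file in place; return True if its content changed."""
    source = Path(path)
    original = source.read_bytes()
    converted = latin1_to_utf8(original)
    if converted != original:
        source.write_bytes(converted)
        return True
    return False
```

The validity check is a heuristic: a Latin-1 file whose bytes happen to form valid UTF-8 would be left alone, which is rare for ordinary text but possible.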

stdweird commented 3 years ago

I'm quite sure CAF can't produce valid UTF-8 output at the moment. In the hashref on https://github.com/quattor/CAF/blob/master/src/main/perl/FileWriter.pm#L327 you need to add

binmode_layer => ':utf8',

(from https://metacpan.org/pod/File::AtomicWrite, look for binmode_layer)
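Why the layer matters: as far as I understand Perl's I/O model, a filehandle without an encoding layer writes code points below 256 as single bytes, which is effectively Latin-1, and that would explain the 0xE9 in the file. The following Python analogy (not CAF or File::AtomicWrite code; `write_through_layer` is a hypothetical helper) shows the same string producing different bytes depending on the encoding layer wrapped around the raw stream:

```python
import io


def write_through_layer(text: str, encoding: str) -> bytes:
    """Write text through an encoding layer and return the bytes that reach the raw stream."""
    raw = io.BytesIO()
    layer = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    layer.write(text)
    layer.flush()  # push buffered text down to the raw byte stream
    return raw.getvalue()


line = "foo=Bruxelles-Capitale\\, Région de\n"
print(write_through_layer(line, "latin-1"))  # contains the single byte \xe9
print(write_through_layer(line, "utf-8"))    # contains the two bytes \xc3\xa9
```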

wpoely86 commented 3 years ago

Yep, that fixes it! Thanks for the help.

OK, so what's the proper way to fix this? Can we add this in the general case? As it's now producing latin1, I think so?

stdweird commented 3 years ago

can't change anything by default
I think no changes are needed to CCM (the json coding handles things ok, and the rest is the CDB or whatever db). Please confirm this (remove the LANG stuff from ccm-fetch, do a ccm-fetch --force and rerun ncm-ncd --co meta...)
in CAF, we can pass the binmode thingie if the LANG is set to something utf-ish (but we should look at the possible values etc)
make LANG and others configurable in the ncm configfile, so the minimal required variables are set in ncm-ncd (what are the minimal variables?); this would also pass it on to CAF to do the right thing
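One way the "utf-ish" check could look, sketched in Python rather than Perl and purely as an assumption about what such a test might do (nothing like this exists in CAF as far as this thread shows). It follows the usual POSIX precedence, where a non-empty LC_ALL overrides LC_CTYPE, which overrides LANG, and then inspects the codeset after the dot. Both function names are hypothetical.

```python
def effective_ctype_locale(env: dict) -> str:
    """Pick the locale name that governs character encoding: LC_ALL, then LC_CTYPE, then LANG."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # empty values count as unset
            return value
    return "C"


def locale_is_utf8(env: dict) -> bool:
    """True if the effective locale names a UTF-8 codeset, e.g. en_US.UTF-8 or nl_BE.utf8."""
    locale_name = effective_ctype_locale(env)
    _, dot, rest = locale_name.partition(".")
    if not dot:
        return False  # no explicit codeset, e.g. "C" or "en_US"
    codeset = rest.split("@", 1)[0]  # drop modifiers such as @euro
    return codeset.lower().replace("-", "") == "utf8"


print(locale_is_utf8({"LANG": "en_US.UTF-8"}))                 # True
print(locale_is_utf8({"LANG": "en_US.UTF-8", "LC_ALL": "C"}))  # False
```

A locale without an explicit codeset is treated as non-UTF-8 here, which is a conservative choice; what such a name actually resolves to depends on the system's locale definitions.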

wpoely86 commented 3 years ago

Apologies for the delay.

If I run without any changes:

CCM does the right thing. The profile under /var/lib/ccm has the correct unicode string
ncm-ncd --co metaconfig writes out the file in latin1 encoding. Doing iconv -f latin1 -t utf-8 gives me a valid unicode file that is accepted by the program.

Putting a LANG and LC_ALL variable to en_US.UTF-8 in ncm-ncd doesn't change a thing.

So I think the only thing needed is a way to set the binmode thingie to unicode.

quattor / configuration-modules-core

unicode output with ncm-metaconfig #1494