Recode produces invalid UTF-8 given invalid UTF-8

rrthomas commented 2 years ago

See #3. Recode versions 3.6, 3.7.9 and 3.7.11 all produce the same invalid output given invalid input:

% perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' | recode UTF-8..UTF-8
\200AαB

Since the behaviour is clearly not new it will require some study to see why it behaves as it does (is it a long-standing bug? or deliberate? or a deep-seated design problem?).

epa commented 2 years ago

Strangely though, the bug doesn't happen with -ignore. That has the side effect of making the input checked for bad byte sequences, and they are skipped.

perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' | ./src/recode UTF-8..UTF-8-ignore
AαB
/me/recode-3.7.11/src/.libs/lt-recode: Invalid input in step `UTF-8..UTF8-ignore'

rrthomas commented 2 years ago

@epa, please could you let me know if this now works for you as advertised, and with documentation that makes sense? (Issue #37 remains to be dealt with.)

epa commented 2 years ago

Hi, thanks for fixing this, sorry I can't test it yet because the current git version does not build for me on RHEL 8.5. help2man: no locale support (Locale::gettext required) although I do have that Perl module.

rrthomas commented 2 years ago

The problem is most likely that help2man needs to be built with gettext support (this is documented!). Merely having the perl module installed is insufficient.

epa commented 2 years ago

I'm testing the current version (62b996d09df36eae147362511cabfc3b2368dedd). I'm not sure it works yet:

% perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' >in
% ./src/recode UTF-8..UTF-8 <in
[produces invalid UTF-8]
% ./src/recode --strict UTF-8..UTF-8 <in
[still produces invalid UTF-8]
% ./src/recode --force UTF-8..UTF-8 <in
[the same again]
% ./src/recode UTF-8..UTF-8-ignore <in
lt-recode: Request `UTF-8..UTF-8-ignore' is erroneous

In my view if recode is asked to produce UTF-8 output, it should always produce UTF-8 and never junk bytes -- and this is such a basic requirement that it shouldn't depend on any force or strict flags.

I think that if the input is specified as UTF-8 then recode should check that, and die if the input is not valid UTF-8 -- but it can be useful to have a lax mode where junk bytes in the input are skipped as best you can.

rrthomas commented 2 years ago

I'm really sorry, @epa, I commented on the wrong bug above. I have not yet tried to fix this issue, I was asking for your comments on issue #38.

rrthomas commented 2 years ago

In my view if recode is asked to produce UTF-8 output, it should always produce UTF-8 and never junk bytes -- and this is such a basic requirement that it shouldn't depend on any force or strict flags.

The reason that this does not happen currently is because of recode's conversion optimization. When you request a conversion utf-8..utf-8, as you will see if you use --verbose, this is reduced to a zero-step conversion, a "mere copy". Hence, no validation of the input is performed. I think this is the only case in which invalid output is produced.

If instead a conversion is forced, by e.g.

$ recode UTF-8..UCS-4..UTF-8 <in

Then the input is validated and the problem is found (and so no invalid output is produced).

Also, I notice that with --verbose recode 3.7, unlike recode 3.6, prints the request out twice. I shall fix that.

epa commented 2 years ago

Yes, I suspected it might be something like that. But surely the main reason why a user would run recode UTF-8..UTF-8 rather than just cat is to get the extra validation that recode performs. So I think whatever optimization is done should not disable the error checks that prevent bad output.

rrthomas commented 2 years ago

It's tricky. First, recode might be invoked with UTF-8..UTF-8 as the result of some other computation, not because the user specifically requested it. Secondly, while recode is optimizing a conversion, it might generate this sort of step internally, and there's no reason to keep it.

Internally, recode has no validation of input or output that is separate from a transformation, unfortunately.

rrthomas / recode

Recode produces invalid UTF-8 given invalid UTF-8 #37