Open rrthomas opened 2 years ago
Strangely though, the bug doesn't happen with -ignore. That has the side effect of checking the input for bad byte sequences, which are then skipped.
perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' | ./src/recode UTF-8..UTF-8-ignore
AαB
/me/recode-3.7.11/src/.libs/lt-recode: Invalid input in step `UTF-8..UTF8-ignore'
@epa, please could you let me know if this now works for you as advertised, and with documentation that makes sense? (Issue #37 remains to be dealt with.)
Hi, thanks for fixing this. Sorry, I can't test it yet because the current git version does not build for me on RHEL 8.5:
help2man: no locale support (Locale::gettext required)
although I do have that Perl module.
The problem is most likely that help2man needs to be built with gettext support (this is documented!). Merely having the perl module installed is insufficient.
I'm testing the current version (62b996d09df36eae147362511cabfc3b2368dedd). I'm not sure it works yet:
% perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' >in
% ./src/recode UTF-8..UTF-8 <in
[produces invalid UTF-8]
% ./src/recode --strict UTF-8..UTF-8 <in
[still produces invalid UTF-8]
% ./src/recode --force UTF-8..UTF-8 <in
[the same again]
% ./src/recode UTF-8..UTF-8-ignore <in
lt-recode: Request `UTF-8..UTF-8-ignore' is erroneous
In my view if recode is asked to produce UTF-8 output, it should always produce UTF-8 and never junk bytes -- and this is such a basic requirement that it shouldn't depend on any force or strict flags.
I think that if the input is specified as UTF-8 then recode should check that, and die if the input is not valid UTF-8 -- but it can be useful to have a lax mode where junk bytes in the input are skipped as best you can.
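The two policies described here (die on invalid input, or skip junk bytes in a lax mode) can be sketched in a few lines. This is only an illustration in Python, not recode's code: the names `validate_strict` and `validate_lax` are hypothetical, and Python's built-in UTF-8 decoder stands in for whatever checker recode would use.

```python
# Same bytes as the perl one-liner in this thread: 0x80, "A", "α" (0xCE 0xB1), "B", newline.
SAMPLE = bytes([128, 65, 206, 177, 66, 10])


def validate_strict(data: bytes) -> bytes:
    """Return data unchanged if it is valid UTF-8, else raise UnicodeDecodeError."""
    data.decode("utf-8", errors="strict")  # decoding is used purely as a validity check
    return data


def validate_lax(data: bytes) -> bytes:
    """Lax mode: drop any bytes that do not form a valid UTF-8 sequence."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")
```

With `SAMPLE`, `validate_strict` raises an error at the stray 0x80 byte, while `validate_lax` drops that byte and keeps the rest. Either way, whatever comes out is valid UTF-8.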
I'm really sorry, @epa, I commented on the wrong bug above. I have not yet tried to fix this issue, I was asking for your comments on issue #38.
In my view if recode is asked to produce UTF-8 output, it should always produce UTF-8 and never junk bytes -- and this is such a basic requirement that it shouldn't depend on any force or strict flags.
The reason that this does not happen currently is because of recode's conversion optimization. When you request a conversion utf-8..utf-8, as you will see if you use --verbose, this is reduced to a zero-step conversion, a "mere copy". Hence, no validation of the input is performed. I think this is the only case in which invalid output is produced.
If instead a conversion is forced, by e.g.
$ recode UTF-8..UCS-4..UTF-8 <in
Then the input is validated and the problem is found (and so no invalid output is produced).
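As a rough analogue of the difference between a zero-step "mere copy" and a conversion forced through an intermediate encoding, here is a small Python sketch. It does not reflect how recode is implemented: `mere_copy` and `via_ucs4` are hypothetical names, and Python's `utf-32-be` codec is used as a stand-in for UCS-4.

```python
def mere_copy(data: bytes) -> bytes:
    """Zero-step conversion: bytes pass through undecoded, so nothing is validated."""
    return data


def via_ucs4(data: bytes) -> bytes:
    """UTF-8 -> UCS-4 -> UTF-8: decoding the input forces a validity check."""
    text = data.decode("utf-8")      # raises UnicodeDecodeError on invalid input
    ucs4 = text.encode("utf-32-be")  # intermediate form, four bytes per character
    return ucs4.decode("utf-32-be").encode("utf-8")
```

Given the invalid input from this thread, `mere_copy` returns the junk byte untouched, whereas `via_ucs4` fails before producing any output. For valid input the two return identical bytes.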
Also, I notice that with --verbose, recode 3.7, unlike recode 3.6, prints the request out twice. I shall fix that.
Yes, I suspected it might be something like that. But surely the main reason why a user would run recode UTF-8..UTF-8 rather than just cat is to get the extra validation that recode performs. So I think whatever optimization is done should not disable the error checks that prevent bad output.
It's tricky. First, recode might be invoked with UTF-8..UTF-8 as the result of some other computation, not because the user specifically requested it. Secondly, while recode is optimizing a conversion, it might generate this sort of step internally, and there's no reason to keep it.
Internally, recode has no validation of input or output that is separate from a transformation, unfortunately.
See #3. Recode versions 3.6, 3.7.9 and 3.7.11 all produce the same invalid output given invalid input:
Since the behaviour is clearly not new it will require some study to see why it behaves as it does (is it a long-standing bug? or deliberate? or a deep-seated design problem?).