rrthomas / recode

Charset converter tool and library
GNU General Public License v3.0

Error handling with //IGNORE (iconv) #38

Closed rrthomas closed 2 years ago

rrthomas commented 2 years ago

(See #3.) The iconv(1) man page says:

If the string //IGNORE is appended to to-encoding, characters that cannot be converted are discarded and an error is printed after conversion.

This is indeed what happens with both iconv and recode, currently. Arguably, recode should not emit an error unless in --strict mode in this case. I think this means we want to ignore an EILSEQ return code from iconv unless we're in strict mode.

rrthomas commented 2 years ago

I have this the wrong way around: in --strict mode, we should use //IGNORE (because --strict discards untranslatable input). However, we should return an error in this case unless we also use --force.
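
The corrected rule can be written as a small decision function. This is only a rough model of the behaviour described in the comment above, not recode's actual code: the names `IconvPlan` and `plan_iconv_call` are hypothetical, and the non-strict case with `--force` is an assumption, since the comment only covers `--strict`.

```python
from typing import NamedTuple


class IconvPlan(NamedTuple):
    """Hypothetical description of how a conversion would be handed to iconv."""
    to_encoding: str               # encoding string passed to iconv
    error_on_untranslatable: bool  # whether untranslatable input yields an error


def plan_iconv_call(to_encoding: str, strict: bool, force: bool) -> IconvPlan:
    # --strict discards untranslatable input, so it maps to iconv's //IGNORE suffix.
    target = (to_encoding + "//IGNORE") if strict else to_encoding
    # An error is still returned for untranslatable input unless --force is given.
    # (Assumption: --force suppresses the error in the non-strict case as well.)
    return IconvPlan(target, error_on_untranslatable=not force)


if __name__ == "__main__":
    for strict in (False, True):
        for force in (False, True):
            print(strict, force, plan_iconv_call("ASCII", strict, force))
```

Keeping the choice of `//IGNORE` separate from the choice of whether to report an error reflects the point of the correction: the two flags control different things.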

rrthomas commented 2 years ago

@epa, please could you let me know if this now works for you as advertised, and with documentation that makes sense? (Sorry for the confusion over which issue I was asking for feedback about.)

epa commented 2 years ago

Hi, remembering to test the current git version this time, I get

% perl -C -E 'say chr(65), "\N{GREEK SMALL LETTER ALPHA}", chr(66)' >in
% ./src/recode UTF-8..ASCII <in && echo yes
Alt-recode: Untranslatable input in step `ISO-10646-UCS-2..ANSI_X3.4-1968'
% ./src/recode --strict UTF-8..ASCII <in && echo yes
Alt-recode: Untranslatable input in step `ISO-10646-UCS-2..ANSI_X3.4-1968'
% ./src/recode --force UTF-8..ASCII <in && echo yes
AB
yes

So --force quietly skips characters that can't be represented in the output encoding, and doesn't exit with failure just because of them. --strict seems to make no difference to the default behaviour.

rrthomas commented 2 years ago

Thanks very much, this is not quite working right yet, so your feedback is most useful. In the case of --strict it should still give the output AB but with error, and when --force is added, the same output and no error.

epa commented 2 years ago

Is it too late to change the names of these options?

rrthomas commented 2 years ago

I agree they're not obvious, but they are very old, so I fear it is too late.

rrthomas commented 2 years ago

I repeated your tests, and I observed that you did not use iconv. When I do, by adding the --prefer-iconv flag, recode behaves as expected: with --strict it produces the output and then reports an error at the end; without --strict it gives up at the first untranslatable input.

The difference from the way recode behaves when not using iconv is unfortunate, but it's a consequence of the way iconv works: when it finds untranslatable input, iconv reports an error at the end of conversion, whereas recode's native conversions stop immediately unless --force is used.
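
The two error-reporting strategies can be contrasted with a minimal Python sketch. It uses Python's own codecs rather than iconv or recode, and the function names are hypothetical. It encodes one character at a time, which only suits stateless target encodings such as ASCII.

```python
from typing import Tuple


def convert_stop_at_first(text: str, encoding: str = "ascii") -> Tuple[bytes, bool]:
    """Model of a conversion that gives up at the first untranslatable character."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            return bytes(out), True   # partial output, error flagged immediately
    return bytes(out), False


def convert_discard_and_report(text: str, encoding: str = "ascii") -> Tuple[bytes, bool]:
    """Model of a conversion that drops untranslatable characters and reports at the end."""
    out = bytearray()
    failed = False
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            failed = True             # remember the failure, keep converting
    return bytes(out), failed


if __name__ == "__main__":
    sample = "A\N{GREEK SMALL LETTER ALPHA}B\n"  # same input as the test above
    print(convert_stop_at_first(sample))
    print(convert_discard_and_report(sample))
```

With the test input from earlier in the thread, the first function returns only `A` with the error flag set, which appears consistent with the stray `A` in front of the error messages in the transcript. The second returns `AB` plus the newline with the error flag set, matching the "output then error" behaviour described for --strict with iconv.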

In any case, I'll consider this issue closed and make a release. Thanks once again for your help!