rrthomas / recode

Charset converter tool and library
GNU General Public License v3.0

Better documentation for //IGNORE with iconv #3

Closed epa closed 2 years ago

epa commented 6 years ago

(copied from https://github.com/pinard/Recode/issues/14)

Sometimes recode dies with 'Invalid input'. An --ignore-invalid flag would do whatever is needed to skip over junk bytes in the input, recovering whatever valid text can be found. Of course, there is more than one way to decide what to skip when decoding a multibyte encoding, so it would have to pick something broadly sensible.

I'm not envisaging a fully specified decoding for all possible junk input sequences in all possible encodings, just a best effort to extract whatever usable text remains. For UTF-8, having just read an invalid byte sequence, it could discard the first byte of the sequence and try again.
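The UTF-8 strategy suggested above (drop the first byte of a failed sequence and retry) can be sketched in a few lines of Python. This is only an illustration of the idea, not how recode or iconv behave; the function name `decode_utf8_best_effort` is hypothetical, and the approach re-scans the remaining input after each bad byte, so it is meant for clarity rather than speed.

```python
def decode_utf8_best_effort(data: bytes) -> str:
    """Decode UTF-8, discarding one byte at each point where decoding fails."""
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decode of everything that is left.
            pieces.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start is the offset, within the slice, of the first bad byte.
            # Everything before it decoded cleanly, so keep it.
            pieces.append(data[pos:pos + err.start].decode("utf-8"))
            # Discard only that first byte, then retry from the next one.
            pos += err.start + 1
    return "".join(pieces)


junk = bytes([128, 65, 206, 177, 66])  # a stray 0x80, then "AαB" in UTF-8
text = decode_utf8_best_effort(junk)
print(text.encode("utf-8"))  # the recovered text, re-encoded as UTF-8 bytes
```

Dropping a single byte rather than the whole suspect sequence means a valid character that directly follows a stray lead byte is still recovered.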

Des-Nerger commented 3 years ago

Wouldn't it be the same as --force?

rrthomas commented 3 years ago

@Des-Nerger No, --force allows recoding (of valid input) when the result is not reversible. The proposed --ignore-invalid is about skipping invalid input.

rrthomas commented 2 years ago

This can now be achieved in recode 3.7.11 with iconv, using *-ignore charsets (iconv option //IGNORE).

epa commented 2 years ago

Thanks for the update. Could you give an example of the new recode usage? And are you sure that it skips invalid byte sequences, and not just unknown characters? (In other words the 'invalid input' error would never occur, even given UTF-8 input that had occasional junk bytes mixed in.)

I ask because on my reading of the iconv documentation,

"If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded."

it seems to be about after the input byte sequence has been decoded into characters -- while this feature request is for recode to do some kind of retrying when it encounters errors in decoding the raw bytes.

rrthomas commented 2 years ago

@epa, probably the best way to approach this is to try with iconv first, and then check that recode replicates its behaviour. I don't pretend to fully understand what iconv does with //IGNORE, and my limited experiments suggested that the answer is "not much", at least with glibc iconv on my Ubuntu machine.

epa commented 2 years ago

Thanks, it appears that iconv //IGNORE does skip invalid byte sequences, and the documentation is inaccurate. To start with, let's generate a byte sequence that gives AαB in UTF-8, and see that iconv handles it:

% perl -C0 -E 'say chr(65), chr(206), chr(177), chr(66)' | iconv -f UTF-8 -t UTF-8
AαB

Now we deliberately add a junk byte at the start:

% perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' | iconv -f UTF-8 -t UTF-8
iconv: illegal input sequence at position 0

But with //IGNORE, iconv will keep going:

% perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' | iconv -f UTF-8 -t UTF-8//IGNORE
AαB
iconv: illegal input sequence at position 6

although I don't understand why it reports the error at position 6 rather than position 0.

It appears that //IGNORE in iconv has two effects: to keep trying on junk bytes (the original purpose of this feature request), and after decoding, to skip characters which don't exist in the target character set. But you can disentangle the two by converting from an encoding to itself, as in the above example converting from UTF-8 to UTF-8.

rrthomas commented 2 years ago

Thanks very much for your feedback! Good news that at least iconv is doing what you want. Are you able to test recode 3.7.11? And separately, would you like to suggest what the docs should say?

epa commented 2 years ago

Thanks for the update. Testing recode 3.7.11, it appears to pass through the junk byte unchanged, generating invalid UTF-8:

% perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' | ./src/recode UTF-8..UTF-8
\200AαB

That looks like a bug to me: no matter what the input, if recode is asked to produce UTF-8 output then it must produce valid UTF-8 or die trying. The original feature request was to skip the garbage bytes somehow and make a best effort to produce some valid output despite them, because older versions of recode were strict and would die on invalid byte sequences, but it appears that recode has gone too far in the other direction.
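The "valid UTF-8 or die trying" requirement is easy to check mechanically. Here is a minimal Python sketch, assuming the output shown above corresponds to the byte 0x80 (octal \200) followed by AαB and a newline; the helper name `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


reported_output = b"\x80A\xce\xb1B\n"  # assumed bytes behind "\200AαB"
clean_output = b"A\xce\xb1B\n"         # "AαB" plus newline

print(is_valid_utf8(reported_output))  # False
print(is_valid_utf8(clean_output))     # True
```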

As for the documentation of iconv, it's difficult for me to say because I consider that the behaviour of iconv //IGNORE is not very useful. It combines two things which should be orthogonal: how to handle garbage bytes in the input (die, or make a best effort to skip them), and once the input has been decoded into characters, how to handle characters which can't be represented in the output encoding (die, or skip them and output the rest).

If your program is meant to receive UTF-8 input, and it then needs to convert that to Latin-1 for output to some old printer (for example), then you might want to skip and continue if there are input characters you can't handle. However, you'd still want to die with a useful message if the input just isn't valid UTF-8.

The other way round, I believe the original motivation for this feature request was scraping text from websites. The website might not be very well programmed and might mix bits of other encodings with its normal UTF-8 text, giving essentially indecipherable junk bytes sprinkled through the text. I wanted my program to be robust to those; however, that doesn't necessarily mean that I wanted to forgo the check of legal characters when converting to my final output encoding.
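The two scenarios above differ only in which stage is lenient. As a rough illustration, the following Python sketch keeps the decode step and the encode step under separate controls. It uses Python's built-in "ignore" error handler as a stand-in for "skip"; the function `convert` and its parameters `ignore_invalid` and `skip_unencodable` are hypothetical names, not part of iconv or recode.

```python
def convert(data: bytes, src: str, dst: str,
            ignore_invalid: bool = False,
            skip_unencodable: bool = False) -> bytes:
    """Convert data from src to dst with independent error policies."""
    # Stage 1: bytes -> characters. Only ignore_invalid affects this stage.
    text = data.decode(src, errors="ignore" if ignore_invalid else "strict")
    # Stage 2: characters -> bytes. Only skip_unencodable affects this stage.
    return text.encode(dst, errors="ignore" if skip_unencodable else "strict")


# Printer case: strict about input, lenient about output.
print(convert("AαB".encode("utf-8"), "utf-8", "latin-1",
              skip_unencodable=True))  # b'AB'

# Scraping case: lenient about input, strict about output.
print(convert(b"\x80A\xce\xb1B", "utf-8", "utf-8",
              ignore_invalid=True))    # b'A\xce\xb1B'
```

With both options left at their defaults, either kind of problem raises an exception (UnicodeDecodeError or UnicodeEncodeError), so the caller can tell which stage failed.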

I think iconv (and recode) should have a flag to handle badly encoded input as well as possible, and this flag should be independent of the chosen target encoding. It could be given with some special syntax on the input encoding, but in my opinion the names of character encodings are confusing enough already, so I'd prefer an entirely separate flag.

Then it can have a way to silently drop characters which can't be represented in the chosen output encoding. Personally I'd like a separate flag for that too, as it seems more user-friendly ("Bad character xyz in output; use --skip-unencodable to suppress this error"), but it could also be done with a string appended to the name of the encoding.
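To make the two-flag proposal concrete, here is a small Python sketch of a command-line front end. Everything in it is an assumption for illustration: the program name `convert-sketch`, the options `--from` and `--to`, and the flags `--ignore-invalid` and `--skip-unencodable` are the names proposed in this thread or invented here, and none of them are existing recode or iconv options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="convert-sketch")
    parser.add_argument("--from", dest="src", default="utf-8")
    parser.add_argument("--to", dest="dst", default="latin-1")
    # Hypothetical flags: one per stage, independent of each other.
    parser.add_argument("--ignore-invalid", action="store_true")
    parser.add_argument("--skip-unencodable", action="store_true")
    return parser


def run(argv: list, data: bytes) -> bytes:
    args = build_parser().parse_args(argv)
    try:
        text = data.decode(
            args.src, errors="ignore" if args.ignore_invalid else "strict")
    except UnicodeDecodeError as err:
        raise SystemExit(
            f"Invalid input at byte {err.start}; "
            "use --ignore-invalid to skip it")
    try:
        return text.encode(
            args.dst, errors="ignore" if args.skip_unencodable else "strict")
    except UnicodeEncodeError as err:
        bad = text[err.start]  # first character the target cannot represent
        raise SystemExit(
            f"Bad character U+{ord(bad):04X} in output; "
            "use --skip-unencodable to suppress this error")


print(run(["--skip-unencodable"], "AαB".encode("utf-8")))  # b'AB'
```

Each error message names only the flag for the stage that failed, so turning on leniency for output never hides invalid input, and vice versa.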

rrthomas commented 2 years ago

Thanks for this in-depth analysis; I have opened #37 and #38.

I have found good documentation for the //IGNORE and //TRANSLIT options in iconv(1), and have used that for the Recode manual.