Option to have jq generate an error on invalid UTF-8

svdb0 commented 1 year ago

When jq receives a string as input which contains an invalid UTF-8 byte sequence, it will silently replace invalid sequences by UNICODE replacement characters:

$ printf '["foo", "\x96\xdd", "bar"]' | jq -c .
["foo","��","bar"]

$ printf '["foo", "\x96\xdd", "bar"]' | jq -c . | xxd
00000000: 5b22 666f 6f22 2c22 efbf bdef bfbd 222c  ["foo","......",
00000010: 2262 6172 225d 0a                        "bar"].

Silently rewriting data in a one-way way means a loss of information, which I'd argue should be avoided in general, but it can also be a security issue:

If two systems rewrite data in a different way — say for instance one using replacement characters and the other dropping the invalid byte sequence — then validation could take place using one interpretation, while processing occurs using the other, allowing bad data to enter the system.

Also, two strings may compare equal even when they weren't originally.
I am reminded of the case where due to a misconfigured encoding, all non-ASCII characters in a password would be replaced by replacement characters, which made Cyrillic passwords particularly predictable.

I would propose that an option be added to jq for it to produce an error (catchable as with the error function) for each input which is not a valid UTF-8 string, instead of replacing invalid byte sequences by UNICODE replacement characters.

Following the principle of 'secure by default', I would furthermore argue that the option to error on invalid input should be the default behaviour; lowering security should always be a concious decision.

nicowilliams commented 1 year ago

I would propose that an option be added to jq for it to produce an error (catchable as with the error function) for each input which is not a valid UTF-8 string, instead of replacing invalid byte sequences by UNICODE replacement characters.

Note that we can only allow validation errors to be catchable when using input, inputs, and fromjson (or any similar builtins we add later), which essentially means you'll need to use the -n option to jq.

svdb0 commented 1 year ago

That is a good point.

I was indeed thinking about more finegrained error handling, but hadn't thought this through.
It could still be implemented by marking a string as invalid during parsing, while only raising an error when the string would be accessed in a filter, or when it is written to an output stream.
I guess you're halfway to implementing a binary type by then.

Still, even without being able to catch encoding errors in such a fine-grained way, you could often make jq do what you want if it had an option to specify what to do with encoding errors.
I'm thinking of something like this:

--encoding-errors=remove — remove invalid byte sequences from strings
--encoding-errors=replace — use UNICODE replacement characters for invalid byte sequences
--encoding-errors=truncate — truncate the string at the first invalid byte sequence
--encoding-errors=omit — omit invalid strings from the stream
--encoding-errors=next — skip the entire input
--encoding-errors=error — raise an error for the input
--encoding-errors=halt — stop jq altogether

All except halt could be a security risk under the right (wrong) circumstances[^1], but I think there would would be contexts for each where they would be what you want.
(e.g. remove for user output, replace for log entries; truncate for displaying carved data where garbage may follow; omit when your strings are independent and you're only dealing with a sequence of strings and you want to ignore bad strings; next when all inputs are independent and you wish to ignore bad inputs; error when all inputs are independent and you wish to be notified about bad entries while continuing; halt when you do not want to continue with even a single bad string)
That said, often you'll have strings in different contexts in your input, for which errors could not be treated all the same. Then the only proper way to deal with encoding errors would be next, error or halt.

halt would be the safe default, but a shorter form for --encoding-errors=error may be desirable for command-line use. Something like -c (for 'continue on errors') maybe.

[^1]: If you're considering availability, even halt could be a security risk.

svdb0 commented 1 year ago

Wild thought: there could be an operating mode where errors aren't raised when an error condition is encountered, but instead result in some error object being written to the output stream of the filter. Any operation on an error object would result in a new error object.

Only when an error object is written to the output stream of the final filter, would the error be reported.
This would make it possible to collect errors and handle them all at once, and also to ignore errors in parts of expressions which are not used in the output. e.g. [123, 1/0][0].

It could even be something which you could enable within a certain block scope. Something like

propagate_errors [123, 1/0][0] end

Only if the output of such a block contains an error object, would an error be raised, while within the block errors would be propagated through expressions as error objects.

A few new functions like iserror (returns true for values which are error objects), omit_errors (as a shortcut for select(if (. | iserror) then empty else . end)), replace_errors (to recursively replace error objects by null or some specified value) and break_error (to exit the block while raising an error — may need a better name) would come in handy then.

svdb0 commented 1 year ago

The current 'silent replacement' behaviour can give unexpected results even within jq itself:

$ printf '{ "foo\x80bar": 123, "foo\x81bar": 456 }' | jq .
{
  "foo�bar": 456
}

nicowilliams commented 1 year ago

Wild thought: there could be an operating mode where errors aren't raised when an error condition is encountered, but instead result in some error object being written to the output stream of the filter.

That's a non-starter I think because... what would that error object be? I guess maybe if you provide it on the command line that would work?

But the easier thing to do would be for inputs to raise an error (exception) when it reads invalid JSON/UTF-8 (which would be a feature enabled by a command-line option, or maybe different inputs-like builtins), and then you could handle the exception any way you want in your jq code.

Also, we want to avoid an explosion of command-line arguments, so I think this would be best as an inputs/1 that takes a description of how to handle all this stuff, and same for fromjson, and then you can just run jq -n 'inputs({invalid_utf8:"error"})|...'.

svdb0 commented 1 year ago

That's a non-starter I think because... what would that error object be? I guess maybe if you provide it on the command line that would work?

When I say 'error object', I don't mean a JSON object.
It would be a new type, similar to, but distinct from the JSON types, which would only be used internally within jq, and which would never end up in the final output.

To illustrate, if you create an array like this

[1, {"foo": 2 + 1/0}, "foo\x80bar" + "!", [].foo, true, error("Ouch")]

The result would be an error object, which I imagine jq would represent internally as something like

Error(
    Array(
        Number(1),
        Error(
            Object(
                Property(
                    String("foo"),
                    ErrorValue("Division by zero")
                )
            ),
        ),
        ErrorValue("Byte sequence could not be decoded as UTF-8"),
        ErrorValue("Cannot index array with string"),
        Boolean(true),
        ErrorValue("Ouch")
    ),
)

(After giving it some more thought, I introduced Error wrappers here in addition to the ErrorValue object containing the message, so that an iserror function would not need to recurse through the structure.)
(I'd also prefer error subtypes for different types of errors, rather than strings, but I'm mirroring how the jq language uses strings as errors.)

(Sub-)expressions which would otherwise raise an error, evaluate to an error object. Trying to operate on these error objects would result in an error object (probably the original). And error objects could be incorporated in jq objects and arrays, so that the original structure is kept intact.

Error objects would never be seen in input or output; when one is written to the final output stream, jq would report the errors as it does now with uncaught errors.

I imagine that if you were to filter the above error object through a replace_errors(null), the resulting JSON array becomes

[1, { "foo": null }, null, null, true, null]

with no more error objects in the internal representation.

And maybe you could do collect_errors | map(error_string) to produce:

[
    "Division by zero",
    "Byte sequence could not be decoded as UTF-8",
    "Cannot index array with string"
]

But the easier thing to do would be for inputs to raise an error (exception) when it reads invalid JSON/UTF-8 (which would be a feature enabled by a command-line option, or maybe different inputs-like builtins), and then you could handle the exception any way you want in your jq code.

That doesn't give you the fine-grained control though.
If somewhere deep inside your JSON structure there is an invalid string, you know that your JSON is invalid in some way, but that's pretty much where it ends.
If you catch it, you can decide to write an error message or not, but you can't decide to continue with a null value in the place of the invalid string.

But as I said, it was just a wild thought.
I expect that the effort involved in implementing propagating error objects would not be worth the gains.

Also, we want to avoid an explosion of command-line arguments, so I think this would be best as an inputs/1 that takes a description of how to handle all this stuff, and same for fromjson, and then you can just run jq -n 'inputs({invalid_utf8:"error"})|...'.

Interesting idea.
I like that the option would be local to where it is used.

It would however require that you use inputs, and a rather verbose syntax, even for simple filters on the command line.

The harder it is to change the default behaviour, the less likely it is that people will bother. This would make it even more important that the default is secure, and the most secure behaviour would be to stop when an error is encountered.
But often, especially when running from the command line, you'll want to continue after errors.
A one-letter option for this would be welcome then.
(I suggested -c in an earlier comment, but that's obviously taken.)

And even if a less secure mode were to be made the default, a cautious person would like to be able to exit on the first error in a similarly convenient way.

nicowilliams commented 1 year ago

I like that the option would be local to where it is used.

Yes, that's very appealing to me.

It would however require that you use inputs, and a rather verbose syntax, even for simple filters on the command line.

Meh. I'd rather make -n the default, get rid of -s, -r, -R, etc., and let the jq application do what it will via suitable builtins. Not that we can do that now, but if we provide all the necessary builtins for that, then users will have many more "options" than the command-line provides, and we could even have the command-line be implemented in jq code, leaving very little to code in C besides some builtins, the compiler, the linker, and the VM.

wader commented 1 year ago

Meh. I'd rather make -n the default, get rid of -s, -r, -R, etc., and let the jq application do what it will via suitable builtins. Not that we can do that now, but if we provide all the necessary builtins for that, then users will have many more "options" than the command-line provides, and we could even have the command-line be implemented in jq code, leaving very little to code in C besides some builtins, the compiler, the linker, and the VM.

fq re-implement most of jq's CLI interface this way, see https://github.com/wader/fq/blob/master/pkg/interp/init.jq#L242-L270. It's a bit more complicated than just inputs | eval(<query>) | tojson etc because of various error handling and performance reasons. But in general it had felt quite nice to implement in jq, actually i remember it was a mess when jq code was not in "control".

wader commented 1 year ago

... now i remembered that jqjq actually does exactly this https://github.com/wader/jqjq/blob/master/jqjq.jq#L2375-L2387

nicowilliams commented 1 year ago

@wader excellent!! So maybe we should focus on making all strictness and input/output format options just more builtins and call it a day. We should decide on this (or not) and then document our intent for this somewhere, probably in a README in the repository, that way we don't forget these decisions.

svdb0 commented 1 year ago

I like that the option would be local to where it is used.

Yes, that's very appealing to me.

It would however require that you use inputs, and a rather verbose syntax, even for simple filters on the command line.

Meh. I'd rather make -n the default, get rid of -s, -r, -R, etc., and let the jq application do what it will via suitable builtins. Not that we can do that now, but if we provide all the necessary builtins for that, then users will have many more "options" than the command-line provides, and we could even have the command-line be implemented in jq code, leaving very little to code in C besides some builtins, the compiler, the linker, and the VM.

I like this idea. When I first learned about -n, I also thought that it would be nicer if that were the default.
Though I'd suggest aliasing in to inputs, as that's going to be typed a lot. Also in (to me) better represents the idea that it's the input stream, rather than a collection (array) of inputs.

You could still do this, even now. The command line program would just have to be named differently. Say qj or jn (for jq -n). Then jq could become a wrapper around this new interface.

nicowilliams commented 1 year ago

You could still do this, even now. The command line program would just have to be named differently. Say qj or jn (for jq -n). Then jq could become a wrapper around this new interface.

I've thought of this, and I'm not too keen, though I'm open to it. Another option I've considered is a -2 for 2.0+, and another is that when -f program-file is given we could get the desired jq version from the module metadata.

jqlang / jq

Option to have jq generate an error on invalid UTF-8 #2660