`InvalidUri` errors could probably include the invalid bytes

hawkw commented 1 year ago

The InvalidUri error type is returned when trying to parse an input as a URI, if the bytes are not a valid URI. This error type does not contain the value that we were trying to parse, but does indicate how the input was invalid.

Since all methods of parsing a value as a Uri either take a Bytes or convert the input (string or byte slice) into a Bytes, the error type probably could include the input, and format it as part of their fmt::Display/fmt::Debug implementations. Uri::from_shared takes a Bytes by value, so returning it in the error value shouldn't have a significant performance cost --- if we're trying to parse a URI from an &str or &[u8], we will have already copied it in order to convert it into a Bytes, and if the input was itself a Bytes, the reference count has already been incremented in order to call Uri::from_shared.

This would make it much easier for an application to display the invalid input to the user, which is desirable, especially in cases where the invalid URI came from user input.

LucioFranco commented 1 year ago

Would it make sense in the display to limit the amount of viewable bytes? That said, I think this makes sense to me.

hawkw commented 1 year ago

Would it make sense in the display to limit the amount of viewable bytes? That said, I think this makes sense to me.

Yeah, we probably want to have some kind of "common sense limit" on how much of the URI we print...

seanmonstar commented 1 year ago

On the one hand, better errors are a great user-centric improvement, I love em! If there's a way to improve it for users, even just some of the time, that'd be wonderful.

On the other hand, here's some concerns we'd need to overcome:

Some of the parsing methods take Bytes, which is cheap to clone, or even just own, similar to the TryFrom<String>. But those taking a slice, there'd be a cost to copying into a new vec to store in the error.
If it were a Bytes, would storing a clone of it in an error message make the original buffer possibly live way longer than one assumed (considering the Bytes could be a small slice of a bigger one that took a big read from the socket)?
- Would it be a case for a Bytes method that forced a copy if the diff were big enough?
The parse data usually comes from an outsider, would printing that data as part of the error message have any security considerations? I imagine the answer is a tentative yes, as in "people could screw up and print it in a way that an attacker can do bad things", so it wouldn't be our fault, but it could be making it easier for some to be pwned.
Likewise from a privacy point of view: would the data possibly have PII that people logging these errors weren't expecting?

hawkw commented 1 year ago

Some of the parsing methods take Bytes, which is cheap to clone, or even just own, similar to the TryFrom<String>. But those taking a slice, there'd be a cost to copying into a new vec to store in the error.

Hmm, don't the parsing methods that take a slice convert the slice into a Bytes before trying to parse it, anyway? E.g. https://github.com/hyperium/http/blob/34a9d6bdab027948d6dea3b36d994f9cbaf96f75/src/uri/mod.rs#L713-L715

If I understand the code correctly, it looks like all paths that construct a Uri from a &[u8] or an &str will already copy the bytes into a Bytes before trying to parse, and then the Bytes is dropped if the URI is invalid? Or am I misunderstanding something here?

The parse data usually comes from an outsider, would printing that data as part of the error message have any security considerations? I imagine the answer is a tentative yes, as in "people could screw up and print it in a way that an attacker can do bad things", so it wouldn't be our fault, but it could be making it easier for some to be pwned.

Likewise from a privacy point of view: would the data possibly have PII that people logging these errors weren't expecting?

This is a valid concern --- perhaps we would want to add a method to the InvalidUri error to access the input bytes, rather than always including it in the Debug implementation? That way, the application can choose whether or not to actually log the URI value. The downside is that this would probably mean that, for hyper users, hyper would need to actually expose the http::uri::InvalidUri as a source and/or downcast target for its error types, or else there would be no way for the application to explicitly choose to display the invalid input...

hawkw commented 1 year ago

To elaborate on the motivation here a bit, the particular use-case I'm thinking of with this feature request is one where the errors are being returned by a hyper server. In that case, there isn't currently a good way for the application to choose whether or not it wants to display the invalid input, because the buffer that the bytes were read into is owned by hyper, and it's hyper's HTTP parser that's calling into http::Uri's parsing functions, rather than the application itself. In that case, the application can't easily access the portion of the buffer that contained the invalid URI, without doing something really unpleasant like wrapping the underlying IO resource to buffer all the bytes, and essentially reimplementing most of the HTTP parser just to find potentially invalid URIs.

For applications where the primary interaction with URI parsing is passing a user input that might be a URI into hyper, it would be much easier for the application to just hang onto the input and choose to log it if hyper returns an invalid URI error. So, I'm mainly concerned about the server use case here.

seanmonstar commented 1 year ago

don't the parsing methods that take a slice convert the slice into a Bytes before trying to parse it, anyway?

Uh, well, woops. That could be improved...

To elaborate on the motivation here a bit, the particular use-case I'm thinking of with this feature request is one where the errors are being returned by a hyper server.

Making better error messages is already splendid motivation, I didn't mean to seem against the idea. I think we should always help people understand what went wrong. Just wanted to make sure we didn't miss anything while doing so.

hawkw commented 1 year ago

don't the parsing methods that take a slice convert the slice into a Bytes before trying to parse it, anyway?

Uh, well, woops. That could be improved...

Hmm, if we're planning to change those functions to only copy the bytes into a Bytes if the URI is valid, I suppose we could make the input component of the error optional, and only include it when the URI was constructed from a Bytes or an already-owned String etc? But I think that's probably only a good idea if we're going to include the input in the Display output, rather than allowing explicit access to it, since it seems a bit weird for the error to sometimes contain the invalid input and sometimes not contain it, depending on what type the URI was parsed from...

To elaborate on the motivation here a bit, the particular use-case I'm thinking of with this feature request is one where the errors are being returned by a hyper server.

Making better error messages is already splendid motivation, I didn't mean to seem against the idea. I think we should always help people understand what went wrong. Just wanted to make sure we didn't miss anything while doing so.

We're in agreement on that --- you didn't come off as against it at all! I also want to consider all the possible reasons we shouldn't do this. I mainly thought it was worth explaining the motivation in order to see whether there might be other ways to surface invalid inputs in that don't have the same drawbacks as always including it in the error (e.g. some kind of new API in Hyper?)...

hawkw commented 1 year ago

Quick ping on this, I'd be happy to open a PR to add it if we can come to consensus on the constraints!

hyperium / http

`InvalidUri` errors could probably include the invalid bytes #571