Closed InsertCreativityHere closed 4 months ago
This is not a bug. The URI spec defines all these terms. See for example https://en.wikipedia.org/wiki/Percent-encoding.
1) The broader point of my comment was: "an error message which requires niche domain knowledge or reading a spec isn't a great error message". But this is a niche error, so it's fine. Listing the characters individually would be a pain.
2) After reading the spec, I still believe the error message is in fact technically incorrect, and even worse, I think our logic is incorrect too. TL;DR these characters shouldn't be valid in a fragment "#" / "[" / "]"
, but our comment suggests they are (these are in the reserved character set), and our implementation doesn't seem to reject them either.
Taking the definitions from the spec, a valid fragment is:
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
Expanding this gives us that a fragment can contain:
fragment = *( unreserved / pct-encoded / sub-delims / ":" / "@" / "/" / "?" )
NOTE: HEXDIG
is contained within the unserved
character set, so pct-encoded
only gives us a %
here
fragment = *( unreserved / sub-delims / "%" / ":" / "@" / "/" / "?" )
However, our error message claims that a fragment can contain:
unreserved characters, reserved characters or '%'
We agree with the spec on unreserved
and %
, so lets take those our of list for fragment
. This leaves us with:
fragment = *( sub-delims / ":" / "@" / "/" / "?" )
Per our comment, this should be equivalent to the reserved
character set: reserved = gen-delims / sub-delims
Again, we agree on sub-delims
, so let's take that out. We are left with:
*( ":" / "@" / "/" / "?" ) vs gen-delims
THESE ARE NOT THE SAME THING.
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
Per my reading of the spec these characters are not valid in a fragment: "#" / "[" / "]"
.
However, both our comment, error message, and implementation seem fine letting them through.
https://github.com/icerpc/icerpc-csharp/blob/57995b702436169798fc3fe5fd513e6f09591397/src/IceRpc/ServiceAddress.cs#L198
Just as a sanity check, C#'s URI class does state that it must adhere to RFC 3986 to be considered well-formed.
Per my reading of the spec these characters are not valid in a fragment: "#" / "[" / "]".
I believe these characters are valid in fragments. In particular, I don't see any reason why they would be invalid. The rules are about avoiding ambiguities. The fragment is at the very end of your URI - it's simply everything after the '#' character.
I guess disallowing #
helps reduce ambiguity, even if it isn't strictly necessary?
And maybe allows for more optimized performance when getting the fragment, since you can search from the back of the string for #
if you know there can only be one of them.
That's just speculation on my part though.
A more concrete statement from the spec (emphasis mine):
A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]"). This is the only place where square bracket characters are allowed in the URI syntax.
Again, just speculation, but for square brackets, it seems like they want to leave the door open for future IPvX formats, and who knows what characters could be valid in 'IPv8'.
See https://dotnetfiddle.net/L7dKKQ
The whole point here is to do what the platform does. It's not about "enforcing the spec" when the platform does not. When we decode a Slice2-encoded service address, we count on the .NET Uri class to perform all URI-related validations. We don't want to validate the already Uri-validated string in case the Microsoft Uri class is not conformant enough. So we just adopt the same lax rules as Microsoft. These may be more lax than the spec but they don't introduce any ambiguity.
Note: when we decode a Slice1-encoded service address with a facet, we use Uri.EscapeDataString to create the fragment: https://github.com/icerpc/icerpc-csharp/blob/bbeb9356f0fec09205c426e99e64e0f0c40fd54e/src/IceRpc/Internal/FragmentSliceDecoderExtensions.cs#L14
EscapeDataString
percent-escapes # and other reserved characters:
https://learn.microsoft.com/en-us/dotnet/api/system.uri.escapedatastring?view=net-8.0
So the fragment we get in this case is fully URI-spec compliant.
On the decoding side, I totally agree. As long as we can decode it, it doesn't really matter if the URI is perfectly-to-spec, and we shouldn't waste effort doing so. If it decodes, it's fine.
But on the encoding side, I think it would be safer to follow the spec. Otherwise a "valid" C# URI might be an "invalid" Rust URI. Having a URI's validity depend on which language (and even which crate in Rust) you're using defeats the point of these being uniform and standardized IMO. What do you think?
Also, if this was some significant effort, I would of dropped this a while ago.
But, we're already performing our own validation of URIs, that's why we have this _notValidInFragment
field in the first place. Right? All we'd have to do is add \\[\\]#
to the list of characters, and maybe spend an hour reviewing the other rules as well.
IMO, following the spec here would be a pretty small effort.
What do you think?
You're missing the point.
The most common way to create a ServiceAddress in C# is from a Uri
. When you do that, the IceRPC code doesn't perform extra validation to make triple-sure Uri
did not let through some invalid character.
Then - for consistency - when you construct a ServiceAddress by hand, and set your own fragment, we perform the same lax but unambiguous validation as System.Uri
.
I see your point.
Keeping consistency with the language is a nice-to-have, but more importantly implementing my 'fix' would be pretty pointless, since it's so rare to construct a ServiceAddress
'by hand'.
I'm still a little worried that this might cause problems down the road, since different languages will probably have inconsistent levels of strictness. But I'm going to close this, since we don't have many languages yet, and this is pretty niche. I mean, you need to use some very strange looking URIs to even get into this concern in the first place.
P.S. I'm dropping this, but am still planning to fix the small typos we found, and (IMO) clarify some minor wordings.
Quote:
a valid fragment contains only unreserved characters, reserved characters or '%'.
This seems pretty dubious to me. If a fragment can contain
unreserved characters
andreserved characters
... isn't that all characters? If this is some quirk of the URL spec, it's really not clear as a casual reader.https://github.com/icerpc/icerpc-csharp/blob/57995b702436169798fc3fe5fd513e6f09591397/src/IceRpc/ServiceAddress.cs#L522-L532