servo / rust-url

URL parser for Rust
https://docs.rs/url/
Apache License 2.0
1.27k stars 318 forks

feature request: more complicated percent-encoding rules #830

chayleaf closed 1 year ago

chayleaf commented 1 year ago

Hello, I'm the author of the urn crate. In URN components, certain characters are allowed only from the second byte onward, so they have to be percent-encoded only when they appear as the first byte. There's also one component that can contain both ? and = but can't contain the sequence ?=. Is adding such rules a feature you would consider? I'm willing to contribute the feature myself.

Possible implementation: add a separate function that registers a character together with a fn(&[u8], usize) -> bool predicate, whose arguments are the input slice and the position in that slice, and whose result decides whether the character is encoded at that position. If the character is then added again without a predicate, that overrides the conditional rule.
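To make the proposal concrete, here is a minimal standalone sketch of position-aware encoding. It does not use or extend any existing rust-url or percent-encoding API; the names `ShouldEncode`, `encode_with`, and `example_rule` are hypothetical, and the rule itself is only an illustration of the two cases above, not the actual URN grammar.

```rust
/// Hypothetical predicate type from the proposal: given the whole input and
/// a byte position, report whether that byte must be percent-encoded.
type ShouldEncode = fn(&[u8], usize) -> bool;

/// Percent-encode `input`, asking `should_encode` about each byte in context.
fn encode_with(input: &[u8], should_encode: ShouldEncode) -> String {
    const HEX: &[u8; 16] = b"0123456789ABCDEF";
    let mut out = String::with_capacity(input.len());
    for (i, &b) in input.iter().enumerate() {
        // Non-ASCII bytes are always encoded so the output stays ASCII.
        if !b.is_ascii() || should_encode(input, i) {
            out.push('%');
            out.push(HEX[(b >> 4) as usize] as char);
            out.push(HEX[(b & 0x0F) as usize] as char);
        } else {
            out.push(b as char);
        }
    }
    out
}

/// Illustrative rule only (an assumption, not the real URN rules):
/// `/` is encoded only as the first byte, `?` is encoded only when it is
/// immediately followed by `=`, and a literal `%` is always encoded.
fn example_rule(input: &[u8], pos: usize) -> bool {
    match input[pos] {
        b'%' => true,
        b'/' => pos == 0,
        b'?' => input.get(pos + 1) == Some(&b'='),
        _ => false,
    }
}
```

With this rule, `encode_with(b"/a/b", example_rule)` escapes only the leading slash, and `encode_with(b"a?=b", example_rule)` escapes only the `?`, so the output never contains a raw `?=`.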

valenting commented 1 year ago

Hi,

Could you explain how such a feature would help with the urn crate, or how it would help with URL parsing in general? How would you use it? Note that this crate explicitly only implements the URL standard, so it does deviate from RFC 3986. Keep that in mind if you want to use it as a dependency.

Thanks!

chayleaf commented 1 year ago

In URNs, %2F and / aren't necessarily semantically equivalent, i.e. the namespace-defined rules may specify different semantics for raw and percent-encoded chars. This means that I need to add a customizable percent-encoding function which lets the developer specify which chars they need to escape.

At the same time, I want the default rules to escape the absolute bare minimum. Since the allowed characters differ based on the position in the component, finding the bare minimum requires knowing the position of each char in the component.

But now that I think of it, you can't know the position of a char in the component if the user provides only part of the component. E.g. if the user wants to percent-encode some parts and join them with raw /, they will pass each part separately, so the position of those parts within the component stays unknown.

For that reason, I'll probably add a "fixup" function that ensures the entire string is a valid component by percent-encoding invalid chars, but doesn't encode percent signs. This seems a little error-prone, but I don't see an alternative.
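A rough sketch of what such a "fixup" pass could look like, assuming a made-up validity rule (unreserved ASCII is always allowed, `/` and `?` are allowed anywhere except the first byte). The function name `fixup` and the rule are hypothetical and not taken from the urn crate.

```rust
/// Hypothetical fixup: make a whole component valid by percent-encoding
/// bytes that are not allowed at their position. Existing `%` signs are
/// passed through untouched, so already-encoded sequences survive.
fn fixup(component: &str) -> String {
    let bytes = component.as_bytes();
    let mut out = String::with_capacity(bytes.len());
    for (i, &b) in bytes.iter().enumerate() {
        // Assumed rule for illustration only, not the real URN grammar.
        let allowed = match b {
            b'%' => true,           // never re-encode percent signs
            b'/' | b'?' => i != 0,  // allowed from the second byte onward
            b'-' | b'.' | b'_' | b'~' => true,
            _ => b.is_ascii_alphanumeric(),
        };
        if allowed {
            out.push(b as char);
        } else {
            out.push_str(&format!("%{:02X}", b));
        }
    }
    out
}
```

Because `%` is passed through without checking what follows it, a stray `%` that is not part of a valid escape is left as-is, which is the error-prone part. On the other hand, the pass is idempotent under this rule: running it on its own output changes nothing.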

So since this can't be solved on the level of this crate, I'm closing this.