remotestorage / spec

remoteStorage Protocol Specification
https://tools.ietf.org/html/draft-dejong-remotestorage
87 stars 5 forks source link

define URI_ENCODE #103

Open ghost opened 8 years ago

ghost commented 8 years ago

URI_ENCODE is not defined in the specification. It probably refers to this: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURI but this needs to be made explicit.

Also: it should be made explicit whether or not the server needs to decode this, and how, and what should be shown in folder listings, the encoded or decoded variant.

ghost commented 8 years ago

JS behavior:

encodeURI('http://www.example.org/foo/bar/baz foo.txt');

/*
http://www.example.org/foo/bar/baz%20foo.txt
*/

decodeURI('http://www.example.org/foo/bar/baz%20foo.txt');

/*
http://www.example.org/foo/bar/baz foo.txt
*/

decodeURI('http://www.example.org/foo/bar/baz%2ffoo.txt');
/*
http://www.example.org/foo/bar/baz%2ffoo.txt
*/
raucao commented 8 years ago

That looks correct.

ghost commented 8 years ago

So ideally we describe the (complete) mapping between what is supposed to go out of the client, and what is supposed to be in the directory listing and describe allowed/disallowed value.

For example, what happens when the client sends: https://storage.example/foo/bar%2fbaz.txt? Should this be represented as https://storage.example/foo/bar/baz.txt or rejected with a 400? What happens when the client send foo%2e.txt? Represented as foo.txt? Etc.

untitaker commented 8 years ago

None of those choices are really enforcable. Depending on which proxies you install (reverse-proxies or otherwise), you'll get different behavior that isn't controllable by the server.

IMO if you decide on one, it should come with a very strong recommendation for clients to not rely on this behavior. But if you specify behavior which can't be relied on, why specify it in the first place?

ghost commented 8 years ago

All very true!

What was the rationale to have this URI_ENCODE requirements in clients in the first place? Also, the client should possibly be able to "understand" encoded entries in directory listings... I guess that would solve all issues?

Update: or maybe clients do not need to understand, they should just not fiddle with entries from the directory listing, but use them as-is without calling URI_ENCODE again...

untitaker commented 8 years ago

Also, the client should possibly be able to "understand" encoded entries in directory listings... I guess that would solve all issues?

You mean the client should be able to deal with both encoded and unencoded entries? Having to do that in practice would be a worst-case scenario, but I don't think we should require it...

Update: or maybe clients do not need to understand, they should just not fiddle with entries from the directory listing, but use them as-is without calling URI_ENCODE again...

The 5apps storage doesn't seem to urlencode unsafe characters in directory listings. I.e., it gives me this listing:

{
    "@context": "...",
    "items": {"äää": ...}
}

but on PUT requests I obviously have to encode äää (and do).

ghost commented 8 years ago
var url = "http://localhost/a b c.jpg";
var url = encodeURI("http://localhost/a b c.jpg");

Using XMLHttpRequest() with those URLs results in the exact same request. The space gets encoded to %20.

Now,

var url = "http://localhost/a b%20c.jpg";
var url = encodeURI("http://localhost/a b%20c.jpg");

Results in two different request URIs: http://localhost/a%20b%20c.jpg and http://localhost/a%20b%2520c.jpg (double encoding in the second scenario).

To sum up: the specification requests that the client uses URI_ENCODE, which is of course necessary! But the browser already does that, so the library should not do that again...

If URI_ENCODE is mentioned in the spec then it needs to make explicit which encoding it is, talk about double encoding and define a mapping between request URL (from the client) to how the server should interpret it, e.g. how to show it in directory listing.

untitaker commented 8 years ago

But the browser already does that, so the library should not do that again...

The browser is part of the API client. Whether we call encodeURI or the browser doesn't matter as long as it results in a correctly formatted HTTP request.

If URI_ENCODE is mentioned in the spec then it needs to make explicit which encoding it is, talk about double encoding and define a mapping between request URL (from the client) to how the server should interpret it, e.g. how to show it in directory listing.

We might as well drop any mention of URI_ENCODE from the spec, and clearly specify that nodenames in folder listings should not be URL encoded. That nodenames have to be urlencoded in URLs is already required by other rfcs.

ghost commented 8 years ago

The browser is part of the API client. Whether we call encodeURI or the browser doesn't matter as long as it results in a correctly formatted HTTP request.

No it is not. The client could also be cURL, which does not URL encode by itself, which is actually funny:

[fkooman@localhost ~]$ curl -v "http://localhost/foo bar baz.txt"
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /foo bar baz.txt HTTP/1.1
> User-Agent: curl/7.40.0
> Host: localhost
> Accept: */*

We might as well drop any mention of URI_ENCODE from the spec, and clearly specify that nodenames in folder listings should not be URL encoded.

That would be nice! But again, what is URL encoded? it has to be made explicit somewhere...

That nodenames have to be urlencoded in URLs is already required by other rfcs.

Which ones? I guess you are right, but referencing them would be nice :)

untitaker commented 8 years ago

Which ones? I guess you are right, but referencing them would be nice :)

https://tools.ietf.org/html/rfc3986

You just can't have spaces in paths.

URL-encoding is also specified in that RFC.

untitaker commented 8 years ago

No it is not. The client could also be cURL, which does not URL encode by itself

Right, such a request would violate the spec.

Why would the server care whether the browser or the JS code URL-encodes? Either result in valid requests => we have a conformant client.

ghost commented 8 years ago

From that RFC:

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

rs.js and browser both encode, so it is done twice in the client.

untitaker commented 8 years ago

The browser only does that on broken URLs, as you can clearly see from the first example you posted. It's not a simple URI-encode, it's decode-and-encode. Or something more complicated.

In the second example the first URI is fixed up again by replacing the unsafe space with %20, the second URI is just broken.

ghost commented 8 years ago

You lost me. The whole point is to send "non-broken" URLs in requests, why else would you encode? Just for the fun of it?

untitaker commented 8 years ago
var url = "http://localhost/a b c.jpg";
var url = "http://localhost/a%20b%20c.jpg";

They result in the same request, but for different reasons.

  1. The first URL is broken. The browser sees that spaces are not allowed in paths, and fixes it.
  2. The second URL is completely correct. The browser doesn't attempt to fix it up and sends it as-is.

This isn't a simple URL-encode done by the browser, otherwise the second URL would be doubly encoded. It isn't, otherwise you'd notice for sure.

It's much more closer to "url-encode all unsafe characters, except for %".

This isn't some algorithm that's specified in an RFC, it's something browsers came up with to work around defective websites.

raucao commented 8 years ago

Great explanation. I just deleted my half-written worse one. :)

ghost commented 8 years ago

Ah, so the browser leaves properly encoded URLs alone and does not modify them anymore, so there is only one encoding in that situation. Cool :)

Now we only need to define what URI_ENCODE does exactly (rs.js uses encodeURI) and how to reverse that encoding.

michielbdejong commented 8 years ago

@fkooman ok, can you take care of making this clearer in the spec?

ghost commented 8 years ago

@michielbdejong sure, do you have any additional thoughts on this issue?

raucao commented 8 years ago

(And maybe a hint where URI_ENCODE came from. ;))

michielbdejong commented 8 years ago

We had a long discussion about that in 2011, about how "a b c.jpg" and "a%20b%20c.jpg" are the same resource on the server, and especially also how a%2Fb.jpg and a%2fb.jpg are the same resource, but a%2Fb.jpg and a%2FB.jpg are not. It's all pretty obvious when you realize that the server should only think about the URL paths after decoding them.

Seems URI_ENCODE is defined in RFC3986? https://en.wikipedia.org/wiki/Percent-encoding#Current_standard

untitaker commented 8 years ago

how a%2Fb.jpg and a%2fb.jpg are the same resource, but a%2Fb.jpg and a%2FB.jpg are not.

That much is already obvious. They're the same hex value. Confusion was more regarding a/b%2f/c vs a%2fb%2fc, or a@b vs a%40b. Nginx treats those as the same value, Apache doesn't. I'm not really sure whether we need to clarify this kind of behavior in the remoteStorage RFC.

Seems URI_ENCODE is defined in RFC3986?

Not with that exact name at least.

michielbdejong commented 8 years ago

Slashes have a function in our URLs, ampersands don't. So I think a/b%2f/c and a%2fb%2fc are both allowed and different, a%40b is also allowed, and behavior for a@b is undefined, right?

ghost commented 8 years ago

I think what would be really helpful is to have a mapping between the request URI and how it shows up in the folder listing.

This can be done by pointing to a definition of URI_ENCODE/DECODE and be done with it once and for all :)

michielbdejong commented 8 years ago

Item names MAY contain all characters except '/' and the null character

So we don't have to worry about how to deal with URLs that contain encoded slashes, it will always be clear where each item name starts and stops. Let's choose the decodeURIComponent JavaScript function as the official way to determine which item name is meant by a certain string between two slashes, or after the last slash.

michielbdejong commented 8 years ago

The server should probably respond with something like a 400 if the URL refers to item names that contain slashes or null characters, because those are not allowed in item names.

michielbdejong commented 8 years ago

Or we can just say the server's behavior is undefined in this case.

ghost commented 8 years ago

Can we make this list explicit? %2F, %2f, /, %00?

ghost commented 8 years ago

%2F, %2f, %00?

ghost commented 8 years ago
check if request URI contains any of %2F, %2f, %00
    return 400;
split request uri (after storage root) on '/'.
foreach component -> decodeURIcomponent
untitaker commented 8 years ago

So I think a/b%2f/c and a%2fb%2fc are both allowed and different

As stated, not possible with nginx. It conflates all of those.

a%40b is also allowed, and behavior for a@b is undefined, right?

Both are valid URLs and mean the same thing, because @ is not a reserved character in paths.

ghost commented 8 years ago

Also nice one: /foo/%2e%2e/bar/documents/abc.txt points to what? Files of user foo, or of user bar? Or is invalid? I guess it must be normalized first...

Another requirement: normalize the request URI first.

raucao commented 8 years ago

points to what? Files of user foo,

The Webfinger record tells you the storage root for the user. It can vary between servers, and there's no rule for having the username as first part of the full path for a document or folder. No?

untitaker commented 8 years ago

Foo is definetly the first path segment, even normalized. You'd have to try what nginx makes of it though.

On 18 November 2015 11:10:11 CET, "François Kooman" notifications@github.com wrote:

Also nice one: /foo/%2e%2e/bar/documents/abc.txt points to what? Files of user foo, or of user bar? Or is invalid? I guess it must be normalized first...


Reply to this email directly or view it on GitHub: https://github.com/remotestorage/spec/issues/103#issuecomment-157665714

Sent from my phone. Please excuse my brevity.

michielbdejong commented 8 years ago

It's allowed to call a folder or a document '..', and it's up to the implementer to work out how to use nginx or any other software in the right way to get the correct result.

What @fkooman summarized in https://github.com/remotestorage/spec/issues/103#issuecomment-157381023 is correct, with the side note that we should make the 400 a 'MAY' in the spec - the server may respond with anything it wants to nonsense requests (i.e. requests whose target URL includes illegal item names)

ghost commented 8 years ago

well, even if the spec says that you MUST support .. as file or folder name I won't implement that :-)

raucao commented 8 years ago

Oh, now I understand your question. I think you're conflating file systems with Web URLs. An encoded .. is allowed at the moment. It might pose a problem for some clients (like e.g. a FUSE module maybe), but in general I would also agree that that's the implementor's problem to solve.

Just practically speaking, how would you prevent that folder name other than flat-out black-listing strings you don't like?

untitaker commented 8 years ago

It seems that Firefox doesn't even support that properly. It decodes the dot after sending the request, and refreshing the page sends the dots decoded.

IMO the current state re allowed characters is fine, practical restrictions will come up anyway.

On 19 November 2015 15:28:00 CET, "François Kooman" notifications@github.com wrote:

well, even if the spec says that you MUST support .. as file or folder name I won't implement that :-)


Reply to this email directly or view it on GitHub: https://github.com/remotestorage/spec/issues/103#issuecomment-158072770

Sent from my phone. Please excuse my brevity.

raucao commented 8 years ago

It seems that Firefox doesn't even support that properly. It decodes the dot after sending the request, and refreshing the page sends the dots decoded.

That's because you don't have to encode dots, no?

untitaker commented 8 years ago

/%2e%2e/ is sent like that by Firefox, then the URL bar is corrected to /../ (not a redirect), the next request is sent with decoded dots, and the server normalizes /../ by removing it and the previous path segment.

On 19 November 2015 15:34:56 CET, Sebastian Kippe notifications@github.com wrote:

It seems that Firefox doesn't even support that properly. It decodes the dot after sending the request, and refreshing the page sends the dots decoded.

That's because you don't have to encode dots, no?


Reply to this email directly or view it on GitHub: https://github.com/remotestorage/spec/issues/103#issuecomment-158074397

Sent from my phone. Please excuse my brevity.

michielbdejong commented 8 years ago

https://github.com/remotestorage/spec/issues/103#issuecomment-158072770

well, even if the spec says that you MUST support .. as file or folder name I won't implement that :-)

You naughty boy! :)

https://github.com/remotestorage/spec/issues/103#issuecomment-158074915

is sent like that by Firefox

We don't care how a client takes UI input and does translations before it sends the request - we only care about which URL reaches the server in the actual http request.

untitaker commented 8 years ago

I said that a refresh of the page sends a different request.

On 19 November 2015 15:47:14 CET, Michiel de Jong notifications@github.com wrote:

https://github.com/remotestorage/spec/issues/103#issuecomment-158072770

well, even if the spec says that you MUST support .. as file or folder name I won't implement that :-)

You naughty boy! :)

https://github.com/remotestorage/spec/issues/103#issuecomment-158074915

is sent like that by Firefox

We don't care how a client takes UI input and does translations before it sends the request - we only care about which URL reaches the server in the actual http request.


Reply to this email directly or view it on GitHub: https://github.com/remotestorage/spec/issues/103#issuecomment-158077595

Sent from my phone. Please excuse my brevity.

untitaker commented 8 years ago

Sorry, I have to correct: F5 sends the same request. However, hitting enter in the location bar sends a different one.

In either case nginx conflates /%2e%2e/ with /../.

In the end these are practical restrictions of HTTP. I don't think the remoteStorage spec should be concerned with any of them. As already stated above, I think a implementors note for clients should be appropriate though.

michielbdejong commented 8 years ago

@untitaker we don't care if the http request came from page request or not, we don't even care if it came from a browser or from a command-line tool. All client-side processes leading up to the http request are out of scope. We just define how a server should respond to a request once it's received.

michielbdejong commented 8 years ago

feel free to add an implementor's note for clients

ghost commented 8 years ago

actually. the server will usually normalize the URL and the /../ and /./ stuff will just be removed.

untitaker commented 8 years ago

@fkooman Yes, that of course, but the same applies to /%2e%2e/.

ghost commented 8 years ago

yup

untitaker commented 8 years ago

https://github.com/remotestorage/spec/pull/112

ghost commented 8 years ago

I'd prefer something more along these lines:

Server implementations MUST normalize URLs according to the rules in https://tools.ietf.org/html/rfc3986#section-6.

As for dot segments, see: https://tools.ietf.org/html/rfc3986#section-5.2.4