MOLD of URL containing unicode chars is invalid

Oldes commented 5 years ago

Check this:

>> http://foo?šiška
== http://foo?Å¡iÅ¡ka

In comparison, Ren-C is also wrong (in a different way):

>> http://foo?šiška
== http://foo?aiaka

Rebol2 and Red are OK:

>> http://foo?šiška
== http://foo?šiška
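
Both wrong outputs match well-known encoding slips. The Python sketch below is an assumption about the cause, not something confirmed in this thread: the first output looks like the UTF-8 bytes of the text re-read as Latin-1, and the Ren-C output looks like each code point truncated to its low 8 bits.

```python
# Illustration only: reproduces the two wrong outputs above in Python.
# That these are the actual causes in either interpreter is an assumption.
text = "šiška"

# UTF-8 bytes of the text, wrongly decoded as Latin-1 (one character per byte).
mojibake = text.encode("utf-8").decode("latin-1")

# Each code point truncated to its low 8 bits (U+0161 -> 0x61, i.e. "a").
truncated = "".join(chr(ord(ch) & 0xFF) for ch in text)

print(mojibake)   # Å¡iÅ¡ka
print(truncated)  # aiaka
```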

rgchris commented 5 years ago

I don't observe the same behaviour in Ren-C (Web or Mac Terminal). See also: Ren-C Pull 655.

Oldes commented 5 years ago

Good for Ren-C then... I have a slightly older Ren-C version.

Oldes commented 5 years ago

@rgchris regarding the mentioned Ren-C's pull request, I prefer my version (compatible with Red), where it is like:

>> mold append ftp:// "%28"
== "ftp://%2528"
>> form append ftp:// "%28"
== "ftp://%28"

versus Ren-C's:

>> mold append ftp:// "%28"
== "ftp://%28"
>> form append ftp:// "%28"
== "ftp://%28"
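
The difference between the two behaviours can be modelled in Python. This is a simplified sketch of the policy described above, not of either interpreter's implementation, and `mold_url` and `form_url` are hypothetical names. The molded form escapes the literal percent sign so that decoding it yields the original content, while the formed output emits the content unchanged.

```python
from urllib.parse import quote, unquote

def mold_url(scheme, content):
    # Escape everything that is not an unreserved character, including "%",
    # so that decoding the result gives back the original content.
    return scheme + quote(content, safe="")

def form_url(scheme, content):
    # Emit the content as-is, with no escaping.
    return scheme + content

molded = mold_url("ftp://", "%28")
formed = form_url("ftp://", "%28")
print(molded)                           # ftp://%2528
print(formed)                           # ftp://%28
print(unquote(molded[len("ftp://"):]))  # %28
```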

Oldes commented 5 years ago

@rgchris Also I prefer:

>> to-url {http://foo boo}
== http://foo%20boo

versus Ren-C's:

>> to-url {http://foo boo}
== http://foo boo

hostilefork commented 5 years ago

>> to-url {http://foo boo}
== http://foo%20boo

Good point to raise...

But I do think we're on the right track following @rgchris's take that you should be able to round-trip URLs that are copied to and from the address bar of your browser, to the extent that what's in the viewer's consciousness is the "source code". I think that's what makes the URL type valuable, more than any automatic escaping does.

Browser rules are weird, though. There's some writing about it in The URL Standard:

"the path, query, and fragment components of the URL should have their sequences of percent-encoded bytes replaced with code points resulting from percent decoding those sequences converted to bytes, unless that renders those sequences invisible."

(See also URL Escape Guidelines)

For this to work, Chrome has to assume any percents that appear are escaping-percents. So I presume that READ or other URL operations would do the same.

This doesn't give easy or obvious answers to building up URLs programmatically from strings, when those strings aren't escaped. I've been thinking that URLs would be immutable, so you couldn't end up in situations like:

rebol2>> reverse http://example.com
== moc.elpmaxe//:ptth

With immutability and being forced to use JOIN, there's a moment where you can check for badly formed URLs. And I'd suggest that a % that isn't part of a %-escape, or a stray space, would be disallowed:

ren-c>> join http://example.com/ "100%"
** Script error: Percent in URL! must be hex encoded character bytes

ren-c>> join http://example.com/ "abc def"
** Script error: Space not legal in URL, use URL-ENCODE before joining

ren-c>> url-encode "abc def"
== "abc%20def"  ; a text string, not a URL!

ren-c>> join http://example.com/ url-encode "abc def"
== http://example.com/abc%20def
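
The checks proposed above can be sketched in Python. This is a minimal illustration under stated assumptions, not Ren-C's implementation: `join_url` and `url_encode` are hypothetical names, and the error messages simply mirror the ones in the transcript.

```python
import re
from urllib.parse import quote

# A "%" that is not followed by exactly two hex digits is not a valid escape.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

def url_encode(text):
    # Returns a plain string, not a URL: every reserved character is escaped.
    return quote(text, safe="")

def join_url(base, part):
    # Validate at the single moment the URL is built, then concatenate.
    if _BAD_PERCENT.search(part):
        raise ValueError("Percent in URL must be hex encoded character bytes")
    if any(ch.isspace() for ch in part):
        raise ValueError("Space not legal in URL, use url_encode before joining")
    return base + part

print(join_url("http://example.com/", url_encode("abc def")))
# http://example.com/abc%20def
```

Calling `join_url("http://example.com/", "100%")` or `join_url("http://example.com/", "abc def")` raises `ValueError` in this sketch, matching the two rejected cases above.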

It implies that when you're building up a URL out of string components that are arbitrary text (not known-good characters for a browser-ready URL) and using URL-ENCODE on those bits, you might do more escaping than necessary. Hence you might need some kind of CANONIZE-URL to bring it in line with what Chrome does in the address bar. That probably should not be automatic.

Curiously, this URL shows with quotes in Chrome's address bar, but you get %22 when you copy it to the clipboard:

https://en.wikipedia.org/wiki/%22Heroes%22_(David_Bowie_album)

Lots to think about here.

hostilefork commented 5 years ago

@rgchris Actually, it looks like the copy-to-clipboard in Chrome escapes this as well:

https://en.wikipedia.org/wiki/Herg%C3%A9

I was previously under the impression it did not. That influences my thinking on this a bit, in light of the space issue.

(It actually seems to only do the escaping if you have the whole URL selected when you copy, not just part.)

Can you give an updated outline of your philosophy here?

The main thing, I guess, is that if URL! is going to be a generically useful type in the system with custom schemes, it seems a waste to force them all through the very ugly percent-encoding, which is very much an archaic, legacy-type thing.

But maybe it's still acceptable to say that the URL rule is that the only percents you can have are for the purposes of hex-byte-character encodings. That encoding is apparently not just in the URL encoding standard but for any URI.

Oldes commented 5 years ago

Regarding this:

rebol2>> reverse http://example.com
== moc.elpmaxe//:ptth

I think that result should be: #[url! "moc.elpmaxe//:ptth"]. The same goes for any URL without a valid scheme, for example:

>> to-url "foo"
== #[url! "foo"]

hostilefork commented 5 years ago

I think that result should be: #[url! "moc.elpmaxe//:ptth"]

I used to think along these lines, that escaping and generality of forms was important.

But there is Freedom To and Freedom From. "Freedom To" store arbitrary strings and flavor them as URL! is robbing you of your "Freedom From" being passed a URL that has no scheme and is not URL-like whatsoever. You effectively know nothing about its form. Also obviously you wind up with these not-very-appealing construction syntaxes.

I feel like the type could have more value by giving it a few more guarantees. If those guarantees don't suit you, then you can always convert to a string and work with that. And if you find yourself wanting to save URL!s in a file that aren't LOAD-able, you should be the one coming up with the notation for that...because it's going to be you who's responsible for building that valid URL later.

Of course this is new and so there's a lot of testing and figuring. The closest historically would be the rules on WORD! and how their immutability and creation-at-one-moment lets you impose rules on what letters are allowed. (That's another idea I've changed my feelings on, that we don't necessarily make the total world simpler by letting you have escaped forms.)

https://forum.rebol.info/t/any-word-and-any-string-the-limits-of-unification/1127

rgchris commented 5 years ago

Can you give an updated outline of your philosophy here?

If people are sharing the incorrect version (i.e. not correctly escaped) then it would be preferable to support it within reason.

Chrome (and Firefox) do indeed appear to escape when copying; Safari does not.

Also, you can put the unescaped version in a link and browsers will do the translation.

There's a bit of a human element to this too. If you're composing a URL in a text file, which is more natural to write?

https://en.wikipedia.org/wiki/Herg%C3%A9
(or)
https://en.wikipedia.org/wiki/Hergé
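
The two spellings are related by UTF-8 percent-encoding, which a short Python sketch can show. Python's `urllib.parse` is used here only for illustration; it is not what any Rebol interpreter or browser uses.

```python
from urllib.parse import quote, unquote

readable = "https://en.wikipedia.org/wiki/Hergé"

# Percent-encode non-ASCII characters as UTF-8 bytes; keep ":" and "/" literal.
escaped = quote(readable, safe=":/")
print(escaped)           # https://en.wikipedia.org/wiki/Herg%C3%A9

# Decoding the escapes restores the readable form.
print(unquote(escaped))  # https://en.wikipedia.org/wiki/Hergé
```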

rgchris commented 5 years ago

Fun—Github's markdown automatically escaped the linked URL to %C3%A9 : )

hostilefork commented 5 years ago

Also, you can put the unescaped version in a link and browsers will do the translation.

Apparently this is only technically legal since the RFCs related to HTML5.

...RFC 1738 has been superseded by RFC 3986 (URIs, Uniform Resource Identifiers) and RFC 3987 (IRIs, Internationalized Resource Identifiers), on which the WhatWG based its work to define how browsers should behave when they see an URL with non-ASCII characters in it since HTML5. It's therefore now safe to include non-ASCII characters in URLs, percent-encoded or not.

But in this context we have another problem: HTML escaping. If a URL contains quotes, then how to put it in quotes, etc. The example in the answer at the bottom of the above SO question shows some of the complexities:

Unescaped:

https://example.com/?user=test&password&te&st&goto=https://google.com

"Legit URL"

https://example.com/?user=test&password&te%26st&goto=https%3A%2F%2Fgoogle.com

The variation that is suitable to put into an <a href>

https://example.com/?user=test&amp;password&amp;te%26st&amp;goto=https%3A%2F%2Fgoogle.com

And Rebol can't LOAD the last one as a URL, because it has semicolons in it, so they are cut off as comments. :-/ So that's a good point on why you can't copy and paste an arbitrary href style URL out of a web page into a URL! value, or vice versa. But if we are thinking of truly saying a URL! ends at whitespace as the delimiter, then that would make it seem that semicolons should be picked up in the URL...just like a semicolon inside quotes is picked up in a string. {abc ; not a comment}
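
The two encoding layers in the example above can be reproduced in Python. This sketch only illustrates the layering; it makes no claim about how READ or LOAD should behave.

```python
import html
from urllib.parse import quote

# Layer 1: percent-encode individual components that contain reserved characters.
print(quote("te&st", safe=""))               # te%26st
print(quote("https://google.com", safe=""))  # https%3A%2F%2Fgoogle.com

# Layer 2: HTML-escape the whole URL before placing it in an href attribute.
legit = "https://example.com/?user=test&password&te%26st&goto=https%3A%2F%2Fgoogle.com"
href = html.escape(legit)
print(href)
# https://example.com/?user=test&amp;password&amp;te%26st&amp;goto=https%3A%2F%2Fgoogle.com

# A browser reverses layer 2 when it reads the attribute.
restored = html.unescape(href)
print(restored == legit)  # True
```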

If I had to pick my own moral-of-the-story, it would be that text is a terrible medium for building structured documents. A tree/graph data structure represented unambiguously via a binary format would be so...much...better!

@rgchris points out that Safari doesn't escape the URL when you copy/paste. Maybe that lines up with the evolution of the RFCs allowing non-ASCII characters in href: in the long run, browsers will aim to give you a readable link and keep the escaping under the hood. Wouldn't Rebol scripts want to show in your source what you see on the screen?

The alternative is that Rebol pushes back and becomes the biggest W3C stickler in the world, as a way to "sort out the mess". But my general feeling is like @rgchris's: that it is swimming upstream. The strict format is more likely to frustrate people trying to use the URL! type how they want to in their source and dialects than to be appreciated for its limitations.

Oldes commented 5 years ago

Rebol can handle the last URL... it is just that you cannot paste arbitrary text into the console and expect it to be loaded. When reading your post in email, the URL is also not recognized as a URL.

hostilefork commented 5 years ago

Rebol can handle the last url

For some definition of "handle"... the browser knows to turn the &amp; into an ampersand before sending, but only because it got the URL out of an href= field. It would be presumptuous for READ to assume it should do that (and every such "weird" automatic behavior creates vulnerabilities...)

In any case, it's not able to LOAD it (unless you use the #[url! "..."] form, assuming that's considered a good idea). And so that impacts the definition.

Oldes commented 5 years ago

I don't say that url! is ideal... but we are now off topic. I was hit by the mentioned issue:

>> http://foo?šiška
== http://foo?Å¡iÅ¡ka

That was fixed, and I have my script working as I wanted. The rest is unrelated. If you want to handle &amp; the way a browser does in its input field (or wherever), there should be some function that decodes it. (I think I had some of these in Rebol2 times... I just have had no need for one in R3 yet.)

metaeducation / rebol-issues

MOLD of URL containing unicode chars is invalid #2379