Internal representation of URLs

rebolbot commented 11 years ago

Submitted by: Ladislav

URLs are sequences of octets.

In R3, URLs are currently implemented internally as "sequences of characters", i.e., as a "special string type".

The LOAD function transforms the "external representation" to "internal representation", which, in R3, is a sequence of Unicode code points. Observations (see #2011 and #2013) suggest that only the code points in the 0 to 255 range are used by LOAD at present to "internally encode" any "externally represented URL".

Unfortunately, the "external to internal format" transformation is incorrect, as #2011 proves: it does not preserve "encoded delimiters" but transforms them to "unencoded delimiters". Thus, the LOAD transformation currently misrepresents (loses) external information, namely which delimiters were originally "encoded", i.e., meant as data.
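To make the loss concrete, here is a minimal Python sketch. It is only an illustration: `urllib.parse.unquote` stands in for an eager decoder and is not Rebol's LOAD, and `example.com` is a placeholder host.

```python
from urllib.parse import unquote

# Two externally different URLs: in the first, "%2F" is data inside one path
# segment; in the second, "/" is a syntax delimiter between two segments.
encoded_delimiter = "http://example.com/a%2Fb"
plain_delimiter = "http://example.com/a/b"

# Decoding everything up front maps both to the same internal value,
# so the information about which slash was data cannot be recovered later.
internal_1 = unquote(encoded_delimiter)
internal_2 = unquote(plain_delimiter)

print(internal_1 == internal_2)  # True
```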

In addition to misrepresenting the "external information", there is a second problem: in Rebol, URLs are mutable values, and Rebol's mutating actions should be unrestricted, meaning that any modification should be allowed and should produce a valid internal representation.

CC - Data [ Version: r3 master Type: Bug Platform: All Category: Datatype Reproduce: Always Fixed-in: none ]

rebolbot commented 11 years ago

Submitted by: BrianH

URLs outside of Rebol are defined as a series of octets, but inside Rebol they don't necessarily need to be. Many schemes in Rebol could just be URL-looking syntax for something Rebol-specific, like an ODBC scheme. I think that it should be up to the scheme to handle any encoding to octets at runtime, if such a thing is necessary or possible for that particular scheme (for example it would be possible for HTTP, unnecessary and inappropriate for ODBC).

A bigger problem is how we will need to handle hex encoding because, whether the elements are octets or codepoints, we need to put off decoding encoded characters until DECODE-URL has broken the URL into parts. We need to handle the URL syntax first, splitting into domain, user, password, path and fragment, before any hex decoding is done.
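The ordering can be sketched in Python as follows. This is an illustration only: `urlsplit` plays the role that DECODE-URL would play in Rebol, and the URL is a made-up example.

```python
from urllib.parse import urlsplit, unquote

url = "http://example.com/dir/a%2Fb?q=1%262"

# Split first: only unencoded delimiters act as syntax, then each part
# is decoded on its own.
parts = urlsplit(url)
segments = [unquote(s) for s in parts.path.split("/")[1:]]
print(segments)        # ['dir', 'a/b']

# Decode first: the encoded slash is promoted to a delimiter and the
# path gains a segment it never had.
early = urlsplit(unquote(url))
early_segments = early.path.split("/")[1:]
print(early_segments)  # ['dir', 'a', 'b']
```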

rebolbot commented 11 years ago

Submitted by: Ladislav

"I think that it should be up to the scheme to handle any encoding to octets at runtime, if such a thing is necessary or possible for that particular scheme (for example it would be possible for HTTP, unnecessary and inappropriate for ODBC)." - In this ticket I wanted to discuss other issues, feeling that the internal representation (format) must be chosen and used before any scheme-specific handling is employed. If it is chosen incorrectly, then correct scheme-specific handling may in fact become impossible (which is actually happening at present). Once the LOAD result loses necessary information, DECODE-URL cannot recover it any more.

rebolbot commented 11 years ago

Submitted by: BrianH

Agreed, that is definitely the problem we have here. Actually, we have two separate problems with the current internal representation: Unicode and escaping.

As mentioned in #482, #1986 and #2013, we need an internal representation that can handle Unicode characters in url! values. On a semantic model level, if we want to keep url! as a member of any-string! then it should be logically a series of char! values, which are Unicode codepoints in R3. On a practical level, we want to be able to use PARSE to process url! values, but we also want to be able to support schemes that can in one way or another handle Unicode, such as modern HTTP and ODBC.

The actual matter of how an individual scheme processes Unicode data is up to the scheme, but the internal model needs to be able to store the Unicode data for it to process. That means that whatever internal model we choose will need to be encoded from and decoded to Unicode characters when you are accessing the contents of the url!. It would be best if that decoding is done in the url! type actions themselves and just providing a char! API, hiding the actual internal physical representation so as to avoid data corruption.

Escaping is a separate matter, but the internal model does need to take it into account or else, like you say, data is lost before we get to DECODE-URL. Also, this escaping problem is actually scheme-independent, a side effect of basing the url! type more-or-less on the web-standard URL/URI model. The URL/URI data model makes a distinction between characters that are syntax and characters that are considered data. When there is a conflict, where you want to use syntax characters as data, they provide another way of encoding that doesn't conflict with the syntax: percent encoding (let's assume that I am talking about the extended percent encoding that includes Unicode support). This gives them two ways to represent the same character, one that is considered to be syntax, and one that is not. This distinction is good to emulate for the url! data model, even though url! is logically a series of codepoints (any-string!) rather than a series of octets. The external syntax is also good to emulate (see #1986 and #2013 for details), though we might want to also support most printable non-whitespace Unicode characters directly in url! syntax (see #482 and #2013 for details).

We definitely need an internal escaping method to make the distinction between syntax and data, because the any-string! model is just a series of characters, and the intended use of a character is not part of the model. That doesn't mean that we want to use the URI/URL percent escaping for our internal data model for url! values. As an internal model, percent encoding has several downsides:

R3 any-string! types are not made up of octets, they are made up of codepoints, so an octet-based encoding of escapes is too awkward for us to want to expose to users (through actions and path behavior, not syntax) because the octets don't correspond to the characters they collectively represent. See your example code in #2013 for the result of this.
The "%" used in percent encoding as an escape character is a common, printable character, which people would frequently want to use in a url! as data, particularly in passwords and URL-encoded query values. Requiring special treatment of such a common character is awkward.
Percent encoding uses at least two characters to encode the escaped character, and they're hex digits. If this encoding model is exposed to the user then escaping requires generating 3 characters, two of which require end-user math to calculate.
Unicode support in percent encoding requires converting to UTF-8 and then hex-encoding each byte. This multiplies the awkwardness and potential for data corruption.
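The size of the expansion described in the last two points is easy to see in a short Python sketch (Python's `urllib.parse.quote` is used here purely as a reference percent-encoder; it is not part of Rebol):

```python
from urllib.parse import quote

ch = "€"                    # one code point, U+20AC
utf8 = ch.encode("utf-8")   # three octets: E2 82 AC
encoded = quote(ch)         # "%E2%82%AC"

# One character in the logical model becomes nine characters of
# percent encoding, none of which corresponds to the character itself.
print(len(ch), len(utf8), len(encoded))  # 1 3 9
```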

One way to deal with these issues is to use a different method for escaping internally: Use a single normally-nonprinting escape character and don't hex-encode the character it is escaping, just put the escaped character there directly. Internally we have no reason to limit ourselves to printing characters, and use of these characters in data is rare. For that matter, ASCII has a character specifically intended to do this kind of escaping: #"^(1B)". We wouldn't need to escape any characters except the escape itself and syntax characters like "%", ":", "@", "/", "?" or "#" when they are used as non-syntax data.
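A minimal Python sketch of this ESC-based model, under stated assumptions: the helper names `escape_data` and `split_unescaped` are hypothetical, the syntax character set is the one listed above, and nothing here reflects an actual R3 implementation.

```python
ESC = "\x1b"              # the ASCII escape character, #"^(1B)" in Rebol
SYNTAX = set("%:@/?#")    # syntax characters named in the proposal

def escape_data(text: str) -> str:
    """Mark syntax characters (and ESC itself) in `text` as data."""
    return "".join(ESC + c if c in SYNTAX or c == ESC else c for c in text)

def split_unescaped(value: str, delimiter: str) -> list:
    """Split on `delimiter` only where it is not escaped; unescape the pieces."""
    pieces, current, i = [], [], 0
    while i < len(value):
        c = value[i]
        if c == ESC and i + 1 < len(value):
            current.append(value[i + 1])   # escaped character is plain data
            i += 2
        elif c == delimiter:
            pieces.append("".join(current))
            current = []
            i += 1
        else:
            current.append(c)
            i += 1
    pieces.append("".join(current))
    return pieces

# "a/b" is meant as a single path segment containing a slash.
path = "dir/" + escape_data("a/b")
print(split_unescaped(path, "/"))  # ['dir', 'a/b']
```

Each escaped character costs exactly one extra element, and the escaped character itself is stored undecoded, so no hex arithmetic or UTF-8 conversion is involved.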

We could even expose the escape character in the user data model so the PARSE rules in DECODE-URL can react to it - since no hex or UTF-8 encoding is involved, the escape character can just be considered internal markup. For that matter, since INSERT, APPEND and CHANGE have /only options that aren't otherwise being used for string types, we can even have their default behavior without /only generate the escape characters directly if we like, or insert pre-escaped data with /only. We could even have a function that takes percent-encoded data and returns escaped data (I mean other than LOAD, since it could work on url! fragments, not just full url! syntax), and another that does the opposite.
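The pair of conversion functions mentioned at the end could look roughly like this in Python. The names `percent_to_esc` and `esc_to_percent` are hypothetical, and the sketch assumes that percent-encoded runs are valid UTF-8 and that the input contains no raw ESC character.

```python
import re
from urllib.parse import quote, unquote_to_bytes

ESC = "\x1b"
SYNTAX = set("%:@/?#")

def percent_to_esc(fragment: str) -> str:
    """Convert percent-encoded text to the ESC-escaped internal form."""
    out, pos = [], 0
    # Each match is a run of consecutive %XX triplets.
    for m in re.finditer(r"(?:%[0-9A-Fa-f]{2})+", fragment):
        out.append(fragment[pos:m.start()])   # unencoded text is kept as-is
        decoded = unquote_to_bytes(m.group()).decode("utf-8")
        out.append("".join(ESC + c if c in SYNTAX or c == ESC else c
                           for c in decoded))
        pos = m.end()
    out.append(fragment[pos:])
    return "".join(out)

def esc_to_percent(value: str) -> str:
    """Convert the ESC-escaped internal form back to percent encoding."""
    out, i = [], 0
    while i < len(value):
        c = value[i]
        if c == ESC and i + 1 < len(value):
            out.append(quote(value[i + 1], safe=""))  # escaped char becomes %XX
            i += 2
        else:
            # Non-ASCII characters are re-encoded as UTF-8 triplets.
            out.append(quote(c, safe="") if ord(c) > 127 else c)
            i += 1
    return "".join(out)

internal = percent_to_esc("a%2Fb/c%C3%A9")
print(esc_to_percent(internal))  # a%2Fb/c%C3%A9
```

In this sketch an unnecessarily encoded character such as `%41` and a plain `A` produce the same internal value, while `%2F` and `/` stay distinct.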

This would hide the difficulties of percent encoding from users that are generating url! values dynamically rather than writing them out in syntax, and it would maintain equivalence between characters that were unnecessarily percent-encoded and those characters not encoded, which particularly comes in handy for Unicode data.

If we decided to use percent encoding internally as the escaping method instead of the ESC character proposal, we should still have LOAD, INSERT, APPEND and CHANGE decode the percent encoding for non-syntax characters, including Unicode, with the same pre-encoded-data /only treatment. MOLD could regenerate that percent encoding for Unicode as I mentioned in #2013, but internally all unnecessary percent encoding should be decoded. We don't want users to have to process percent encoding when they don't need to, and don't want to ever require them to do UTF-8 encoding and decoding themselves.

rebolbot commented 11 years ago

Submitted by: BrianH

The one trick of all this is that while the need for a Unicode encoding and syntax character escaping model is scheme-independent, the actual encoding and escaping model generated within the scheme handlers themselves would be scheme-specific.

The syntax of the url! type that LOAD accepts and MOLD generates would have one escaping model, which would be scheme-independent. The internal data model could have another encoding method for Unicode, also scheme-independent as long as the url! actions returned decoded characters, and a way to mark escaped characters that is also scheme-independent.

The only scheme-dependent thing would be how the port scheme handlers translate the scheme-independent escaped data to the differently escaped data model that the implemented protocol requires. As long as the scheme can determine the difference between syntax and data using DECODE-URL, it can know what to translate to what it requires for its own external purposes.

rebolbot commented 11 years ago

Submitted by: Ladislav

'wouldn't do any octet encoding at all, they would only be concerned with escaping (which is currently done as a "%" and two hex digits). ' - you are just using a terminology incompatible with the RFC. According to RFC 3986, the usage of '#"%" followed by two hex digits' is called "percent encoding", and I am following that terminology to maintain consistency.

rebolbot commented 11 years ago

Submitted by: BrianH

"you are just using a terminology incompatible with the RFC"

There is no RFC for Rebol syntax or semantics. We are not discussing a limited RFC-compatible external definition of a URL, we are discussing Rebol's definition (which I will call url! to avoid further confusion). The scheme handlers that need to generate RFC3986 URLs will convert url! values to the appropriate octet streams. The scheme handlers that wrap Unicode APIs (like ODBC) won't need to do that kind of conversion, and shouldn't.

On the other hand, using RFC3986 as the inspiration for the url! type's external representation (its syntax in Rebol source) is not a bad idea, even though url! values would be Unicode. Regardless of how we decide to implement escape characters internally, we can still have MOLD generate percent encoding (including the UTF-8 sequence encoding for Unicode characters, don't know the RFC for that, but see #1986), and LOAD understand it. There are real advantages to internally encoding these escape characters using a different escape value, particularly one of the non-printing characters rather than "%", and some disadvantages, but that is a different issue from how the url! value is treated by LOAD and MOLD. TO-STRING is another issue, which would depend on the internal escaping method for syntax characters in the url! type.

rebolbot commented 11 years ago

Submitted by: BrianH

Advantages of using percent-encoding internally:

Bug-for-bug compatibility with R2.
The syntax resembles the internal data, for better or worse (hint: worse).
We can use an 8-bit-element series to encode a url! internally.

rebolbot commented 11 years ago

Submitted by: BrianH

"LOAD is accepting an "external representation", which is a string (Unicode, in case of R3)."

Um, no. Unicode is not a binary encoding, it is a series of codepoints which may be encoded in one of any number of binary encodings. LOAD only loads UTF-8 binary data. When you LOAD a string! value, it converts the string to UTF-8 binary before it loads it. Just a clarification, it doesn't matter for the purposes of this ticket.

"I.e., LOAD does not convert from UTF-8 to Unicode, but, funnily enough, from Unicode to something resembling UTF-8 (it isn't exact UTF-8 due to the decisions I am describing in this ticket)."

Well, for the url! type that is a problem. However, if LOAD actually converted to UTF-8 that wouldn't be a problem, as long as escape sequences were handled somehow internally, and as long as any characters in the url! were decoded from UTF-8 before they were returned. No individual bytes visible to Rebol code, all UTF-8 encoding handled by the internal code, a true internally-variable-length string type. PICK, POKE and INDEX? no longer O(1), sorry.

Making the url! type use the same internal encoding as string!, plus the internal escape sequence, would make PICK, POKE and INDEX? O(1) again, unless of course we decided to implement url! and string! as UTF-8 or UTF-16.
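Why a variable-length internal encoding costs O(1) indexing can be shown with a short Python sketch. The function `pick_utf8` is a hypothetical illustration, not R3's PICK, and it uses zero-based indexes whereas Rebol's PICK is one-based.

```python
def pick_utf8(data: bytes, index: int) -> str:
    """Return the code point at zero-based `index` in UTF-8 `data`.

    The byte offset of a code point is unknown in advance, so the bytes
    must be scanned from the start: O(n) instead of O(1).
    """
    if index < 0:
        raise IndexError(index)
    count, start = -1, None
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:      # not a continuation byte: a code point starts
            count += 1
            if count == index:
                start = offset
            elif count == index + 1:
                return data[start:offset].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")

data = "a€b".encode("utf-8")   # 5 bytes for 3 code points
print(pick_utf8(data, 1))      # €
```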

rebolbot commented 11 years ago

Submitted by: BrianH

"and where our points differ" - After your comment edits, our points differ in a completely different way now :)

rebolbot commented 8 years ago

Submitted by: Ladislav

"... whatever internal model we choose will need to be encoded from and decoded to Unicode characters when you are accessing the contents of the url!. It would be best if that decoding is done in the url! type actions themselves and just providing a char! API, hiding the actual internal physical representation so as to avoid data corruption." - hmm, this looks natural, but it probably isn't possible. Any Unicode code point we find in the externally represented URL may either be a real Unicode character, or it may be a sequence of octets that looks like a Unicode code point only by chance. While in the former case it can be handled on a per-codepoint basis, in the latter case it is necessary to handle it on a per-octet basis.
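The ambiguity can be illustrated with a small Python sketch (an illustration only; Latin-1 is picked here as one example of a non-UTF-8 interpretation of the same octets):

```python
from urllib.parse import unquote_to_bytes

# The same two octets are one code point under UTF-8 but two under Latin-1.
octets = unquote_to_bytes("%C3%A9")
as_utf8 = octets.decode("utf-8")       # 'é'
as_latin1 = octets.decode("latin-1")   # 'Ã©'
print(len(as_utf8), len(as_latin1))    # 1 2

# A lone %E9 is a valid Latin-1 'é' but is not valid UTF-8 at all, so it
# can only be handled octet by octet.
lone = unquote_to_bytes("%E9")
try:
    lone.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)                      # False
```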

metaeducation / rebol-issues

Internal representation of URLs #2014