metaeducation / rebol-issues


Internal representation of URLs #2014

Open rebolbot opened 11 years ago

rebolbot commented 11 years ago

Submitted by: Ladislav

URLs are sequences of octets.

In R3, URLs are currently internally implemented as "sequences of characters", i.e., as a "special string type".

The LOAD function transforms the "external representation" to "internal representation", which, in R3, is a sequence of Unicode code points. Observations (see #2011 and #2013) suggest that only the code points in the 0 to 255 range are used by LOAD at present to "internally encode" any "externally represented URL".

Unfortunately, the "external to internal format" transformation is incorrect, as #2011 proves: it does not preserve the "encoded delimiters", transforming them into "unencoded delimiters". Thus, the LOAD transformation currently misrepresents (loses) external information, namely which delimiters were originally "encoded", i.e., meant as data.

In addition to misrepresenting the "external information", there is a second problem: in Rebol, URLs are mutable values, and Rebol mutating actions should be unrestricted, meaning that any modification should be allowed and should produce a valid internal representation.

CC - Data [ Version: r3 master Type: Bug Platform: All Category: Datatype Reproduce: Always Fixed-in:none ]

rebolbot commented 11 years ago

Submitted by: BrianH

URLs outside of Rebol are defined as a series of octets, but inside Rebol they don't necessarily need to be. Many schemes in Rebol could just be URL-looking syntax for something Rebol-specific, like an ODBC scheme. I think that it should be up to the scheme to handle any encoding to octets at runtime, if such a thing is necessary or possible for that particular scheme (for example it would be possible for HTTP, unnecessary and inappropriate for ODBC).

A bigger problem is how we will need to handle hex encoding, because, whether we use octets or codepoints, we need to put off decoding encoded characters until DECODE-URL has broken the URL into parts. We need to handle the URL syntax first, splitting into domain, user, password, path and fragment, before any hex decoding is done.
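A minimal sketch of why the order matters, using Python's `urllib.parse` as a stand-in rather than Rebol's own LOAD or DECODE-URL. The example URL is made up for illustration:

```python
from urllib.parse import urlsplit, unquote

# Hypothetical URL whose first path segment contains an encoded "/" meant as data.
url = "http://example.com/a%2Fb/c"

# Decode first, then split: the encoded "/" turns into an ordinary delimiter,
# so the segment boundaries can no longer be recovered.
decoded_first = urlsplit(unquote(url)).path.split("/")

# Split first, then decode each segment: the data "/" stays inside its segment.
split_first = [unquote(seg) for seg in urlsplit(url).path.split("/")]

print(decoded_first)  # ['', 'a', 'b', 'c']
print(split_first)    # ['', 'a/b', 'c']
```

Once the first form has been produced, no later step can tell that `a` and `b` were originally one segment.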

rebolbot commented 11 years ago

Submitted by: Ladislav

"I think that it should be up to the scheme to handle any encoding to octets at runtime, if such a thing is necessary or possible for that particular scheme (for example it would be possible for HTTP, unnecessary and inappropriate for ODBC)." - In this ticket I wanted to discuss other issues, feeling that the internal representation (format) must be chosen and used before any scheme-specific handling is employed. If it is chosen incorrectly, then correct scheme-specific handling may in fact become impossible, which is what is actually happening at present. Once the LOAD result loses necessary information, DECODE-URL cannot recover it any more.

rebolbot commented 11 years ago

Submitted by: BrianH

Agreed, that is definitely the problem we have here. Actually, we have two separate problems with the current internal representation: Unicode and escaping.

As mentioned in #482, #1986 and #2013, we need an internal representation that can handle Unicode characters in url! values. On a semantic model level, if we want to keep url! as a member of any-string! then it should be logically a series of char! values, which are Unicode codepoints in R3. On a practical level, we want to be able to use PARSE to process url! values, but we also want to be able to support schemes that can in one way or another handle Unicode, such as modern HTTP and ODBC.

The actual matter of how an individual scheme processes Unicode data is up to the scheme, but the internal model needs to be able to store the Unicode data for it to process. That means that whatever internal model we choose will need to be encoded from and decoded to Unicode characters when you are accessing the contents of the url!. It would be best if that decoding is done in the url! type actions themselves and just providing a char! API, hiding the actual internal physical representation so as to avoid data corruption.

Escaping is a separate matter, but the internal model does need to take it into account or else, like you say, data is lost before we get to DECODE-URL. Also, this escaping problem is actually scheme-independent, a side effect of basing the url! type more-or-less on the web-standard URL/URI model. The URL/URI data model makes a distinction between characters that are syntax and characters that are considered data. When there is a conflict, where you want to use syntax characters as data, they provide another way of encoding that doesn't conflict with the syntax: percent encoding (let's assume that I am talking about the extended percent encoding that includes Unicode support). This gives them two ways to represent the same character, one that is considered to be syntax, and one that is not. This distinction is good to emulate for the url! data model, even though url! is logically a series of codepoints (any-string!) rather than a series of octets. The external syntax is also good to emulate (see #1986 and #2013 for details), though we might want to also support most printable non-whitespace Unicode characters directly in url! syntax (see #482 and #2013 for details).

We definitely need an internal escaping method to make the distinction between syntax and data, because the any-string! model is just a series of characters, and the intended use of a character is not part of the model. That doesn't mean that we want to use the URI/URL percent escaping for our internal data model for url! values. As an internal model, percent encoding has several downsides:

rebolbot commented 11 years ago

Submitted by: BrianH

The one trick of all this is that while the need for a Unicode encoding and syntax character escaping model is scheme-independent, the actual encoding and escaping model generated within the scheme handlers themselves would be scheme-specific.

The syntax of the url! type that LOAD accepts and MOLD generates would have one escaping model, which would be scheme-independent. The internal data model could have another encoding method for Unicode, also scheme-independent as long as the url! actions returned decoded characters, and a way to mark escaped characters that is also scheme-independent.

The only scheme-dependent thing would be how the port scheme handlers translate the scheme-independent escaped data to the differently escaped data model that the implemented protocol requires. As long as the scheme can determine the difference between syntax and data using DECODE-URL, it can know what to translate to what it requires for its own external purposes.
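One hypothetical way to picture such a scheme-independent internal model is a series of codepoints, each carrying a flag that says whether it was escaped (data) or not (possible syntax). The sketch below is in Python, not Rebol. The names `load_url`, `mold_url` and `split_unescaped` are invented for illustration, and it assumes that runs of percent-encoded octets are valid UTF-8:

```python
import re
from urllib.parse import quote, unquote_to_bytes

def load_url(text):
    """External form -> list of (char, escaped) pairs (hypothetical model)."""
    chars = []
    # Match either a run of %XX triplets or a single literal character.
    for m in re.finditer(r"((?:%[0-9A-Fa-f]{2})+)|(.)", text, re.S):
        if m.group(1):
            # Assumption: the encoded octets form valid UTF-8.
            for ch in unquote_to_bytes(m.group(1)).decode("utf-8"):
                chars.append((ch, True))
        else:
            chars.append((m.group(2), False))
    return chars

def mold_url(chars):
    """Internal pairs -> external form, percent-encoding the escaped chars."""
    return "".join(quote(ch, safe="") if esc else ch for ch, esc in chars)

def split_unescaped(chars, delim):
    """Split on delim only where it is syntax, i.e. not marked as escaped."""
    parts, cur = [], []
    for ch, esc in chars:
        if ch == delim and not esc:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur))
    return parts

chars = load_url("a%2Fb/c")
print(split_unescaped(chars, "/"))  # ['a/b', 'c']
print(mold_url(chars))              # a%2Fb/c
```

A scheme handler built on a model like this could split on syntax characters first and then re-encode the data characters in whatever form its protocol requires.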

rebolbot commented 11 years ago

Submitted by: Ladislav

'wouldn't do any octet encoding at all, they would only be concerned with escaping (which is currently done as a "%" and two hex digits). ' - you are just using a terminology incompatible with the RFC. According to RFC 3986, the usage of '#"%" followed by two hex digits' is called "percent encoding", which is the term I am using to maintain consistency.

rebolbot commented 11 years ago

Submitted by: BrianH

There is no RFC for Rebol syntax or semantics. We are not discussing a limited RFC-compatible external definition of a URL, we are discussing Rebol's definition (which I will call url! to avoid further confusion). The scheme handlers that need to generate RFC3986 URLs will convert url! values to the appropriate octet streams. The scheme handlers that wrap Unicode APIs (like ODBC) won't need to do that kind of conversion, and shouldn't.

On the other hand, using RFC3986 as the inspiration for the url! type's external representation (its syntax in Rebol source) is not a bad idea, even though url! values would be Unicode. Regardless of how we decide to implement escape characters internally, we can still have MOLD generate percent encoding (including the UTF-8 sequence encoding for Unicode characters, don't know the RFC for that, but see #1986), and LOAD understand it. There are real advantages to internally encoding these escape characters using a different escape value, particularly one of the non-printing characters rather than "%", and some disadvantages, but that is a different issue from how the url! value is treated by LOAD and MOLD. TO-STRING is another issue, which would depend on the internal escaping method for syntax characters in the url! type.

rebolbot commented 11 years ago

Submitted by: BrianH

Advantage to using percent-encoding internally:

rebolbot commented 11 years ago

Submitted by: BrianH

"LOAD is accepting an "external representation", which is a string (Unicode, in case of R3)."

Um, no. Unicode is not a binary encoding, it is a series of codepoints which may be encoded in one of any number of binary encodings. LOAD only loads UTF-8 binary data. When you LOAD a string! value, it converts the string to UTF-8 binary before it loads it. Just a clarification, it doesn't matter for the purposes of this ticket.

"I.e., LOAD does not convert from UTF-8 to Unicode, but, funnily enough, from Unicode to something resembling UTF-8 (it isn't exact UTF-8 due to the decisions I am describing in this ticket)."

Well, for the url! type that is a problem. However, if LOAD actually converted to UTF-8 that wouldn't be a problem, as long as escape sequences were handled somehow internally, and as long as any characters in the url! were decoded from UTF-8 before they were returned. No individual bytes visible to Rebol code, all UTF-8 encoding handled by the internal code, a true internally-variable-length string type. PICK, POKE and INDEX? no longer O(1), sorry.

Making the url! type use the same internal encoding as string!, plus the internal escape sequence, would make PICK, POKE and INDEX? O(1) again. Unless we decided to implement url! and string! as UTF-8 or UTF-16, of course.
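To illustrate the cost difference, here is a rough Python sketch (not Rebol internals; `seq_len` and `pick_utf8` are made-up names) of picking the n-th codepoint out of UTF-8 bytes. Every earlier sequence has to be stepped over, so the lookup is O(n), whereas a fixed-width array of codepoints can be indexed directly:

```python
def seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")

def pick_utf8(data: bytes, index: int) -> str:
    """Return the codepoint at a 0-based index by scanning from the start."""
    pos = 0
    for _ in range(index):
        pos += seq_len(data[pos])  # skip one whole encoded codepoint
    return data[pos:pos + seq_len(data[pos])].decode("utf-8")

text = "a\u00e9\u20ac\U0001F600z"   # 1-, 2-, 3-, 4- and 1-byte sequences
data = text.encode("utf-8")
print(pick_utf8(data, 4))           # z
```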

rebolbot commented 11 years ago

Submitted by: BrianH

"and where our points differ" - After your comment edits, our points differ in a completely different way now :)

rebolbot commented 8 years ago

Submitted by: Ladislav

"... whatever internal model we choose will need to be encoded from and decoded to Unicode characters when you are accessing the contents of the url! It would be best if that decoding is done in the url! type actions themselves and just providing a char! API, hiding the actual internal physical representation so as to avoid data corruption." - hmm, this looks natural, but it probably isn't possible. Any Unicode code point we find in the externally represented URL may either be a real Unicode character, or it may be a sequence of octets that looks like a Unicode code point only by chance. While in the former case it can be handled on a per-codepoint basis, in the latter case it is necessary to handle it on a per-octet basis.
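A small sketch of that ambiguity, again using Python's `urllib.parse` as a stand-in rather than Rebol. The helper `as_utf8` is a made-up name, and the percent-encoded inputs are invented examples:

```python
from urllib.parse import unquote_to_bytes

def as_utf8(octets: bytes):
    """Return the decoded text, or None when the octets are not valid UTF-8."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None

two_octets = unquote_to_bytes("%C3%A9")  # b'\xc3\xa9'
one_octet = unquote_to_bytes("%E9")      # b'\xe9'

# The same two octets read as one codepoint under UTF-8,
# but as two separate characters under Latin-1.
print(as_utf8(two_octets))               # é
print(two_octets.decode("latin-1"))      # Ã©

# A lone 0xE9 octet is not valid UTF-8 at all, so it can only be kept as an octet.
print(as_utf8(one_octet))                # None
```

Nothing in the octets themselves says which reading the URL's author intended, which is why a per-codepoint API alone may not be enough.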