Open Gleefre opened 5 months ago
From a discussion on the #commonlisp IRC channel:
Defining a new global nickname for the `#:KEYWORD` package, as Allegro CL does, is not standard-compliant.
See CLHS 11.1.2 Figure 11-2 "Standardized Package Names". This figure lists all the nicknames of the `#:KEYWORD` package: none.
The reader algorithm (CLHS 2.2) seems to imply that multiple escape characters (namely `#\|`) are resolved before interpreting a token. That suggests that `||:xxxx` is going to form a token `:XXXX`, which then should be interpreted as a keyword as per CLHS 2.3.5.
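To make this reading of CLHS 2.2 concrete, here is a toy model in Python of how a token would be accumulated if escape characters are consumed and leave no trace. The helper name `accumulate` is hypothetical, and the model only covers constituents, the single escape `\` and the multiple escape `|`, with unescaped characters upcased as under the standard readtable case. It is a sketch of one interpretation, not a description of any implementation.

```python
def accumulate(text):
    """Toy model of token accumulation (hypothetical helper, not from the CLHS).

    Assumptions: the input is a single token, with no whitespace or macro
    characters; unescaped constituents are upcased; escape characters
    themselves are dropped and nothing about them is remembered.
    """
    token = []
    in_multiple_escape = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Single escape: take the next character literally.
            i += 1
            if i >= len(text):
                raise ValueError("end of input after single escape")
            token.append(text[i])
        elif ch == "|":
            # Multiple escape: toggle the escaped state, append nothing.
            in_multiple_escape = not in_multiple_escape
        elif in_multiple_escape:
            token.append(ch)
        else:
            token.append(ch.upper())
        i += 1
    if in_multiple_escape:
        raise ValueError("end of input inside multiple escape")
    return "".join(token)


# Under this model the two spellings are indistinguishable after accumulation.
print(accumulate("||:xxxx") == accumulate(":xxxx"))
```

In this model the empty `||` pair contributes nothing to the token, which is why any distinction between `||:xxxx` and `:xxxx` would have to come from information kept outside the accumulated characters.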
That being said, it would be somewhat useful to define `||:xxxx` as a special syntax for a symbol in a package named by an empty string, since there is no other way to read in such a symbol (without using `#.` syntax). More precisely, if a multiple escape character was used before the first package marker in a token, it should not be interpreted as a keyword symbol, but as a symbol in a package, possibly named by an empty string.
See this commit in SBCL's repo which added that feature. See also this comment in CCL's codebase. As we can see, historically MCL had this feature as well. Also note that CMUCL accidentally included this feature when backporting the float reader from SBCL, as can be seen in this commit.
Also note that not defining such an extension leads to losing print-read consistency, since a symbol in a package named `""` can no longer be printed readably.
Such a symbol is currently printed either as `||::foo` and `||:bar`, or as `::foo` and `:bar`. Note that without such an extension, all of these can only be read as a keyword. Note also that `::foo` and `||::foo` have unspecified consequences as per CLHS 2.3.5.
See this paste for an example: https://plaster.tymoon.eu/view/4480
There is a similar problem with potential numbers (see CLHS 2.3.1.1.1).
The example of interest here is `5||`, which is specifically said to be interpreted as a symbol. However, the reader algorithm implies that it should be read as a token `"5"`, where the character `#\5` has its usual syntactic qualities, and thus this token should be interpreted as a number.
Also, the following remark:

> In each case, removing the escape character (or characters) would cause the token to be a potential number.

seems to imply that escape characters are actually part of the token, although it could just be a wording issue.
(The token in question is said to be a potential number, which means that a potential number is a token. It is also previously said that "a potential number cannot contain any escape characters", which implies that generally a token can contain escape characters, and thus escape characters are included in the token.)
P.S. The same can be said about `||.`, which reads as a symbol `|.|` and not a consing dot.
Or at least all implementations I can test it on seem to agree on that -- I wasn't able to find it being specifically mentioned in the hyperspec.
Nope. 5|| is explicitly rejected as syntax for potential numbers by https://www.lispworks.com/documentation/lw61/CLHS/Body/02_caaa.htm
This is to stay consistent with things like 5|x| or 5|2| which must both be interpreted as symbols.
Having an explicit rule overrides any default interpretation by the algorithm.
I don't claim that `5||` is undefined -- it is indeed specifically said to be read as a symbol. However, the reason the spec gives for not interpreting it as a number does not hold for `5||`:

> An escape character robs the following character of all syntactic qualities, forcing it to be strictly alphabetic[2] and therefore unsuitable for use in a potential number.

This is not true for `5||`, as the only character in the token `"5"` is not an escaped character and thus keeps all of its syntactic qualities.
What I do claim is that this means the reader algorithm is not correctly defined and thus needs to be changed.
A few passages, including the one that I have already cited earlier from CLHS 2.3.1.1.1, suggest that escape characters should be part of the token. Here's another passage that supports that (this time from CLHS 2.3.3):

> If a token consists solely of dots (with no escape characters), then <...>
> A few passages, including the one that I have already cited earlier from CLHS 2.3.1.1.1, suggest that escape characters should be part of the token. Here's another passage that supports that (this time from CLHS 2.3.3):
>
> > If a token consists solely of dots (with no escape characters), then <...>

"As IF".
But indeed, tokens are not mere strings. A token must remember the character trait of each character. And indeed, to handle the 5|| rule, a token must also remember that it has occurrences of multiple-escapes even if they're empty, so it needs at least one flag in addition to the characters and traits. https://www.lispworks.com/documentation/lw61/CLHS/Body/02_adb.htm
Keeping the escapes themselves is not really useful and makes parsing the token more difficult, once the token type is determined from the traits.
For example, I rephrase 2.3.3 as "if a token consists only of dots with the character trait "dot", then ...".
... -> only #. characters with the character trait dot. ... -> only #. characters, but one with the character trait alphabetic (the escaped one).
And yes, the reader algorithm is not formally specified down to these details, because there are various implementation choices possible to implement things like 2.3.3 or 2.3.1.1.1.
In the context of WSCL, the question is whether there is any ambiguity in the specification, not whether the specification allows various implementations (all behaving the same way).
So far I've not seen you demonstrate any ambiguity, i.e. strings that could be interpreted in different ways when processed by the Lisp reader as specified (applying all the rules).
> a token must also remember that it has occurrences of multiple-escapes

And that contradicts the hyperspec, as CLHS 2.2 Reader Algorithm explicitly specifies how tokens are constructed (the phrases used are "y is used to begin a token", "Y is appended to the token being built", etc.).
The very fact that one part of the spec (namely 2.3.1.1.1) contradicts the other one (2.2) already makes it a WSCL issue; since any "rephrasing" already creates ambiguity.
Note that the whole thing with numbers and the consing dot is only "supporting evidence" that is intended to provide context to the main issue -- interpretation of `||:xxxx` and `||::xxxx`.
> I've not seen you demonstrate any ambiguity, i.e. strings that could be interpreted in different ways

By the way, the strings `"||:xxxx"` and `"||::xxxx"` are exactly what you are asking for. I believe I have already demonstrated that there are three different possible interpretations, based on different "rephrasings" of the contradicting parts of the specification.
> Keeping the escapes themselves is not really useful and makes parsing the token more difficult, once the token type is determined from the traits.

Just in case, by saying that escape characters are part of the token, I don't necessarily mean that they must be put directly into the token being accumulated. It would be enough to, for example, keep track of escaped "intervals" of the form [a, b), with "empty" pairs of multiple escape characters appending an "empty interval" of the form [x, x). [ note: AFAICT this is what Eclector does ]
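The interval idea can be sketched as a small Python model. Everything here is hypothetical: the function names `accumulate_with_intervals` and `escape_precedes_first_marker` are made up for illustration, the model handles only constituents, `\` and `|`, and it is not taken from Eclector or any other reader. It records half-open index ranges of escaped characters next to the accumulated token, so that an empty `||` pair leaves an empty interval behind.

```python
def accumulate_with_intervals(text):
    """Return (token, intervals); intervals are half-open [start, end)
    index ranges of escaped characters. Hypothetical sketch only."""
    chars = []
    intervals = []
    open_start = None  # start index of the multiple escape currently open
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 1
            if i >= len(text):
                raise ValueError("end of input after single escape")
            if open_start is None:
                # A single escape outside |...| covers exactly one character.
                intervals.append((len(chars), len(chars) + 1))
            chars.append(text[i])
        elif ch == "|":
            if open_start is None:
                open_start = len(chars)
            else:
                # Closing bar: an empty pair yields an empty interval [x, x).
                intervals.append((open_start, len(chars)))
                open_start = None
        elif open_start is not None:
            chars.append(ch)
        else:
            chars.append(ch.upper())
        i += 1
    if open_start is not None:
        raise ValueError("end of input inside multiple escape")
    return "".join(chars), intervals


def escape_precedes_first_marker(token, intervals):
    """True if some escape interval starts at or before the first
    unescaped package marker in the token."""
    def escaped(k):
        return any(start <= k < end for start, end in intervals)

    marker = next(
        (k for k, ch in enumerate(token) if ch == ":" and not escaped(k)),
        None,
    )
    if marker is None:
        return False
    return any(start <= marker for start, _end in intervals)


for source in ("||:xxxx", ":xxxx", "5||"):
    token, intervals = accumulate_with_intervals(source)
    print(source, token, intervals, escape_precedes_first_marker(token, intervals))
```

With this extra bookkeeping, `||:xxxx` and `:xxxx` still accumulate the same characters but differ in their interval lists, which is the kind of information the proposed extension would need.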
> For example, I rephrase 2.3.3 as "if a token consists only of dots with the character trait "dot", then ...".

This rephrasing would mean that `||.` must be read as a consing dot, and not as a symbol. None of the implementations I tested it on behave this way -- sbcl, cmucl, ccl, allegro cl, ecl, clasp, abcl, clisp, mkcl, lispworks, corman cl, jscl.
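The difference between the two readings of 2.3.3 can be shown with a toy Python model. The names `read_token`, `is_dots_by_traits` and `is_dots_with_flag` are hypothetical, and the model only distinguishes escaped from unescaped characters; it is a sketch of the argument, not of any implementation.

```python
def read_token(text):
    """Return (pairs, saw_escape): pairs are (char, escaped) tuples and
    saw_escape records whether any escape character occurred at all.
    Hypothetical sketch handling only constituents, \\ and |."""
    pairs = []
    saw_escape = False
    in_multiple_escape = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            saw_escape = True
            i += 1
            if i >= len(text):
                raise ValueError("end of input after single escape")
            pairs.append((text[i], True))
        elif ch == "|":
            saw_escape = True
            in_multiple_escape = not in_multiple_escape
        else:
            pairs.append((ch, in_multiple_escape))
        i += 1
    return pairs, saw_escape


def is_dots_by_traits(pairs):
    # Reading 1: only per-character traits matter.
    return bool(pairs) and all(ch == "." and not esc for ch, esc in pairs)


def is_dots_with_flag(pairs, saw_escape):
    # Reading 2: the token also must not have contained any escape characters.
    return is_dots_by_traits(pairs) and not saw_escape


for source in (".", "||.", "\\."):
    pairs, saw_escape = read_token(source)
    print(source, is_dots_by_traits(pairs), is_dots_with_flag(pairs, saw_escape))
```

Under the traits-only reading `||.` counts as a token consisting solely of dots; only the reading that also remembers the empty escape pair matches the symbol `|.|` that the tested implementations produce.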
"As IF".
Well, there are many other examples. Here's what I have found so far (including previous ones for completeness):

CLHS 2.1.4 Character Syntax Types:

> Constituent and escape characters are accumulated to make a token <...>

CLHS 2.3.1.1.1 Escape Characters and Potential Numbers:

> A potential number cannot contain any escape characters.

> <...> removing the escape character (or characters) would cause the token to be a potential number.

Note: potential number is a special kind of token (CLHS 2.3.1.1 Potential Numbers as Tokens):

> A token is a potential number if <...>

CLHS 2.3.3 The Consing Dot:

> If a token consists solely of dots (with no escape characters), then <...>

CLHS 2.4.1 Left-Parenthesis:

> If a token that is just a dot not immediately preceded by an escape character is read after some object then

CLHS 2.4.8.4 Sharpsign Asterisk:

> Neither a single escape nor a multiple escape is permitted in this token.
The standard doesn't explicitly prohibit a package named `""`, which means that it is allowed.
It is not clear, however, what `||:symbol` means, nor whether `:symbol` may be interpreted as `||:symbol` by defining `""` as a global nickname for the `"KEYWORD"` package (as Allegro CL does).
A simple test (from https://plaster.tymoon.eu/view/4474):
Currently ACL interprets `:symbol` as `||:symbol`, defining a global nickname `""` for the keyword package to implement the special syntax for keywords.
SBCL, CMUCL and CCL interpret `:symbol` and `||:symbol` differently, treating the second one as a symbol lookup in the package named `""`.
ECL, CLASP, CLISP, MKCL, Corman Lisp, LispWorks, JSCL interpret `||:symbol` as `:symbol` - as a keyword.
It is not clear what ABCL does, since it just crashes trying to read `||:symbol`.