s-expressionists / wscl

Sources of the "Well Specified Common Lisp" specification which is based on the final draft of the Common Lisp standard but is not a new Common Lisp standard.
https://s-expressionists.github.io/wscl/

A package named "", ||:symbol #63

Open Gleefre opened 5 months ago

Gleefre commented 5 months ago

The standard doesn't explicitly prohibit a package named "", which means that it is allowed.

It is not clear, however, what ||:symbol means, nor whether an implementation may interpret :symbol as ||:symbol by defining "" as a global nickname for the "KEYWORD" package (as AllegroCL does).

A simple test (from https://plaster.tymoon.eu/view/4474):

(let ((pkg (make-package "" :use nil)))
  (export (intern "SYMBOL" pkg) pkg)
  (package-name (symbol-package (read-from-string "||:symbol"))))

; SBCL, CMUCL, CCL =>  ""
; ECL, CLASP, CLISP, MKCL, Corman Lisp, LispWorks, JSCL  =>  "KEYWORD"
; ACL >> A package named "" already exists
; ABCL >> java.lang.IndexOutOfBoundsException: fromIndex: 1 > toIndex: 0

Currently ACL interprets :symbol as ||:symbol, defining a global nickname "" for the keyword package to implement the special syntax for keywords.

SBCL, CMUCL and CCL interpret :symbol and ||:symbol differently, treating the second one as a symbol lookup in the package named "".

ECL, CLASP, CLISP, MKCL, Corman Lisp, LispWorks and JSCL interpret ||:symbol as :symbol, that is, as a keyword.

It is not clear what ABCL does, since it just crashes trying to read ||:symbol.
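For anyone who wants to rerun the test without leaving a package named "" behind, here is a minimal sketch. The function name probe-empty-package-prefix is made up for this comment, and returning :error for any signalled error is my own convention; an implementation that fails outside the condition system could still crash.

```lisp
(defun probe-empty-package-prefix ()
  "Return the name of the package that ||:symbol is read into,
or :ERROR if creating the package or reading the token signals an error.
Hypothetical helper, not part of any implementation."
  (let ((pkg nil))
    (unwind-protect
         (handler-case
             (progn
               ;; Same steps as the test above.
               (setf pkg (make-package "" :use nil))
               (export (intern "SYMBOL" pkg) pkg)
               (package-name
                (symbol-package
                 (with-standard-io-syntax
                   (read-from-string "||:symbol")))))
           (error () :error))
      ;; Clean up so the probe can be run repeatedly.
      (when pkg (delete-package pkg)))))
```

The result should be "" or "KEYWORD" on implementations that accept the package, matching the two groups listed above.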

Gleefre commented 4 months ago

From a discussion on #commonlisp IRC channel:

Gleefre commented 4 months ago

Also note that not defining such an extension breaks print-read consistency, since a symbol in a package named "" can no longer be printed readably.

Such a symbol is currently printed either as ||::foo and ||:bar, or as ::foo and :bar. Note that without such an extension, all of these can only be read as keywords. Note also that ::foo and ||::foo have unspecified consequences as per CLHS 2.3.5.

See this paste for an example: https://plaster.tymoon.eu/view/4480
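A quick way to check print-read consistency for a given symbol is sketched below. The name round-trips-p is hypothetical, and binding *print-readably* to NIL is my own choice so that the check reports a mismatch instead of signalling PRINT-NOT-READABLE.

```lisp
(defun round-trips-p (symbol)
  "Print SYMBOL with PRIN1 under standard I/O syntax and read it back.
Returns true if the very same symbol comes back; the printed text is
returned as a second value. Hypothetical helper."
  (with-standard-io-syntax
    (let* ((*print-readably* nil)
           (text (prin1-to-string symbol))
           ;; A fresh uninterned symbol stands in for a read error,
           ;; so an error can never compare EQ to SYMBOL.
           (back (handler-case (read-from-string text)
                   (error () (gensym)))))
      (values (eq symbol back) text))))

(round-trips-p 'car)   ; true for an ordinary symbol
(round-trips-p :foo)   ; true for a keyword
;; For a symbol interned in a package named "", the answer depends on
;; the implementation, which is the point of this issue.
```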

Gleefre commented 4 months ago

There is a similar problem with potential numbers (see CLHS 2.3.1.1.1).

The example of interest here is 5||, which is specifically said to be interpreted as a symbol. However, the reader algorithm implies that it should be read as a token "5", where the character #\5 has its usual syntactic qualities, and thus this token should be interpreted as a number.

Also, the following remark:

In each case, removing the escape character (or characters) would cause the token to be a potential number.

seems to imply that escape characters are actually part of the token, although it could just be a wording issue.

(The token in question is said to be a potential number, which means that a potential number is a token. It is also previously said that "a potential number cannot contain any escape characters", which implies that generally a token can contain escape characters, and thus escape characters are included in the token.)

P.S. The same can be said about ||., which reads as the symbol |.| and not as a consing dot. At least, all implementations I can test it on seem to agree on that; I wasn't able to find it specifically mentioned in the hyperspec.
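The behaviour for 5|| is easy to observe directly. This is only a small probe, with token-kind being a name I made up for it:

```lisp
(defun token-kind (text)
  "Read TEXT with the standard readtable and report the kind of object
that results. Hypothetical helper."
  (let ((object (with-standard-io-syntax (read-from-string text))))
    (typecase object
      (number :number)
      (symbol :symbol)
      (t :other))))

(token-kind "5")     ; :NUMBER
(token-kind "5||")   ; :SYMBOL, as the examples in CLHS 2.3.1.1.1 require
(token-kind "\\5")   ; :SYMBOL, here the digit itself is escaped
```

The last case is the one the justification in 2.3.1.1.1 actually covers; in 5|| no character of the token is escaped.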

informatimago commented 4 months ago

Nope. 5|| is explicitly rejected as syntax for potential numbers by https://www.lispworks.com/documentation/lw61/CLHS/Body/02_caaa.htm

This is to stay consistent with things like 5|x| or 5|2| which must both be interpreted as symbols.

Having an explicit rule overrides any default interpretation by the algorithm.

Gleefre commented 4 months ago

I don't claim that 5|| is undefined -- it is indeed specifically said to be read as a symbol. However, the reason the spec gives for not interpreting it as a number does not apply to 5||:

An escape character robs the following character of all syntactic qualities, forcing it to be strictly alphabetic[2] and therefore unsuitable for use in a potential number.

This is not true for 5||, as the only character in the token "5" is not an escaped character and thus keeps all of its syntactic qualities.

What I do claim is that the reader algorithm is therefore not correctly defined and needs to be changed.

Gleefre commented 4 months ago

A few passages, including the one that I have already cited earlier from CLHS 2.3.1.1.1, suggest that escape characters should be part of the token. Here's another passage that supports that (this time from CLHS 2.3.3):

If a token consists solely of dots (with no escape characters), then <...>

informatimago commented 4 months ago

A few passages, including the one that I have already cited earlier from CLHS 2.3.1.1.1, suggest that escape characters should be part of the token. Here's another passage that supports that (this time from CLHS 2.3.3):

If a token consists solely of dots (with no escape characters), then <...>

"As IF".

But indeed, tokens are not mere strings. A token must remember the character trait of each character. And indeed, to handle the 5|| rule, a token must also remember that it has occurrences of multiple-escapes even if they're empty, so it needs at least one flag in addition to the characters and traits. https://www.lispworks.com/documentation/lw61/CLHS/Body/02_adb.htm

Keeping the escapes themselves is not really useful and makes parsing the token more difficult, once the token type is determined from the traits.

For example, I rephrase 2.3.3 as "if a token consists only of dots with the character trait "dot", then ...".

... -> only #\. characters with the character trait dot. \... -> only #\. characters, but one with the character trait alphabetic (the escaped one).

And yes, the reader algorithm is not formally specified down to these details, because there are various implementation choices possible to implement things like 2.3.3 or 2.3.1.1.1.

In the context of wscl, the question is whether there is any ambiguity in the specification, not whether the specification allows various implementations (all behaving the same way).

So far I've not seen you demonstrate any ambiguity, i.e. strings that could be interpreted in different ways when processed by the lisp reader as specified (applying all the rules).

Gleefre commented 4 months ago

a token must also remember that it has occurrences of multiple-escapes

And that contradicts the hyperspec, as CLHS 2.2 Reader Algorithm explicitly specifies how tokens are constructed (the phrases used are "y is used to begin a token", "Y is appended to the token being built", etc.).

The very fact that one part of the spec (namely 2.3.1.1.1) contradicts another (2.2) already makes it a WSCL issue, since any "rephrasing" already creates ambiguity.

Note that the whole thing with numbers and the consing dot is only "supporting evidence", that is intended to provide context to the main issue -- interpretation of ||:xxxx and ||::xxxx.

I've not seen you demonstrate any ambiguity, i.e. strings that could be interpreted in different ways

By the way, the strings "||:xxxx" and "||::xxxx" are exactly what you are asking for. I believe I have already demonstrated that there are three different possible interpretations, based on different "rephrasings" of the contradicting parts of the specification.

Gleefre commented 4 months ago

Keeping the escapes themselves is not really useful and makes parsing the token more difficult, once the token type is determined from the traits.

Just in case, by saying that escape characters are part of the token, I don't necessarily mean that they must be put directly into the token being accumulated. It would be enough to, for example, keep track of escaped "intervals" of the form [a, b), with "empty" pairs of multiple escape characters appending an "empty interval" of the form [x, x). [ note: AFAICT this is what Eclector does ]
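To make the "escaped intervals" idea concrete, here is a minimal sketch of token accumulation that records them. It is a toy model, not Eclector's code and not the full CLHS 2.2 algorithm: it assumes the input string holds exactly one token, treats only \ and | as escape characters, and upcases unescaped characters. The name accumulate-token is hypothetical.

```lisp
(defun accumulate-token (text)
  "Toy model of token accumulation for TEXT, a string holding one token.
Returns two values: the token's characters, and a list of half-open
escaped intervals (START . END) indexing into those characters."
  (let ((chars (make-array 0 :element-type 'character
                             :adjustable t :fill-pointer t))
        (intervals '())
        (open nil)      ; start index of the open |...| span, or NIL
        (single nil))   ; true right after a backslash
    (loop for c across text
          do (cond (single
                    ;; Escaped by a backslash: kept verbatim. Inside |...|
                    ;; the enclosing span already covers it.
                    (unless open
                      (push (cons (length chars) (1+ (length chars)))
                            intervals))
                    (vector-push-extend c chars)
                    (setf single nil))
                   ((char= c #\\) (setf single t))
                   ((char= c #\|)
                    (if open
                        ;; Closing bar: record the span, even if empty.
                        (progn (push (cons open (length chars)) intervals)
                               (setf open nil))
                        (setf open (length chars))))
                   (open (vector-push-extend c chars))
                   (t (vector-push-extend (char-upcase c) chars))))
    (when (or open single)
      (error "Unterminated escape in ~S" text))
    (values (coerce chars 'simple-string) (nreverse intervals))))

(accumulate-token "5||")      ; => "5", ((1 . 1))
(accumulate-token ":xxxx")    ; => ":XXXX", NIL
(accumulate-token "||:xxxx")  ; => ":XXXX", ((0 . 0))
```

Under this model 5 and 5|| accumulate the same characters and differ only in the interval list, which is exactly the extra information that 2.3.1.1.1 relies on.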

For example, I rephrase 2.3.3 as "if a token consists only of dots with the character trait "dot", then ...".

This rephrasing would mean that ||. must be read as a consing dot, and not as a symbol. That is not the case in any of the implementations I tested it on: sbcl, cmucl, ccl, allegro cl, ecl, clasp, abcl, clisp, mkcl, lispworks, corman cl, jscl.
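The difference shows up in the shape of the list that is read. A small probe, with read-list-shape being a made-up name:

```lisp
(defun read-list-shape (text)
  "Read TEXT, which should hold one list, and say whether the result is
a dotted list or a proper list. Hypothetical helper."
  (let ((object (with-standard-io-syntax (read-from-string text))))
    (cond ((not (consp object)) :not-a-list)
          ;; A non-NIL CDR in the last cons means a consing dot was seen.
          ((cdr (last object)) :dotted-list)
          (t :proper-list))))

(read-list-shape "(a . b)")     ; :DOTTED-LIST
(read-list-shape "(a \\. b)")   ; :PROPER-LIST, the middle element is |.|
(read-list-shape "(a ||. b)")   ; :PROPER-LIST in the implementations above
```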

Gleefre commented 4 months ago

"As IF".

Well, there are many other examples. Here's what I have found so far (including previous ones for completeness):

CLHS 2.1.4 Character Syntax Types:

Constituent and escape characters are accumulated to make a token <...>

CLHS 2.3.1.1.1 Escape Characters and Potential Numbers:

A potential number cannot contain any escape characters.

<...> removing the escape character (or characters) would cause the token to be a potential number.

Note: potential number is a special kind of token (CLHS 2.3.1.1 Potential Numbers as Tokens):

A token is a potential number if <...>

CLHS 2.3.3 The Consing Dot:

If a token consists solely of dots (with no escape characters), then <...>

CLHS 2.4.1 Left-Parenthesis:

If a token that is just a dot not immediately preceded by an escape character is read after some object then

CLHS 2.4.8.4 Sharpsign Asterisk:

Neither a single escape nor a multiple escape is permitted in this token.