:toc: left
S-expressions, short for symbolic expressions, are the syntactic backbone of the Lisp programming language family. While each dialect uses a slightly different variant for its native syntax, there is broad agreement on the fundamentals. This document gives a precise specification and a rationale for a syntax that is close to a least common denominator.
Such a pidgin syntax is appropriate for the interchange of data files across language boundaries. It is also fit for code written in some of the kind of domain-specific languages that Lisp handles well.
The proposed syntax is simple to implement in full. A reader is about 200 lines of code in a typical high-level language. The authors hope this will encourage the use of S-expressions beyond the Lisp community.
S-expressions first appeared in the paper that introduced the Lisp programming language, Recursive Functions of Symbolic Expressions and Their Computation by Machine (McCarthy 1960). The expressions in that paper were made out of symbols and nested lists.
This seminal form of S-expressions has since been extended with syntax for additional data types. Among them are numbers, strings, boolean values, sets, and maps (also known as hash tables or dictionaries). The ability to write comments has also been added.
In designing a portable S-expression syntax, the first decision is which data types to support.
Lisp would lose its essence without the lists and symbols from McCarthy's paper. Any variant of S-expressions needs them.
Numbers and strings are essential for practical use.
Booleans are essential, but there is little agreement on how to represent them.
Beyond lists, other collection types such as vectors, sets, maps, and user-defined record types, are omitted.
Likewise, we omit explicit support for specialized data such as timestamps, units of measure, and file system pathnames. Values of these types are subject to so many idiosyncratic and changing rules that it's difficult to strike a balance between being comprehensive and being useful.
In everyday usage, the English word list means any sequence of elements. Programmers define lists with mathematical precision: zero-element, one-element, and two-element lists still count as lists.
The custom in Lisp is to implement a list as a singly linked list. This custom carries over to most functional programming languages. These languages use the word vector for sequences (implemented using arrays or trees) supporting faster than O(n) random access, and reserve the word list for O(n) linked lists.
A linked list is a sequence of pairs where the first part of each pair contains a list element, and the second part points to the next pair. (In the last pair of the list, the second part is nil.) In statically typed functional languages this rule is enforced by the type system. In Lisp the rule is merely a convention, and each part of a pair can actually point to any type of object. That relaxation of the rule gives rise to the following special cases:
It's possible to construct a list where the second part of the last pair is neither a pair nor nil. In Lisp, we call such lists dotted lists, and the last (possibly the only) pair is a dotted pair.
When pairs are mutable, it's possible to construct a never-ending circular list by making the second part of the last pair point back to an earlier pair in the list. Usually the last pair points back to the first pair, but it's possible for a list to start with a non-circular prefix which leads to a circular sublist.
Multiple lists can share sublists, and trees build out of pairs can share subtrees. Multiple pairs can point to the same pair in their second part.
A notation supporting these special cases needs two notational aids:
A consing dot that can be put before the final element of the list to make a dotted pair.
A syntax for labels and references to describe shared structure.
We choose not to include either of those in POSE. Most languages
outside of functional programming use vectors for their most idiomatic
list-like data type, and even functional programming langauges forbid
dotted pairs. We can also envision a future Lisp dialect where vectors
(instead of lists) are the basic sequence type. In the interest of
easier portability to all of these languages, it is better to avoid
the special cases. A proper list like (1 2 3)
is also easy for
non-lispers to read and write without special instruction.
One more restriction that statically typed languages have is that list elements all need to be of the same type. Since this is a semantic concern, not a syntax concern, we permit heterogeneous lists like Lisp and other dynamically languages do. In statically typed languages, a parser can be implemented by using a union type to cover the range of data types that can appear in S-expressions.
Lisp dialects have traditionally supported an especially large variety of numbers, made the usage convenient, and specified numerical results quite precisely. In Lisp tradition, a full numeric tower is:
Not all programming languages have complex numbers, and fewer still have ratios, making them problematic in a portable notation. Hence we stick to integers and floating point numbers in POSE.
Lisp dialects tend to transparently switch numbers between machine-word size fixnums to larger bignums as required. There is no special syntax for bignums. Clojure breaks with Lisp tradition here by having a syntactic suffix for bignums. POSE sticks to tradition, and does not distinguish bignums and fixnums.
Clojure adds another peculiarity to Lisp's traditionally clean number syntax: a leading zero causes a number to be interpreted in octal (base 8). This feature is borrowed from the C family, is a common source of confusion, and we argue that it is best avoided. However, it does s place a prerogative on the designer of a portable syntax to forbid leading zeros in order to avoid the ambiguity.
Some Lisp implementations permit putting underscores between groups of digits for readability. This is convenient but not yet common, so we omit this feature from POSE.
Symbols and strings are two very similar data types that Lisp has long
held separate. The name of a symbol is a string, but the symbol itself
is not a string. The name has to be deliberately extracted in order to
be handled as a string. In Common Lisp this is done by symbol-name
,
in Scheme by symbol\->string
.
Lisp started out with only symbols due to its origins in abstract computer science. Strings were added later. Most languages do not have a standard mapping from source code to the data structures in the language, and do not need a standard symbol type for that reason. All Lisp programs have a standard mapping to Lisp data, and symbols are the data type that corresponds to identifiers in Lisp programs.
In Lisp, strings are used for most user-defined data. Lispers continually entertain the thought of merging symbols and strings into one data type but it won't work out. Both types are needed.
Many Lisp dialects also have a character data type that is disjoint
from the string and integer types. In Common Lisp and Scheme this is
written as #\a #\A #\space
. Since dialects are not unanimous in
having a character type, and Unicode makes the concept of a character
somewhat dubious, we omit the syntax from POSE.
For a long time, there has been a debate over whether Lisp symbols should be case-sensitive or case-insensitive. (Strings are always case sensitive for obvious reasons; there is no debate in the community about them.) Lisp is old enough that symbols used to be written in uppercase. When the Common Lisp standard came around, it dictated that lowercase or mixed-case symbols shall be normalized to uppercase equivalents when read in, with a reader option to change this behavior. Scheme has not traditionally dictated symbol case, but lowercase is the default starting with R^6^RS (2007). Almost all Scheme implementations are now case sensitive by default, and use lowercase.
There is a clear long-term trend among programming languages that
case sensitivity is winning, and more and more Lisp dialects and
implementations are following suit. However,
since most data is sent using lower case symbols anyway,
and a case-insensitive recipient will throw away
the difference between foo
, Foo
, and FOO
if the native read
procedure is used, POSE
restricts letters in symbols to ASCII lower case only.
Common Lisp puts symbols in packages; a symbol has two parts: the
package name and the symbol name. These are separated by a colon as in
package:symbol
. If the colon is missing, the symbol is interned in
the current package. If the colon is present but the package name is
blank, the symbol is interned in the KEYWORD
package. Symbols in
this package are commonly known as keywords. A keyword is very much
like an ordinary symbol but tends to serve as a special syntactic
marker for things like named arguments in a function call.
Many Lisp implementations outside of Common Lisp also have keywords, but it varies whether they are a kind of symbol or a disjoint datatype.
Scheme syntax talks about identifiers instead of symbols. The distinction is important for hygienic macros, but is not important when dealing only with surface syntax, so we ignore identifiers here.
POSE symbols do not have package names, only symbol names. POSE does
not have keywords. In order to avoid problems with Common Lisp, POSE
does not permit the traditional package marker :
within symbol names.
However, :
is permitted at the beginning of a symbol for pragmatic
reasons: it is very common to use keywords in Common Lisp data files
to avoid dropping them into a random package when they are read.
However, a single :
is not a POSE symbol, nor is a token beginning
with more than one :
. POSE doesn't care if a symbol starts with :
or not, as long as :foo
is distinct from foo
.
Clojure uses the slash /
as package name delimiter, and permits only
one slash to appear in a symbol. We judge that this is too extreme a
deviation from Lisp tradition, and POSE freely permits /
in symbols.
The traditional way to parse Lisp is to start by treating a symbol and a number as the same type of token, and reserve a bunch of characters that may appear in such tokens. Each contiguous sequence of these characters is read as one token. The parser then tries to interpret the token as a number. If it succeeds, the token becomes that number. If it fails, the token becomes a symbol.
For portable data, that has the unfortunate side effect that the same token that parses as a symbol in one Lisp dialect can parse as a number in another.
For example, consider the standard functions to increment and
decrement a number. In Common Lisp they are called by the symbols 1+
and 1-
. Standard Scheme would try and fail to interpret those tokens
as numbers. MIT Scheme uses 1+
and -1+
equivalent to 1-
. Clojure
cannot read any of the preceding tokens as symbols, opting to spell
out the names inc
and dec
instead.
In POSE we use the following rule:
Any token starting with a digit 0..9
must be a valid number.
Any token starting with either +
or -
followed by a digit must
be a valid number.
Any other token is a symbol.
Lisp dialects have a famous ambiguity involving nil
. Two parentheses
()
are used to write an empty list. In many dialects an empty list
is equivalent to the symbol nil
. nil
further doubles as the
boolean value false, and a non-nil
object stands for boolean true.
Traditionally the symbol t
is reserved as the default choice for a
true object.
The above conventions are arguably a bit of a hack, and there are dialects that disagree with all of them. This makes it tricky to standardize booleans in a portable notation.
The following table shows what different dialects do:
|===
|Common Lisp |t
|nil
|Emacs Lisp |t
|nil
|Autolisp |t
|nil
|Picolisp |t
|nil
|Newlisp |true
|nil
|Clojure / EDN |true
|false
|Lisp Flavored Erlang |true
|false
|Janet |true
|false
|Fennel |true
|false
|Urn |true
|false
|Hy |True
|False
|Scheme (R7RS alternative) |#true
|#false
|Scheme |#t
|#f
|===
The least ambiguous choice is by Scheme: #t
and #f
are the two
values of a disjoint boolean data type; ()
is the empty list; and
the symbols nil
and t
have no special meaning. We could use these
conventions in a portable notation, but that would still leave an
ambiguity in how to read it into Lisp dialects that conflate nil
and
()
as the same object.
We choose to dodge the issue in POSE by not saying anything about
boolean values. Going by the above survey, the symbols true
/false
or t
/f
would make for a reasonable convention to represent
booleans, but this is non-normative.
POSE does not have sets or maps. Most substantial Lisp implementations have maps or hash-tables, but there is no standard read syntax for them. A set data type is not a de facto standard. Both sets and maps can be simulated with lists.
The suggested extension for a POSE file is .pose
. This appears to be
unused by any common program.
The tentative plan is to register the internet media type text/pose
with IANA. In the meantime, text/x-pose
is suggested.
expressions = (atmosphere expression) atmosphere*
atmosphere = whitespace | comment whitespace = HT | VT | FF | space | newline newline = CR | LF comment = ';' and all subsequent characters until newline or eof
expression = list | string | number | symbol
list = '(' expressions ')'
string = '"' string-char* '"' string-char = string-esc | any-char-except-backslash string-esc = \ | \"
number = integer fraction exponent integer = minus? digit | minus? onenine digits digits = digit digits? fraction = "" | '.' digits exponent = "" | echar sign? digits
symbol = wordsym | signsym | colonsym wordsym = wordsym-1st wordsym-cont wordsym-1st = letter | punct-1st wordsym-cont = letter | punct-cont | digit signsym = sign signsym-rest? signsym-rest = signsym-2nd signsym-cont signsym-2nd = letter | punct-cont signsym-cont = letter | punct-cont | digit colonsym = ':' wordsym punct-1st = '!' | '$' | '&' | '*' | '+' | '-' | '/' | '<' | '=' | '>' | '_' punct-cont = punct-1st | '.' | '?' | '@'