Portable S-expressions (POSE)

:toc: left

Abstract

S-expressions, short for symbolic expressions, are the syntactic backbone of the Lisp programming language family. While each dialect uses a slightly different variant for its native syntax, there is broad agreement on the fundamentals. This document gives a precise specification and a rationale for a syntax that is close to a least common denominator.

Such a pidgin syntax is appropriate for the interchange of data files across language boundaries. It is also fit for code written in some of the kind of domain-specific languages that Lisp handles well.

The proposed syntax is simple to implement in full. A reader is about 200 lines of code in a typical high-level language. The authors hope this will encourage the use of S-expressions beyond the Lisp community.

Rationale

The origin of S-expressions

S-expressions first appeared in the paper that introduced the Lisp programming language, Recursive Functions of Symbolic Expressions and Their Computation by Machine (McCarthy 1960). The expressions in that paper were made out of symbols and nested lists.

This seminal form of S-expressions has since been extended with syntax for additional data types. Among them are numbers, strings, boolean values, sets, and maps (also known as hash tables or dictionaries). The ability to write comments has also been added.

Data types

In designing a portable S-expression syntax, the first decision is which data types to support.

Lisp would lose its essence without the lists and symbols from McCarthy's paper. Any variant of S-expressions needs them.

Numbers and strings are essential for practical use.

Booleans are essential, but there is little agreement on how to represent them.

Beyond lists, other collection types such as vectors, sets, maps, and user-defined record types, are omitted.

Likewise, we omit explicit support for specialized data such as timestamps, units of measure, and file system pathnames. Values of these types are subject to so many idiosyncratic and changing rules that it's difficult to strike a balance between being comprehensive and being useful.

Lists and vectors

In everyday usage, the English word list means any sequence of elements. Programmers define lists with mathematical precision: zero-element, one-element, and two-element lists still count as lists.

The custom in Lisp is to implement a list as a singly linked list. This custom carries over to most functional programming languages. These languages use the word vector for sequences (implemented using arrays or trees) supporting faster than O(n) random access, and reserve the word list for O(n) linked lists.

A linked list is a sequence of pairs where the first part of each pair contains a list element, and the second part points to the next pair. (In the last pair of the list, the second part is nil.) In statically typed functional languages this rule is enforced by the type system. In Lisp the rule is merely a convention, and each part of a pair can actually point to any type of object. That relaxation of the rule gives rise to the following special cases:

It's possible to construct a list where the second part of the last pair is neither a pair nor nil. In Lisp, we call such lists dotted lists, and the last (possibly the only) pair is a dotted pair.
When pairs are mutable, it's possible to construct a never-ending circular list by making the second part of the last pair point back to an earlier pair in the list. Usually the last pair points back to the first pair, but it's possible for a list to start with a non-circular prefix which leads to a circular sublist.
Multiple lists can share sublists, and trees build out of pairs can share subtrees. Multiple pairs can point to the same pair in their second part.

A notation supporting these special cases needs two notational aids:

A consing dot that can be put before the final element of the list to make a dotted pair.
A syntax for labels and references to describe shared structure.

We choose not to include either of those in POSE. Most languages outside of functional programming use vectors for their most idiomatic list-like data type, and even functional programming langauges forbid dotted pairs. We can also envision a future Lisp dialect where vectors (instead of lists) are the basic sequence type. In the interest of easier portability to all of these languages, it is better to avoid the special cases. A proper list like (1 2 3) is also easy for non-lispers to read and write without special instruction.

One more restriction that statically typed languages have is that list elements all need to be of the same type. Since this is a semantic concern, not a syntax concern, we permit heterogeneous lists like Lisp and other dynamically languages do. In statically typed languages, a parser can be implemented by using a union type to cover the range of data types that can appear in S-expressions.

Numbers

Lisp dialects have traditionally supported an especially large variety of numbers, made the usage convenient, and specified numerical results quite precisely. In Lisp tradition, a full numeric tower is:

Integer (negative and nonnegative, arbitrarily large magnitude).
Ratio of two integers.
Real number (implemented as floating point).
Complex number.

Not all programming languages have complex numbers, and fewer still have ratios, making them problematic in a portable notation. Hence we stick to integers and floating point numbers in POSE.

Lisp dialects tend to transparently switch numbers between machine-word size fixnums to larger bignums as required. There is no special syntax for bignums. Clojure breaks with Lisp tradition here by having a syntactic suffix for bignums. POSE sticks to tradition, and does not distinguish bignums and fixnums.

Clojure adds another peculiarity to Lisp's traditionally clean number syntax: a leading zero causes a number to be interpreted in octal (base 8). This feature is borrowed from the C family, is a common source of confusion, and we argue that it is best avoided. However, it does s place a prerogative on the designer of a portable syntax to forbid leading zeros in order to avoid the ambiguity.

Some Lisp implementations permit putting underscores between groups of digits for readability. This is convenient but not yet common, so we omit this feature from POSE.

Symbols, strings, characters, and byte vectors

Symbols and strings are two very similar data types that Lisp has long held separate. The name of a symbol is a string, but the symbol itself is not a string. The name has to be deliberately extracted in order to be handled as a string. In Common Lisp this is done by symbol-name, in Scheme by symbol\->string.

Lisp started out with only symbols due to its origins in abstract computer science. Strings were added later. Most languages do not have a standard mapping from source code to the data structures in the language, and do not need a standard symbol type for that reason. All Lisp programs have a standard mapping to Lisp data, and symbols are the data type that corresponds to identifiers in Lisp programs.

In Lisp, strings are used for most user-defined data. Lispers continually entertain the thought of merging symbols and strings into one data type but it won't work out. Both types are needed.

Many Lisp dialects also have a character data type that is disjoint from the string and integer types. In Common Lisp and Scheme this is written as #\a #\A #\space. Since dialects are not unanimous in having a character type, and Unicode makes the concept of a character somewhat dubious, we omit the syntax from POSE.

Case sensitivity in symbols

For a long time, there has been a debate over whether Lisp symbols should be case-sensitive or case-insensitive. (Strings are always case sensitive for obvious reasons; there is no debate in the community about them.) Lisp is old enough that symbols used to be written in uppercase. When the Common Lisp standard came around, it dictated that lowercase or mixed-case symbols shall be normalized to uppercase equivalents when read in, with a reader option to change this behavior. Scheme has not traditionally dictated symbol case, but lowercase is the default starting with R^6^RS (2007). Almost all Scheme implementations are now case sensitive by default, and use lowercase.

There is a clear long-term trend among programming languages that case sensitivity is winning, and more and more Lisp dialects and implementations are following suit. However, since most data is sent using lower case symbols anyway, and a case-insensitive recipient will throw away the difference between foo, Foo, and FOO if the native read procedure is used, POSE restricts letters in symbols to ASCII lower case only.

Symbol packages, identifiers, and keywords

Common Lisp puts symbols in packages; a symbol has two parts: the package name and the symbol name. These are separated by a colon as in package:symbol. If the colon is missing, the symbol is interned in the current package. If the colon is present but the package name is blank, the symbol is interned in the KEYWORD package. Symbols in this package are commonly known as keywords. A keyword is very much like an ordinary symbol but tends to serve as a special syntactic marker for things like named arguments in a function call.

Many Lisp implementations outside of Common Lisp also have keywords, but it varies whether they are a kind of symbol or a disjoint datatype.

Scheme syntax talks about identifiers instead of symbols. The distinction is important for hygienic macros, but is not important when dealing only with surface syntax, so we ignore identifiers here.

POSE symbols do not have package names, only symbol names. POSE does not have keywords. In order to avoid problems with Common Lisp, POSE does not permit the traditional package marker : within symbol names.

However, : is permitted at the beginning of a symbol for pragmatic reasons: it is very common to use keywords in Common Lisp data files to avoid dropping them into a random package when they are read. However, a single : is not a POSE symbol, nor is a token beginning with more than one :. POSE doesn't care if a symbol starts with : or not, as long as :foo is distinct from foo.

Clojure uses the slash / as package name delimiter, and permits only one slash to appear in a symbol. We judge that this is too extreme a deviation from Lisp tradition, and POSE freely permits / in symbols.

Distinguishing numbers from symbols

The traditional way to parse Lisp is to start by treating a symbol and a number as the same type of token, and reserve a bunch of characters that may appear in such tokens. Each contiguous sequence of these characters is read as one token. The parser then tries to interpret the token as a number. If it succeeds, the token becomes that number. If it fails, the token becomes a symbol.

For portable data, that has the unfortunate side effect that the same token that parses as a symbol in one Lisp dialect can parse as a number in another.

For example, consider the standard functions to increment and decrement a number. In Common Lisp they are called by the symbols 1+ and 1-. Standard Scheme would try and fail to interpret those tokens as numbers. MIT Scheme uses 1+ and -1+ equivalent to 1-. Clojure cannot read any of the preceding tokens as symbols, opting to spell out the names inc and dec instead.

In POSE we use the following rule:

Any token starting with a digit 0..9 must be a valid number.
Any token starting with either + or - followed by a digit must be a valid number.
Any other token is a symbol.

Booleans, nil, and the empty list

Lisp dialects have a famous ambiguity involving nil. Two parentheses () are used to write an empty list. In many dialects an empty list is equivalent to the symbol nil. nil further doubles as the boolean value false, and a non-nil object stands for boolean true. Traditionally the symbol t is reserved as the default choice for a true object.

The above conventions are arguably a bit of a hack, and there are dialects that disagree with all of them. This makes it tricky to standardize booleans in a portable notation.

The following table shows what different dialects do:

The least ambiguous choice is by Scheme: #t and #f are the two values of a disjoint boolean data type; () is the empty list; and the symbols nil and t have no special meaning. We could use these conventions in a portable notation, but that would still leave an ambiguity in how to read it into Lisp dialects that conflate nil and () as the same object.

We choose to dodge the issue in POSE by not saying anything about boolean values. Going by the above survey, the symbols true/false or t/f would make for a reasonable convention to represent booleans, but this is non-normative.

Sets and maps

POSE does not have sets or maps. Most substantial Lisp implementations have maps or hash-tables, but there is no standard read syntax for them. A set data type is not a de facto standard. Both sets and maps can be simulated with lists.

Specification

File name extension

The suggested extension for a POSE file is .pose. This appears to be unused by any common program.

MIME type

The tentative plan is to register the internet media type text/pose with IANA. In the meantime, text/x-pose is suggested.

Grammar

expressions = (atmosphere expression) atmosphere*

expression = list | string | number | symbol

list = '(' expressions ')'

string = '"' string-char* '"' string-char = string-esc | any-char-except-backslash string-esc = \ | \"

number = integer fraction exponent integer = minus? digit | minus? onenine digits digits = digit digits? fraction = "" | '.' digits exponent = "" | echar sign? digits

symbol = wordsym | signsym | colonsym wordsym = wordsym-1st wordsym-cont wordsym-1st = letter | punct-1st wordsym-cont = letter | punct-cont | digit signsym = sign signsym-rest? signsym-rest = signsym-2nd signsym-cont signsym-2nd = letter | punct-cont signsym-cont = letter | punct-cont | digit colonsym = ':' wordsym punct-1st = '!' | '$' | '&' | '*' | '+' | '-' | '/' | '<' | '=' | '>' | '_' punct-cont = punct-1st | '.' | '?' | '@'

s-expressions / pose

readme