ipld / specs

Content-addressed, authenticated, immutable data structures

Add initial specification for selector syntax. #239

Closed creationix closed 3 years ago

creationix commented 4 years ago

This is a proposal for a selector syntax that closely models the semantics already in the IPLD selector format. The rationale and constraints for the design are included in the documents, as well as many examples and, hopefully, enough description of lexing/parsing behavior to make it unambiguous.

I would love feedback on what you like about this, what drives you crazy, and hopefully find out if this is a good direction.

creationix commented 4 years ago

After spending a day implementing a full parser, I've discovered the white-space rules are more interesting than I initially thought. The design goal was to make white-space 100% irrelevant to the parser. Most programming languages claim to not have significant white-space, but that's not entirely true for any real language.

The first exception, obviously, is that white-space within strings needs to be preserved. This is easy enough: I simply turn off white-space stripping within quoted sections.

The second case is one you typically don't realize. Most languages use white-space as token separators. For example:

# Escaped quotes
'This is ''one'' string with escaped quotes.'
# Escaped quotes or two strings?
'is this one string?' 'or two strings?'

The parsing of the second example depends on that space being between the two string literals. If we really ignore white-space, then it should parse the same if the space is removed.

Currently, this syntax really means it when it says no white-space, and the second will be a single string.
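To make the merging concrete, here is a minimal JS sketch (a hypothetical helper, not taken from the actual implementation) of the two-single-quote escaping rule; once the separating space is stripped, the doubled quote in the middle reads as an escape and the two literals collapse into one token:

```javascript
// Hypothetical helper illustrating the two-single-quote escaping rule.
// input[0] is assumed to be the opening single quote.
function lexString(input) {
  let value = ''
  let i = 1
  while (i < input.length) {
    if (input[i] === "'") {
      if (input[i + 1] === "'") { // doubled quote is an escaped quote
        value += "'"
        i += 2
        continue
      }
      return { value, length: i + 1 } // lone quote closes the string
    }
    value += input[i++]
  }
  throw new SyntaxError('Unterminated string')
}

// With the space removed, the two literals merge into one string token
// whose value contains an embedded quote:
const merged = lexString("'is this one string?''or two strings?'")
// merged.value === "is this one string?'or two strings?"
```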

Another example is identifiers. In most languages foo bar is clearly two identifiers while foobar is one. In this language, we ignore the space entirely and both look identical to the parser. It needs other clues to know where one identifier/selector starts and where another ends. This is accomplished by the language having a fixed set of identifiers and careful design to make sure that there is never ambiguity.

For example, we should avoid having a foo selector with short form f if there is also another selector whose short form is o. A developer may write foo and we might interpret it as f, o, o; or they might write f o o and we interpret it as foo.

This spec does specify which it should be (always foo, because the longest match wins). But we also want to avoid these situations whenever possible, since humans are not compilers and may have different expectations, given that most (if not all) existing languages don't really ignore white-space.
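A longest-first match makes this deterministic. Here is a minimal JS sketch (hypothetical function name; the identifier list is the one this proposal uses) of the rule that longer identifiers always win:

```javascript
// Identifiers sorted longest first, then lexically within the same length,
// so ambiguous prefixes always resolve to the longest keyword.
const identifiers = [
  'condition', 'recursive', 'recurse', 'fields', 'index', 'match',
  'range', 'union', 'all', '.', '*', '~', 'c', 'f', 'i', 'r', 'R', 'u',
]

// Hypothetical helper: return the first (i.e. longest) identifier that
// prefixes the remaining input, or null if none matches.
function lexIdentifier(input) {
  for (const ident of identifiers) {
    if (input.startsWith(ident)) return ident
  }
  return null
}

lexIdentifier('recursive(') // → 'recursive', not the short form 'r'
lexIdentifier('recurse')    // → 'recurse'
lexIdentifier('r')          // → the short form 'r'
```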

creationix commented 4 years ago

@warpfork One solution to the potential confusion with merged identifiers would be to preserve the white-space as tokens and tell the parser about them. The problem with this is that those spaces would have to be preserved in compact mode. This could nearly double the length of the minimized version and include lots of spaces, which can sometimes be problematic in URLs.

I'll consider it further though.

warpfork commented 4 years ago

Definitely meant my remark on that as a "2 cents". You've probably already thought about it much more than I have.

creationix commented 4 years ago

The initial lexer implementation is now done. The description in the spec for parsing out identifiers seems to be working. https://github.com/creationix/sel-parse-zig/blob/887c3628e11b4ff751e68a9ade4c18a9bf4daf25/src/lexer.zig

In particular, there is a hard coded list of identifiers expected here:

// Sorted by longest first, then lexical within same length.
const identifiers = .{
    "condition",
    "recursive",
    "recurse",
    "fields",
    "index",
    "match",
    "range",
    "union",
    "all",
    ".",
    "*",
    "~",
    "c",
    "f",
    "i",
    "r",
    "R",
    "u",
};

And then parsing those ends up being quite straightforward.

// Tokenize Identifiers
inline for (identifiers) |ident| {
    var matched = true;
    var i: u32 = 0;
    while (i < ident.len) {
        if (i >= input.len or ident[i] != input[i]) {
            matched = false;
            break;
        }
        i += 1;
    }
    if (matched) return Token{ .id = .Identifier, .slice = input[0..i] };
}
creationix commented 4 years ago

To get an idea of what the lexer tokens look like, this is the output of the following:

recursive(limit=5
  fields(
    'tree'(
      recursive(
        all(recurse)
      )
    )
    'parents'(
      all(recurse)
    )
  )
)
0       Id.Identifier   9B      `recursive`
9       Id.Open 1B      `(`
10      Id.Unknown      1B      `l`
11      Id.Identifier   1B      `i`
12      Id.Unknown      1B      `m`
13      Id.Identifier   1B      `i`
14      Id.Unknown      2B      `t=`
16      Id.Decimal      1B      `5`
17      Id.Whitespace   3B      `
  `
20      Id.Identifier   6B      `fields`
26      Id.Open 1B      `(`
27      Id.Whitespace   5B      `
    `
32      Id.String       6B      `'tree'`
38      Id.Open 1B      `(`
39      Id.Whitespace   7B      `
      `
46      Id.Identifier   9B      `recursive`
55      Id.Open 1B      `(`
56      Id.Whitespace   9B      `
        `
65      Id.Identifier   3B      `all`
68      Id.Open 1B      `(`
69      Id.Identifier   7B      `recurse`
76      Id.Close        1B      `)`
77      Id.Whitespace   7B      `
      `
84      Id.Close        1B      `)`
85      Id.Whitespace   5B      `
    `
90      Id.Close        1B      `)`
91      Id.Whitespace   5B      `
    `
96      Id.String       9B      `'parents'`
105     Id.Open 1B      `(`
106     Id.Whitespace   7B      `
      `
113     Id.Identifier   3B      `all`
116     Id.Open 1B      `(`
117     Id.Identifier   7B      `recurse`
124     Id.Close        1B      `)`
125     Id.Whitespace   5B      `
    `
130     Id.Close        1B      `)`
131     Id.Whitespace   3B      `
  `
134     Id.Close        1B      `)`
135     Id.Whitespace   1B      `
`
136     Id.Close        1B      `)`
rvagg commented 4 years ago

I really like how the two forms are really just one form with shortenings (and no comments for the URL form), but we're trading that against ease of implementation: every environment that needs to be able to use these would need a custom parser written. It doesn't seem difficult, but it's worth noting as we expand our language support that we're making that exchange (Go, JS, and now Rust; Filecoin is also using selectors and is being implemented in C++, and maybe others too, and this might be something they want at some point).

creationix commented 4 years ago

@rvagg FWIW, I'm working on two concurrent implementations: one in vanilla JS and one in Zig, which can be compiled to WebAssembly or a C ABI library for use in virtually any language. I plan to tweak this spec based on my findings from the two parsers to ensure it's not more difficult than it needs to be.

creationix commented 4 years ago

While implementing multiple versions of the parser, I'm getting more and more convinced this spec needs to be more automated and formalized. I'm now spiking on a syntax grammar that can be automatically derived from the selector schema directly. The design will still be similar to what's proposed here, but it should be more consistent and less hand-crafted, making tooling across the board easier.

creationix commented 4 years ago

Ok, the JS parser can now correctly compile all the sample selectors in this spec. The new approach worked well. Basically, I load the existing IPLD Schema for selectors and generate a parser from that. In order to match the proposed syntax, the parser generator accepts a list of "aliases" for various types. For example, this is the line in JS that generates the selector syntax parser:

const schema = schemaParse(readFileSync(`${__dirname}/selectors.ipldsch`, 'utf8'))

const parseSelectorEnvelope = makeParser(schema, "SelectorEnvelope", {
  Matcher: ['match', '.'],
  ExploreAll: ['all', '*'],
  ExploreFields: ['fields', 'f'],
  ExploreIndex: ['index', 'i'],
  ExploreRange: ['range', 'r'],
  ExploreRecursive: ['recursive', 'R'],
  ExploreUnion: ['union', 'u'],
  ExploreConditional: ['condition', 'c'],
  ExploreRecursiveEdge: ['recurse', '~'],
  RecursionLimit_None: ['none', 'n'],
})

The types mentioned already exist in the schema; I'm creating a semi-automated DSL by specifying the entry point plus long and short keywords for some types.

Note that this library could be used to create a DSL for any IPLD data structure that has a schema.
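As a sketch of how such an alias table might feed a generated parser (hypothetical helper; the real makeParser lives in the JS library mentioned above), the long and short keywords can be inverted into a keyword-to-type lookup:

```javascript
// A subset of the alias table shown above, mapping schema type names to
// their long and short keywords.
const aliases = {
  Matcher: ['match', '.'],
  ExploreAll: ['all', '*'],
  ExploreUnion: ['union', 'u'],
  ExploreRecursiveEdge: ['recurse', '~'],
}

// Hypothetical helper: invert the alias table so a lexer can map any
// keyword token back to the schema type it selects.
function buildKeywordTable(aliases) {
  const table = new Map()
  for (const [typeName, keywords] of Object.entries(aliases)) {
    for (const kw of keywords) table.set(kw, typeName)
  }
  return table
}

const table = buildKeywordTable(aliases)
table.get('u')     // → 'ExploreUnion'
table.get('match') // → 'Matcher'
```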

creationix commented 4 years ago

I found a case where the parentheses are significant and can't always be removed when converting to short form. Consider the following selector:

union(
  union(
    match
  )
  match
)

Since union (aka ExploreUnion) contains a list (aka [Selector]), it consumes an arbitrary number of selectors. If the parentheses are removed, both matches will be put under the inner union. Therefore the short form of this is:

# Correct short form
uu(.).

# Wrong short form
uu..

We could keep it simple and say the minimizer always preserves parentheses when encoding a list. I don't even know of any real-world use cases for ExploreUnion and would love to see some. A smarter minimizer could do some basic analysis to know when the parentheses are required and include them only then.
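For illustration, the two readings can be written out as (hypothetical) parse trees; the only difference is which union owns the second match:

```javascript
// uu(.). — the closing paren ends the inner union, so the outer union
// gets two children:
const correct = {
  union: [
    { union: [{ match: {} }] },
    { match: {} },
  ],
}

// uu.. — without the paren, the inner union keeps consuming selectors
// and swallows both matches:
const wrong = {
  union: [
    { union: [{ match: {} }, { match: {} }] },
  ],
}
```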

creationix commented 4 years ago

Another case where they are required is labeled matchers inside of fields.

# Fields with labels
fields(
  'with-label'(
    match('label')
  )
  'without-label'(
    match
  )
)

# Properly Minimized
f'with-label'(.'label')'without-label'.

# Another Properly Minimized
f'with-label'.('label')'without-label'.

# Broken minimized
f'with-label'.'label''without-label'.

This brings up another question about normalization of minimized form. Are we OK with there being multiple correct short forms? Does it matter?

creationix commented 4 years ago

@rvagg, you were right!

Strings are problematic too. The less common escaping method used for strings (doubling the single quote) works, but it also introduces cases where multiple tokens are merged. For example, consider the following:

fields
  'foo'
    match 
      label='blue'
  'bar'
    match

This currently breaks because the 'blue' label and the 'bar' field are merged into a single label containing "blue'bar".

Parser error output:

SyntaxError: "fields'foo'matchlabel='blue''bar'match"
                                               ^ Unexpected extra syntax at 33

I propose we switch to a more traditional string syntax with backslash escaping. It will add some bloat when URL-encoded, but overall it should be an improvement.
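A minimal sketch (hypothetical helper, assuming only the quote and the backslash itself need escaping) of what a backslash-escaped string lexer would look like; since a bare quote always terminates the token, adjacent strings can no longer merge:

```javascript
// Hypothetical backslash-escaped string lexer. input[0] is assumed to be
// the opening single quote.
function lexString(input) {
  let value = ''
  let i = 1
  while (i < input.length) {
    const c = input[i]
    if (c === '\\') {        // backslash escapes the next character
      value += input[i + 1]
      i += 2
    } else if (c === "'") {  // an unescaped quote always closes the string
      return { value, length: i + 1 }
    } else {
      value += c
      i += 1
    }
  }
  throw new SyntaxError('Unterminated string')
}

// 'blue' followed immediately by 'bar' now lexes as two separate tokens:
lexString("'blue''bar'") // → { value: 'blue', length: 6 }
lexString("'it\\'s'")    // → { value: "it's", length: 7 }
```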

rvagg commented 4 years ago

This brings up another question about normalization of minimized form. Are we OK with there being multiple correct short forms? Does it matter?

I don't know specifically for this, but we keep finding cases elsewhere where not having one-single-way turns out to be a problem. I don't know how that would show up here, but if someone chose to encode a selector string rather than a full selector as per the schema, then having more than one way to say the same thing might be a problem. There are a lot of byte-shavers around who will look at this work and the full selector schema and opt for the short string form.

Would it be a big deal to make a "must be the simplest accurate representation" rule that would give us one-single-way?

Re ExploreUnion: it's a selection strategy that is only useful when you have more than one selector to use, isn't it? For practical purposes you should always have more than one item in the list, making parens always necessary and your example overly simple. There will be cases of poorly crafted selectors that misuse it, but that shouldn't be the normal case.

mikeal commented 3 years ago

We've learned a lot from this, but we're not quite sure how we want to handle simplified string representations for selectors and paths. Closing for now.