trueagi-io / hyperon-experimental

MeTTa programming language implementation
https://metta-lang.dev
MIT License

The space character is not properly parsed #633

Open ngeiswei opened 8 months ago

ngeiswei commented 8 months ago

What is your problem?

The space character, represented as ' ' in MeTTa, is not properly parsed.

How to reproduce your problem?

Run the following

! ' '

What would you normally expect?

[' ']

What do you get instead?

[']

What do you have to say?

I think the problem is that ' ' is understood as two separate ' symbols. This is consistent with the fact that

! 2 3

outputs [2].

I had a look at the definition of type_tokens in stdlib.py, but I cannot see what could be wrong; the problem might be buried inside the Rust parser.

luketpeterson commented 8 months ago

I am confused about what the correct behavior should be here. The single-quote character has no special meaning inside the s-expression parser, and spaces delimit unquoted symbols (i.e., symbols not enclosed in double quotes), so ' ' is simply parsed as the symbol atom ' twice. That is the current behavior.
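For reference, this is easy to observe from the Python bindings (a quick check, assuming the hyperon package's MeTTa.parse_all API):

from hyperon import MeTTa

metta = MeTTa()
# The space splits ' ' into two lexemes, so each ' becomes its own
# symbol atom rather than one character literal.
print(metta.parse_all("' '"))   # expected: two ' symbol atoms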

The question is: what behavior do you want? Specifically, what are you trying to achieve, i.e., what do you want to do with the space?

vsbogd commented 8 months ago

Nil introduced the "\'[^\']\'" token to create a grounded character type. Indeed, it doesn't work for spaces, because the Rust parser uses whitespace as a delimiter between lexemes. It does work for other characters, though.
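For context, the registration looks roughly like this (a sketch via the Python bindings' register_token; the actual stdlib wiring may differ):

from hyperon import MeTTa, ValueAtom

metta = MeTTa()
# A single-quoted character becomes a grounded Char value. The regex never
# sees "' '" because the lexer has already split that input on the space.
metta.register_token(r"'[^']'", lambda token: ValueAtom(token[1], 'Char'))

print(metta.run("! 'a'"))   # matches: yields a grounded character atom
print(metta.run("! ' '"))   # split into two ' lexemes before token matching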

vsbogd commented 8 months ago

I would say we could allow escaping, e.g. '\x20', to let the user add spaces (or other characters) as part of a token.
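To illustrate the idea (a hypothetical post-lexing step, not current hyperon behavior): '\x20' contains no literal space, so it survives whitespace splitting, and the escape would be expanded afterwards:

import re

def expand_escapes(lexeme: str) -> str:
    # hypothetical: rewrite \xNN sequences after the lexer has split on
    # whitespace, so the token '\x20' arrives intact and becomes "' '"
    return re.sub(r'\\x([0-9A-Fa-f]{2})',
                  lambda m: chr(int(m.group(1), 16)), lexeme)

assert expand_escapes(r"'\x20'") == "' '"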

luketpeterson commented 8 months ago

Nil introduced the "\'[^\']\'" token to create a grounded character type...

I see.

I would say we could allow escaping, e.g. '\x20', to let the user add spaces (or other characters) as part of a token.

Currently, escape evaluation only happens as part of string parsing (i.e., between double quotes).

IMO, we can go one of two ways to fix this issue in general:

  1. We can make the single-quote character a second type of quote character at the parser level. This is the direction Python, JavaScript, Perl, and many other languages have gone. The downside is that we give up the ability to have a solitary quote as a special sigil (like Rust's 'label syntax used for lifetimes, etc.).

If we go this way, (A) we can take a page from Perl's book and treat single-quoted blocks as raw strings (i.e., no escape substitution), or (B) we can go the Python route, treating both quote types the same and just requiring the opening and closing delimiters to match. (A sketch of both options follows after the list.)

  2. We can extend escaping to unquoted symbols, so that @vsbogd's suggestion above would work.

I have a strong preference for 1 over 2, but I don't have a preference between 1A and 1B. What do you think?
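For concreteness, here is how 1A and 1B differ (a hypothetical scanner, not the actual parser code; error handling for unterminated quotes omitted):

def scan_quoted(src: str, i: int, process_escapes: bool):
    quote = src[i]                  # opening delimiter: ' or "
    i += 1
    out = []
    while src[i] != quote:
        if process_escapes and src[i] == '\\':  # 1B: substitute escapes
            i += 1
            out.append({'n': '\n', 't': '\t'}.get(src[i], src[i]))
        else:                                   # 1A: raw, copy verbatim
            out.append(src[i])
        i += 1
    return ''.join(out), i + 1      # (contents, index past closing quote)

# 1A keeps backslashes literal; 1B turns \n into a newline:
assert scan_quoted(r"'a\nb'", 0, process_escapes=False)[0] == r"a\nb"
assert scan_quoted(r"'a\nb'", 0, process_escapes=True)[0] == "a\nb"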

vsbogd commented 8 months ago

Trying to brainstorm:

Supporting common ways of quoting is a possible solution. On the other hand, program authors can invent unusual quoting. We could also say that we support only double quoting: if the author of a token needs spaces inside it, the token should be wrapped in double quotes. In that case, one possible solution is to say that a character literal has the form "' '" (in fact, the tokenizer will receive the whole token, including the double quotes). This does introduce a special case for strings, though.
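For example (a sketch only: it assumes the tokenizer receives string tokens with their surrounding double quotes, as described above, and that this entry takes priority over the general string token, which is an assumption about registration order):

from hyperon import MeTTa, ValueAtom

metta = MeTTa()
# Hypothetical: a character literal written as "' '" in the program text.
# The whole token, double quotes included, reaches the constructor, so the
# character itself sits at index 2.
metta.register_token("\"'[^']'\"", lambda t: ValueAtom(t[2], 'Char'))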

One possibility is to use some rare character to introduce universal quoting (we should probably keep double quotes for strings in any case, as strings are very common), for instance the backtick `. We could add the rule that quoting can start only at a lexem start (if that is not already the case). Then the only way to quote something is with a backtick after a space, and one can still use tokens like some`thing`. It is still not very convenient to write character literals with backticks: `' '`.
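The "quoting can start only at a lexem start" rule could look like this (a hypothetical helper):

def starts_quote(src: str, i: int) -> bool:
    # A backtick opens a quoted span only at the start of the input or
    # after whitespace/parentheses. Inside a lexeme, as in some`thing`,
    # it stays an ordinary character.
    return src[i] == '`' and (i == 0 or src[i - 1] in ' \t\n()')

assert starts_quote("`' '`", 0)
assert not starts_quote("some`thing`", 4)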

We could allow escaping only for space characters between lexemes, as in (print '\ \t'). In this special case, space characters become part of the lexeme. A double backslash \\ is converted to a single backslash \ and does not escape the following space character, and a backslash followed by a non-space character does not escape it. Then unescaped spaces can be used only inside double quotes (which are converted to strings). For example:

'\ ' ; space character
"string with spaces and \\ backslashes"
atom-ends-by-backslash\\
atom\ with\ spaces\ inside
atom\with\backslashes\inside

This is almost equivalent to full escaping support in the code, but it allows using backslashes without doubling them (and I believe requiring doubled backslashes doesn't have much value).
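As a lexer pass, the rule above could look like this (a sketch; double-quoted string handling is omitted):

def split_lexemes(line: str):
    lexemes, cur, i = [], [], 0
    while i < len(line):
        c = line[i]
        if c == '\\' and i + 1 < len(line) and line[i + 1] in ' \t\\':
            cur.append(line[i + 1])   # \<space>, \<tab>, \\ escape one char
            i += 2
        elif c in ' \t':
            if cur:
                lexemes.append(''.join(cur))
                cur = []
            i += 1
        else:
            cur.append(c)             # \ before anything else is kept as-is
            i += 1
    if cur:
        lexemes.append(''.join(cur))
    return lexemes

assert split_lexemes(r"'\ '") == ["' '"]
assert split_lexemes(r"atom\ with\ spaces\ inside") == ["atom with spaces inside"]
assert split_lexemes(r"atom\with\backslashes") == [r"atom\with\backslashes"]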

@luketpeterson, I don't quite understand why you prefer introducing another kind of quoting over escaping inside the parser's input character stream. Could you please elaborate?

luketpeterson commented 8 months ago

On the other hand, program authors can invent unusual quoting...

They can... but only if they build their inputs from already-decomposed tokens, which means working within the lexemes provided by the parser.

If the author of a token needs spaces inside it, the token should be wrapped in double quotes...

Yes, although I think this might get annoying for the program author.

My thinking was that we could support two different quotes, double and single, and let the program author choose which tokens match each one. The meaning would still be imposed by the Tokenizer entries, but with two quote types to choose from, two different tokenizer entries could convert them to different atom types, say grounded chars and strings.
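Concretely, something like this (a sketch with the Python bindings; note that single-quoted spans do not currently reach the tokenizer as one token, so this presumes the parser change discussed here):

from hyperon import MeTTa, ValueAtom

metta = MeTTa()
# Hypothetical: each quote style gets its own tokenizer entry, so the two
# styles can map to different atom types.
metta.register_token(r"'[^']'", lambda t: ValueAtom(t[1], 'Char'))        # 'a'
metta.register_token(r'"[^"]*"', lambda t: ValueAtom(t[1:-1], 'String'))  # "ab"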

If we want even more general syntax at the parser level, we could support any character followed by a quote, which would give the parser even more flexibility. For example, a"This is an A string" and b"This is a B string", etc.

One possibility is to use some rare character to introduce universal quoting...

My feeling is that the single-quote and double-quote characters are pretty well understood, so deviating from them adds burden on users for a pretty small payoff.

@luketpeterson, I don't quite understand why you prefer introducing another kind of quoting over escaping inside the parser's input character stream. Could you please elaborate?

I feel like allowing non-standard characters in symbols invites a lot of ambiguity and bug surface area. We don't define rules for legal symbol names, and maybe we should. But right now, the parser effectively limits how badly users can shoot themselves in the foot with crazy characters inside their symbols, and parsing escape sequences makes it harder to keep those problems out.

Some of the issues at the top of my mind:

I guess my final question is the reverse: what is the use case for symbols that are allowed to contain any character?

vsbogd commented 8 months ago

Thanks Luke, I didn't realize you were concerned about symbols. I agree that there is no need to allow characters like spaces or parentheses inside a symbol name. On the other hand, I would allow such characters inside tokens. I also think we could let users add additional quotation notation if they need it. Taking that last point into account, I think we could add single-quote processing as a special case in the parser. But I think it is also worth trying to unify the single-quoting and double-quoting mechanisms, and to allow adding other kinds of quotes in the future.

luketpeterson commented 8 months ago

...there is no need to allow characters like spaces or parentheses inside a symbol name. On the other hand, I would allow such characters inside tokens.

The trouble is that, currently, a parsed SyntaxNode::WordToken gets converted to a symbol atom if no tokenizer entry matches it. We could change this, but that opens up other questions.

One possibility is a filter, where we say only a subset of characters is allowed in symbol atoms. The parser would give the tokenizer a chance to match the token, but if the tokenizer didn't match it and it contained illegal characters, that would be an error.
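A sketch of that filter (hypothetical names and symbol rules; the actual legal-character set is exactly the open question):

import re

SYMBOL_RE = re.compile(r"[^\s()\"]+")   # assumed legal-symbol pattern

def word_token_to_atom(text, tokenizer_entries):
    # The tokenizer gets first crack at the lexeme...
    for regex, constr in tokenizer_entries:
        if re.fullmatch(regex, text):
            return constr(text)
    # ...and only an unmatched token is validated as a symbol name.
    if not SYMBOL_RE.fullmatch(text):
        raise SyntaxError(f"illegal characters in symbol: {text!r}")
    return ('Symbol', text)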

Taking that last point into account, I think we could add single-quote processing as a special case in the parser. But I think it is also worth trying to unify the single-quoting and double-quoting mechanisms, and to allow adding other kinds of quotes in the future.

Agreed. I think the same code path can be used for multiple types of quotes, giving the tokenizer a lot of flexibility in what to do with them; it could possibly even become a parser parameter. I will see if this can be added without a lot of code change.

ngeiswei commented 8 months ago

Using 'c' as a single-character literal is standard in many languages, such as C, C++, Haskell, and Idris. It allows the difference between strings and chars to be drawn at the type level as well. It also allows keeping a solitary ' as a sigil, because as long as it is not sandwiching a single character, it can be used as a regular character within a symbol (which I and others have already done in our MeTTa code).

That said, I think I am somewhat agnostic about which is the best way to go.

luketpeterson commented 8 months ago

Using 'c' as a single-character literal is standard in many languages...

The issue is that the parser has no concept of literals except for strings. The parser constructs the AST without any knowledge of the tokenizer, and then certain syntax nodes are converted to atoms using the tokenizer. So matching something more complicated like a character literal would require introducing custom tokens at the parser level.

This might be a reasonable design in the abstract, but at the very least it would require a tokenizer redesign (which should probably happen regardless), and that is a fair amount of work, especially given that the tokenizer is effectively the mechanism for accessing operations and runtime state.

These are all reasonable changes and would bring the design of MeTTa more in line with a traditional language compiler/interpreter. But it's more work than I think we should take on right now.

it can be used as a regular character within a symbol (which I and others have already done in our MeTTa code)

This scares me a bit. Just about every other language is restrictive about which characters are allowed in symbol names, for good reasons. If MeTTa were just a framework to access programmatically, the bug surface area would be minimal, but as a language with a syntax, the chances of unfortunate interactions are much higher.

vsbogd commented 8 months ago

The issue is that the parser has no concept of literals except for strings.

I would say the collection of tokens is a collection of literals. Thus the parser does have a concept of literals, but the set of literals can be extended by the user.

it can be used as a regular character within a symbol (which I and others have already done in our MeTTa code)

Just about every other language is restrictive about which characters are allowed in symbol names, for good reasons.

Haskell (and, I believe, Scheme) allows using ' as part of an identifier, so it is not a rare practice.

luketpeterson commented 8 months ago

I would say the collection of tokens is a collection of literals. Thus the parser does have a concept of literals, but the set of literals can be extended by the user.

What I mean is that the parser makes no distinction between a literal and something that becomes a symbol atom, except in the special case of string literals (which currently also become symbols).

This conversation has me thinking I should fully fold the Tokenizer into the parser and tackle https://github.com/trueagi-io/hyperon-experimental/issues/409 as part of this issue.

To keep some semblance of sanity, I'll define a default tokenizer that more-or-less preserves the current MeTTa syntax.

Extending the syntax will always be an "at your own risk" feature. The fact that the syntax is fluid may create integration issues for other tooling that expects a stable syntax, e.g. editors, debuggers, source-control and merge tools, etc.