lys-lang / node-ebnf

Create AST PEG Parsers from formal grammars in JavaScript
https://menduz.com/ebnf-highlighter/
MIT License

Unable to parse valid W3C EBNF #43

Open shellscape opened 2 years ago

shellscape commented 2 years ago

The grammar located here https://github.com/transpect/css-tools/blob/master/ebnf-scheme/CSS3.ebnf is valid W3C EBNF, as verified on railroad https://bottlecaps.de/rr/ui. This package throws an error that it could not parse the grammar at /node_modules/ebnf/dist/Grammars/W3CEBNF.js:288:19.

So it looks like there are some compatibility issues. Perhaps the grammar for W3C is out of date, given the age of the package?

shellscape commented 2 years ago

Additionally, this package cannot parse the EBNF grammar that railroad shows on its site:

import { Grammars } from 'ebnf';

const w3grammar = `Grammar ::= Production*
Production ::= NCName '::=' ( Choice | Link )
NCName ::= [http://www.w3.org/TR/xml-names/#NT-NCName]
Choice ::= SequenceOrDifference ( '|' SequenceOrDifference )*
SequenceOrDifference ::= (Item ( '-' Item | Item* ))?
Item ::= Primary ( '?' | '*' | '+' )*
Primary ::= NCName | StringLiteral | CharCode | CharClass | '(' Choice ')'
StringLiteral ::= '"' [^"]* '"' | "'" [^']* "'"
/* ws: explicit */
CharCode ::= '#x' [0-9a-fA-F]+
CharClass ::= '[' '^'? ( Char | CharCode | CharRange | CharCodeRange )+ ']'
Char ::= [http://www.w3.org/TR/xml#NT-Char]
CharRange ::= Char '-' ( Char - ']' )
CharCodeRange ::= CharCode '-' CharCode
Link ::= '[' URL ']'
URL ::= [^#x5D:/?#]+ '://' [^#x5D#]+ ('#' NCName)?
Whitespace ::= S | Comment
S ::= #x9 | #xA | #xD | #x20
Comment ::= '/*' ( [^*] | '*'+ [^*/] )* '*'* '*/'`;

const rules = Grammars.W3C.getRules(w3grammar);

This also fails with throw new Error('Could not parse ' + source); at the same line and position.

menduz commented 2 years ago

Hello, can you try ending the document/grammar string with a line-ending character?

kjhughes commented 1 year ago

Your Char production looks hosed:

Char ::= [http://www.w3.org/TR/xml#NT-Char]

(A URL doesn't belong in a bracket expression.)

shellscape commented 1 year ago

@kjhughes that's straight from W3C

kjhughes commented 1 year ago

The RHS is clearly meant to be metadata / documentation, not an EBNF regex. The URL references this EBNF:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
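To make those ranges concrete, here is a minimal sketch that transcribes the Char production above into a code-point test. The names `isXmlChar` and `XML_CHAR` are made up for illustration and are not part of node-ebnf.

```javascript
// Code-point test for the XML 1.0 Char production quoted above.
function isXmlChar(codePoint) {
  return (
    codePoint === 0x9 ||
    codePoint === 0xa ||
    codePoint === 0xd ||
    (codePoint >= 0x20 && codePoint <= 0xd7ff) ||
    (codePoint >= 0xe000 && codePoint <= 0xfffd) ||
    (codePoint >= 0x10000 && codePoint <= 0x10ffff)
  );
}

// The same production as a regular expression for a single character.
// The `u` flag lets the class address code points above #xFFFF.
const XML_CHAR =
  /^[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]$/u;
```

Written this way, it is easier to see that the bracketed URL in the railroad grammar stands in for a list of ranges rather than being a character class itself.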

jgeewax commented 1 year ago

@menduz : Just tried adding a newline at the end and that seemed to do the trick!

Might be worthwhile to not fail on no final newline character?
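Until something like that lands, a caller-side workaround might look like the sketch below. `withTrailingNewline` is a hypothetical helper, not part of the ebnf package; it only normalizes the string and leaves the parsing to whatever call you already use.

```javascript
// Hypothetical helper (not part of the ebnf package): make sure the grammar
// text ends with a line ending before it is handed to the parser.
function withTrailingNewline(source) {
  return /\n$/.test(source) ? source : source + '\n';
}

// With the call from the report above, usage would look like:
//   const rules = Grammars.W3C.getRules(withTrailingNewline(w3grammar));
```

This only addresses the missing final newline; it does not help with grammars that fail for other reasons.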

jimmcslim commented 1 year ago

I've tried adding a newline and am still not having any success. I've also been trying to parse https://github.com/messagetemplates/grammar/blob/master/message-template.ebnf without success.

Antony74 commented 1 year ago

Yes, adding a newline at the end of the string is a great tip! Additionally, even though the parser only gives you a yes/no answer as to whether the grammar parsed successfully, you can quickly narrow down the problem in the playground

https://menduz.github.io/ebnf-highlighter/

by starting with just one line at a leaf of your parse tree and building your EBNF file back up from there.

e.g. does this parse?

_LETTER-OR-DIGIT ::= [A-Za-z0-9]

No. How about this?

_LETTERORDIGIT ::= [A-Za-z0-9]

No. How about now?

LETTERORDIGIT ::= [A-Za-z0-9]

Yes. So does W3C EBNF not support an NCName entity starting with an underscore? Well, let's look at the node-ebnf source code. This is the top of W3CEBNF.ts:

// https://www.w3.org/TR/REC-xml/#NT-Name
// http://www.bottlecaps.de/rr/ui

// Grammar  ::= Production*
// Production   ::= NCName '::=' Choice
// NCName   ::= [http://www.w3.org/TR/xml-names/#NT-NCName]
// Choice   ::= SequenceOrDifference ( '|' SequenceOrDifference )*
// SequenceOrDifference ::= (Item ( '-' Item | Item* ))?
// Item ::= Primary ( '?' | '*' | '+' )?
// Primary  ::= NCName | StringLiteral | CharCode | CharClass | '(' Choice ')'
// StringLiteral    ::= '"' [^"]* '"' | "'" [^']* "'"
// CharCode ::= '#x' [0-9a-fA-F]+
// CharClass    ::= '[' '^'? ( RULE_Char | CharCode | CharRange | CharCodeRange )+ ']'
// RULE_Char    ::= [http://www.w3.org/TR/xml#NT-RULE_Char]
// CharRange    ::= RULE_Char '-' ( RULE_Char - ']' )
// CharCodeRange    ::= CharCode '-' CharCode
// RULE_WHITESPACE  ::= RULE_S | Comment
// RULE_S   ::= #x9 | #xA | #xD | #x20
// Comment  ::= '/*' ( [^*] | '*'+ [^*/] )* '*'* '*/'

That tells us to look it up here: http://www.w3.org/TR/xml-names/#NT-NCName

click through to the Name: https://www.w3.org/TR/REC-xml/#NT-Name

click through to the NameStartChar: https://www.w3.org/TR/REC-xml/#NT-NameStartChar
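Transcribing the productions behind those links gives a quick way to check a rule name. This is a minimal sketch, assuming the NameStartChar and NameChar ranges from XML 1.0 and treating an NCName as a Name without a colon. `isNCName` is a hypothetical helper, not something node-ebnf provides.

```javascript
// NameStartChar from XML 1.0, minus ':' (an NCName is a Name without colons).
const NC_START =
  'A-Z_a-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02FF\\u0370-\\u037D' +
  '\\u037F-\\u1FFF\\u200C-\\u200D\\u2070-\\u218F\\u2C00-\\u2FEF' +
  '\\u3001-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFFD\\u{10000}-\\u{EFFFF}';

// NameChar adds '-', '.', digits, #xB7 and two further ranges.
const NC_CHAR = NC_START + '\\-.0-9\\u00B7\\u0300-\\u036F\\u203F-\\u2040';

// The `u` flag is needed for the \u{...} escapes above #xFFFF.
const NCNAME = new RegExp(`^[${NC_START}][${NC_CHAR}]*$`, 'u');

function isNCName(text) {
  return NCNAME.test(text);
}
```

By this reading, `isNCName('_LETTER-OR-DIGIT')` and `isNCName('_LETTERORDIGIT')` both return true, while a name starting with a digit or a hyphen is rejected.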

Oh dear, it does look to me like you're supposed to be able to start an NCName entity with an underscore. So it does seem a shame that node-ebnf won't parse this. But hopefully what I've been able to demonstrate about how I would isolate a fault and investigate the cause is helpful?
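The one-line-at-a-time check can also be scripted rather than done by hand in the playground. The sketch below assumes one production per line and takes a caller-supplied predicate `parses` (hypothetical) that reports whether a piece of grammar text is accepted. `findFailingProductions` is likewise a name made up for illustration.

```javascript
// `parses` is a caller-supplied predicate (hypothetical): it should return
// true when the given grammar text is accepted by the parser under test.
// Assumes each production sits on a single line.
function findFailingProductions(grammarText, parses) {
  return grammarText
    .split(/\r?\n/)
    .filter((line) => line.includes('::=')) // keep production lines only
    .filter((line) => !parses(line + '\n')); // try each one in isolation
}

// A predicate for node-ebnf could wrap the call used earlier in this thread,
// since it throws when the grammar cannot be parsed:
//   const parses = (src) => {
//     try { Grammars.W3C.getRules(src); return true; } catch (e) { return false; }
//   };
```

Each line is tested with a trailing newline appended, so the missing-final-newline failure discussed earlier does not mask other problems.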