Open mclegrand opened 5 years ago
I assume that the SVG abstract parser is described in sufficient detail to guarantee a "greedy" approach, so that example would necessarily parse as a single 12
, right? (I'm saying this without trying to dig down into the parser details...)
That's definitely the case for CSS, at least.
The problem arose when implementing a ragel parser for svg transforms (in Inkscape), watching it cut floats in absurd places, and when debugging, seeing that the tree allowed it, then seeing that the spec allows it.
I "fixed" the problem for me by removing the ?
but I think the only non-ambiguous cases where a space is not necessary are too absurd to care and that the ?
should always be dropped in the spec in those cases (like scale(1..1) == scale(1,0.1)
, rotate(1e11e11e1)==rotate(10,10,10)
[edit : note that a greedy approach here won't work as reading 1e11 prevents reading a valid float after it, even though there exist a valid parsing] , etc, are technically allowed currently, but the spec is currently unclear on what happens on ambiguous cases like translate(1.1.1)
and those horrors should better be left invalid than implementation-dependent )
Hence this bug and MR
And hm, looking into the spec, I don't find any mention of whether the parsing is greedy-unto-failure like CSS or if it allows backtracking like a regex. None of the mentions of BNF or links to ABNF/EBNF seem to make that detail clear, unless I'm missing a detail somewhere.
mmh, reading further, maybe if I also implement a CSS lexer/tokenizer closely following https://www.w3.org/TR/css-syntax-3/#consume-number (which is sort of greedy) I will get a non-ambiguous result, instead of "just" using the BNF for numbers
All the attributes specified using CSS syntax are indeed well-specified in this regard; CSS tokenization is greedy (and is generic, with no influence from grammars or context), and then parsing over those tokens is non-greedy (well, as greedy as a standard regex).
But the transform attribute's syntax is specified in BNF with no clarification in this regard, unfortunately. I know the intention is that numbers parse greedily, tho.
OK, thanks!
In light of this, with the precision I don't think the spec has ambiguous syntax, just that the parsing could be much more simpler to do without the ?
as it would not require a tokenizer anymore :slightly_smiling_face: (this issue can be closed if that's ok)
Making the separator optional was an intentional change based on browser behavior, see https://github.com/w3c/csswg-drafts/issues/2623 (Unfortunately, it looks like no one thought to test Inkscape at the time!)
(For reference: The original syntax in SVG 11 required the separator. )
Part of that discussion covers the similarity with CSS and also SVG path parsing when it comes to optional separators if tokenization is unambiguous. (@mclegrand, you might want to check those sections of the Inkscape codebase to see how the other parsers handle the tokenization!) It should match how those other parsers handle the unseparated tokens: e.g., 5-5
same as 5 -5
, and 0.1.2
same as 0.1 0.2
.
I definitely don't want to start talking about using non-greedy parsing to try to avoid parsing errors. The matrix()
function requires six parameters, should the parser split up a single six-digit number to make it pass??? If you have any suggestions on wording to clarify that part, Tab, please contribute them. But otherwise I think we can close this issue.
Re-opening because the Transforms spec is ambiguous here. It specifies parsing of transform just by reference to BNF, which doesn't specify whether "123456" can match "six digit+ sequences" or not!
Impls, and agreement with CSS, dictates that it shouldn't - that's a single digit+ - but it's not clear in the spec.
translate and scale have
number ( comma-wsp? number )?
which allows fornumber number
:boom: without separators, which confusers parsers and seems like a bug, which should benumber ( comma-wsp number )?
. Same for rotate :( comma-wsp? number comma-wsp? number )?
should be( comma-wsp number comma-wsp number )?
(The ambiguity is : transform(12) should not be parsed as transform(1 2) just because "1" and "2" are numbers and hence the parsing can be done that way. )
edit : same for matrix which has comma-wsp? between all attributes