pml-lang / pml-companion

Java source code of the 'PML Companion (PMLC)'

https://www.pml-lang.dev

GNU General Public License v2.0


Word Boundaries in Opening Node Tags #39

Open tajmone opened 3 years ago

tajmone commented 3 years ago

The line An [c\[admon] block raises a conversion error:

Error: 'c\' is an invalid node tag.

Forcing to change the line to An [c \[admon] block.

I would expect the pmlc parser to consider any invalid token ID character (i.e. [^a-z_]) as a word boundary, making it unnecessary to insert a space when a tag is followed by an escape, the [ of a nested node, etc.

At least, this is how I've implemented most tags in Sublime PML, where nodes are usually captured via a RegEx like (?<!\\)\[c\b. This seems the natural way to handle opening node tags, for it will cover all sorts of contexts, e.g. the tag being followed by spaces, tabs, or even a new line:

Some [c
    inline
    code
]

which is perfectly valid PML code.

In any case, these details are important to know in order to implement editor syntaxes that correctly mimic pmlc's behaviour; as the saying goes, "The devil is in the details".

pml-lang commented 3 years ago

The line An [c\[admon] block raises a conversion error:

Yes, because it's invalid syntax.

Forcing to change the line to An [c \[admon] block.

That's the correct way to write it.

unnecessary to insert a space

A space must be inserted. The (not yet documented) rule is: A name must be ended by

a whitespace character (space, tab, or new line) or
a ] (in case of an empty node like [name])

The same rule is applied in the new parser.

I would expect the pmlc parser to consider any invalid token ID character (i.e. [^a-z_]) as a word boundary

That would be convenient in some cases. But more error-prone for humans. Consider this code:

[name*foo bar]

It reads like a node with name name*foo, and content bar. But it would indeed be a node with name name and content *foo bar.

The 'whitespace after name' rule avoids such traps.
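To make the trap concrete, here is a minimal Python sketch comparing a word-boundary pattern with the rule as stated in this comment. The tag `name`, the pattern names and the `accepts` helper are illustrative only, and Python's `re` module is just a stand-in for whatever engine an editor or parser really uses.

```python
import re

# Hypothetical tag, used only for illustration.
TAG = "name"

# Word-boundary approach: the tag ends at any non-word character.
boundary_pattern = re.compile(r"(?<!\\)\[" + TAG + r"\b")

# Rule as stated above: the name must be followed by a space, tab,
# new line, or a closing bracket (empty node).
strict_pattern = re.compile(r"(?<!\\)\[" + TAG + r"(?=[ \t\r\n\]])")


def accepts(pattern, source):
    """Return True if the pattern finds an opening tag in source."""
    return pattern.search(source) is not None


for sample in ["[name foo bar]", "[name]", "[name*foo bar]"]:
    print(sample, accepts(boundary_pattern, sample), accepts(strict_pattern, sample))
```

Only the last sample separates the two approaches: the word-boundary pattern accepts `[name*foo bar]`, the whitespace rule rejects it.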

tajmone commented 3 years ago

A space must be inserted.

That's going to complicate the details of many editor syntaxes: whereas the intuitive solution is a general RegEx pattern \[<tag>\b to cover a node, this rule requires a RegEx that explicitly accounts for a trailing whitespace or ].

a whitespace character (space, tab, or new line) or

The problematic aspect here is the definition of whitespace character, which varies from one RegEx engine to another, and usually covers more elements than just those three listed above (non-breaking space, thin-space and other sized-spaces, vertical tab, non-Latin whitespace chars, etc.).

Hopefully the \s should cover them all, but that largely depends on the RegEx engine being used, the settings used to compile it, and the locale the PML source is being written in.

E.g. in the Oniguruma engine, whitespace is defined as:

  \s       whitespace char

           Not Unicode:
             \t, \n, \v, \f, \r, \x20

           Unicode case:
             U+0009, U+000A, U+000B, U+000C, U+000D, U+0085(NEL),
             General_Category -- Line_Separator
                              -- Paragraph_Separator
                              -- Space_Separator

and Unicode space as:

      space    Space_Separator | Line_Separator | Paragraph_Separator |
               U+0009 | U+000A | U+000B | U+000C | U+000D | U+0085

etc.

The point is that an editor syntax should mark as invalid a malformed node (in this case, not followed by a whitespace). I think that most syntax developers will be tempted to just use the \[<tag>\b pattern, which would then accept without complaints something like [name*foo bar].

The correct alternative RegEx seems to be the \[<tag>(\s|$) pattern (you also need to account for a tag occurring at the end of a document); but the definition of \s (as mentioned above) will vary from engine to engine, and it's beyond the developer's control.
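As a rough illustration of how much \s can vary even within a single engine, the following Python sketch classifies a few characters under three definitions. It only shows the behaviour of Python's `re` module; other engines (Oniguruma included) may differ, and the `candidates` table and `classify` helper are made up for this example.

```python
import re

# Characters that might follow a tag name in a real document.
candidates = {
    "SPACE": "\u0020",
    "TAB": "\u0009",
    "FORM FEED": "\u000c",
    "NO-BREAK SPACE": "\u00a0",
    "THIN SPACE": "\u2009",
    "WORD JOINER": "\u2060",
}

unicode_ws = re.compile(r"\s")            # Unicode-aware for str patterns
ascii_ws = re.compile(r"\s", re.ASCII)    # restricted to [ \t\n\r\f\v]
explicit_ws = re.compile(r"[ \t\r\n]")    # fixed set, no engine-defined class


def classify(ch):
    """Return (unicode \\s, ASCII \\s, explicit set) match results for ch."""
    return (
        bool(unicode_ws.match(ch)),
        bool(ascii_ws.match(ch)),
        bool(explicit_ws.match(ch)),
    )


for label, ch in candidates.items():
    print(f"{label:15} {classify(ch)}")
```

The same \s token gives different answers depending on a single compile flag, which is the kind of setting a syntax author often cannot control.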

If we were dealing with ordinary code, this wouldn't have been a problem, but since PML sources are text documents you can't really predict what characters to expect (anything Unicode might be there).

All I'm saying here is that considering a tag as ending on a boundary character (i.e. any char which is not a valid tag ID char) offers a stricter definition, whereas relying on "whitespace" seems more of a generically-defined behaviour.

Both approaches have pros and cons, but it's worth considering and exploring the cons too, just to be aware of potential issues.

pml-lang commented 3 years ago

the problematic aspect here is the definition of whitespace character

Yes, unfortunately there is no universal standard definition for 'whitespace'. To keep it simple, a whitespace character after a tag name in PML is defined as a "space, tab, or new line", as stated in my previous comment (i.e. regex [ \t\r\n]). That's a simple, non-ambiguous definition, mentioned in Wikipedia's chapter Definition and ambiguity on page 'Whitespace character':

The most common whitespace characters may be typed via the space bar or the tab key. Depending on context, a line-break generated by the return or enter key may be considered whitespace as well.

If we used \s or \b then the behavior would depend on the regex flavor we're working with, which means 'different behavior in edge cases'.

Currently a non-empty PML tag is defined by the regex \[<tag>[ \t\r\n]. That's the rule applied in the parser. The whitespace character is optional in case of an empty node: \[<tag>[ \t\r\n]?\]

A downside of this approach is that if a tag name is followed by a character that is considered to be whitespace in other definitions (e.g. a form feed) then the PML parser will generate an error (illegal character in tag name). However this kind of illegal code will probably be very rare in practice.
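The two patterns given in this comment can be tried out directly. The sketch below builds them for an arbitrary tag in Python; `tag_patterns` and `opens_node` are hypothetical helper names, and this is not how PMLC itself is implemented.

```python
import re


def tag_patterns(tag):
    """Build the non-empty and empty node patterns quoted above for one tag."""
    name = re.escape(tag)
    non_empty = re.compile(r"\[" + name + r"[ \t\r\n]")
    empty = re.compile(r"\[" + name + r"[ \t\r\n]?\]")
    return non_empty, empty


def opens_node(tag, source):
    """Return True if source starts with a valid opening of the given tag."""
    non_empty, empty = tag_patterns(tag)
    return bool(non_empty.match(source) or empty.match(source))


print(opens_node("c", "[c code]"))      # separator is a plain space
print(opens_node("c", "[c]"))           # empty node
print(opens_node("c", "[c\x0ccode]"))   # form feed after the name
print(opens_node("c", "[c\\[admon]"))   # escape right after the name
```

The form feed case is the downside described above: it is whitespace under many definitions, but not under this one.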

you also need to account for a tag occurring at the end of a document

That would be invalid PML code, if it appears at the end of a document. But in the new parser it can be valid if it's at the end of an inserted file (because in very rare cases the tag could continue in the parent file).

tajmone commented 3 years ago

A downside of this approach is that if a tag name is followed by a character that is considered to be whitespace in other definitions (e.g. a form feed) then the PML parser will generate an error (illegal character in tag name). However this kind of illegal code will probably be very rare in practice.

I don't think it would be so rare in real editing practice. If PML is to support non-Western languages it needs to consider what whitespace characters are used in those locales. Even within the European languages, French has special rules regarding the use of thin-spaces when enclosing text in double quotes or Guillemets:

The French language is known to pose many challenges to lightweight markup syntaxes when it comes to correctly spacing punctuation marks (e.g. when to insert thin-spaces, non-breaking spaces, etc.) which often requires special extensions to handle them.

What happens if an opening tag is followed by a control character to change text direction? (e.g. a western text that needs to insert some Arabic or Hebrew words or sentences). A [span node is a good candidate to be found in such circumstances.

It might be tempting to say "just insert a space, PMLC will consume it"; except that this might not work in many cases, e.g. when you want/need to apply a node to part of a word without splitting it semantically. For instance, Arabic has a cursive-like alphabet, where letters are joined with each other by a base line, so you can't insert a space (any kind of space) without breaking the word (or word compound).

I've encountered this whitespace problem so many times in real document editing work. I can't vouch for all the existing languages (spoken or dead) covered by Unicode, but knowing Arabic and Hebrew I can assure you that the \[<tag>[ \t\r\n] definition is going to turn out a problem in supporting RTL Semitic languages.

I am skeptical of undefined behaviours, and trusting the "whitespace" definition as a reference is basically opening the doors to undefined behaviour. Look at how much damage undefined behaviours have done with C and C++, being the root cause of most bugs and vulnerabilities.

That would be invalid PML code, if it appears at the end of a document.

I meant that an editor syntax (or an LSP Lang Server) has to account also for the possibility that the text being edited might lead to an opening node being at the end of the document.

tajmone commented 3 years ago

Anyhow, I'll update Sublime PML to implement the suggested pattern, and also add some edge cases tests in the PML Playground to see how it plays out in real case usages.

I'm assuming that the first whitespace character following the tag is consumed by the parser, i.e. it's considered a mere separator which won't generate the equivalent whitespace char in the output (that would alleviate potential problems).

pml-lang commented 3 years ago

the first whitespace character following the tag is consumed by the parser, i.e. it's considered a mere separator which won't generate the equivalent whitespace char in the output

Yes, sure. The parser consumes the mandatory whitespace character after the tag's name, so that it does not appear in the text. After this whitespace character, any Unicode character can be inserted as part of the text, including all Unicode whitespace characters.
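The consumption of the separator can be sketched with a toy converter. The Python code below handles only a made-up two-tag subset (`i` and `b`) with no attributes, comments or escapes; it is an illustration of the rule described here, not a reimplementation of PMLC, and all names in it are hypothetical.

```python
import string

NAME_CHARS = set(string.ascii_lowercase + "_")
SEPARATORS = " \t\r\n"
KNOWN_TAGS = {"i", "b"}  # toy subset, not the real PML tag list


def convert(source):
    """Convert the toy subset to HTML, dropping the separator after each tag."""
    html, _ = _parse(source, 0, top_level=True)
    return html


def _parse(src, pos, top_level=False):
    out = []
    while pos < len(src):
        ch = src[pos]
        if ch == "]" and not top_level:
            return "".join(out), pos + 1
        if ch == "[":
            end = pos + 1
            while end < len(src) and src[end] in NAME_CHARS:
                end += 1
            name = src[pos + 1:end]
            if name not in KNOWN_TAGS:
                raise ValueError(f"unknown tag {name!r}")
            if end < len(src) and src[end] in SEPARATORS:
                end += 1  # consume exactly one separator character
            elif end >= len(src) or src[end] not in "[]":
                raise ValueError(f"invalid character after tag {name!r}")
            inner, pos = _parse(src, end)
            out.append(f"<{name}>{inner}</{name}>")
        else:
            out.append(ch)
            pos += 1
    if not top_level:
        raise ValueError("unclosed node")
    return "".join(out), pos


print(convert("[b bold]"))
print(convert("[b  two spaces]"))
```

Only the first separator character is dropped; in the second call the remaining space stays part of the text.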

For example, suppose we want to write the word 'analyzing' in italics, with the letter 'z' in bold. This is the PML code:

[i analy[b z]ing]

And this is the HTML code generated by the PML converter (class attributes removed):

<i>analy<b>z</b>ing</i>

change text direction

Can currently be done like this (for HTML output):

[p html_style="direction: rtl;" right is not left]

HTML generated:

<p style="direction: rtl;">right is not left</p>

I am skeptical of undefined behaviours

Fully agree. Undefined behaviour can create real damage of all kinds. Tree structure validation must be added in a future PML version.

tajmone commented 3 years ago

The parser consumes the mandatory whitespace character after the tag's name

That's great. But I still want to test it with some real case examples, e.g. with RTL languages, to ensure that editors don't insert some extra unwanted characters when typing tags like this, e.g. some language or direction control characters due to the switch from an RTL main text to Western alphabet when inserting a node in the middle of a word; like your bold z example, but within an Arabic word.

I'll have to carry out these tests in another editor though, because Sublime Text supports RTL languages very poorly. Notepad++ does a better job, and I haven't tried VSCode. Ideally, these tests should be done in an editor designed with RTL support in mind.

A preliminary basic test (using Notepad++) seems to work OK (1st line the plain word; 2nd line the middle consonant is in bold):

    برمجة

    بر[b م]جة

(in GitHub preview the second line is broken, but in the resulting HTML after pmlc conversion it looks OK)

where the word with the bold consonant results in the following HTML:

<p class="pml-paragraph">بر<b class="pml-bold">م</b>جة</p>

but of course the editing experience is a bit messy when the cursor encounters a RTL language change, and insertions break the sentence.

My fear is that proper RTL editors might be inserting direction control character in the background here — after all, without them it would be very hard to work with the text.

I'm not sure of this, as it's been a long time since I edited similar texts; but I remember that in XP Arabic Edition there were similar problems when editing HTML documents in various apps, depending on the degree of support for RTL (or how Win Arabic intervened therein). Today there is no longer a separate Windows edition for RTL languages, as Win 10 natively supports them out of the box, so I'll have to do some tests on this.

Note on SGML Entities

As a side note, handling similar cases (i.e. an Arabic word in the middle of an English sentence) would be much easier by encoding the Arabic text to HTML entities:

&#x628;&#x631;&#x645;&#x62C;&#x629;

This would also lift most of the problems associated with limited RTL support in editors.

Conversion from Arabic to HTML entities can be easily done with an HTML entity encoder/decoder:

https://mothereff.in/html-entities
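The same encoding can also be reproduced locally. Here is a minimal Python sketch using only the standard library; `to_hex_entities` is a name chosen for this example, and the Arabic word is written with escapes so the snippet itself contains no RTL text.

```python
import html


def to_hex_entities(text):
    """Encode every character of text as a hexadecimal character reference."""
    return "".join(f"&#x{ord(ch):X};" for ch in text)


# The Arabic word from the example above, spelled out as code points.
word = "\u0628\u0631\u0645\u062c\u0629"

encoded = to_hex_entities(word)
print(encoded)

# Round trip: decoding the references gives back the original word.
print(html.unescape(encoded) == word)
```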

But in PML this would require wrapping the encoded text in a [verbatim node, which would then prevent styling it via inline nodes:


    Not allowed:

    [verbatim &#x628;&#x631;[b &#x645;]&#x62C;&#x629;]

So the above example can't be achieved this way.

tajmone commented 3 years ago

BiDi Sample Doc

I've added some preliminary samples of RTL usage in PML:

https://github.com/tajmone/pml-playground/blob/main/pml-samples/bidi-text.pml

So far, so good (but without a dedicated editor it's a real pain to edit source files with BiDi texts).

pml-lang commented 3 years ago

in PML this would require wrapping the encoded text in a [verbatim node

The new parser will also support the \uhhhh escape sequence. This allows you to insert any Unicode character in text

For example this:

Hell\u006f!

... is parsed as:

Hello!
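For editor tooling that wants to preview such escapes, the decoding step can be approximated as follows. This Python sketch assumes exactly four hexadecimal digits after `\u` and ignores any interaction with other escapes (such as an escaped backslash), which the real parser may treat differently; `decode_unicode_escapes` is a hypothetical helper name.

```python
import re

# Assumption: an escape is a backslash, 'u', and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(text):
    """Replace each \\uhhhh sequence with the character it denotes."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(decode_unicode_escapes(r"Hell\u006f!"))
```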

I've added some preliminary samples of RTL usage in PML:

Great. It's nice to see Arabic and Hebrew text used in PML.

tajmone commented 3 years ago

The new parser will also support the \uhhhh escape sequence.

That's a useful addition indeed!

Thin-Space Node?

I would suggest also adding a thin-space node, because I believe French uses them quite a lot to separate enclosing double quotes and chevrons from the quoted word(s).

Maybe something like [tsp] or [thsp]? E.g.

«[thsp]Guillemets[thsp]»

Although in Italian we don't have explicit rules for this, when using Guillemets good printers follow the French tradition and do add thin spaces.

Word-Joiners around Thin-Spaces?

The problem though is that these should also be non-breaking spaces, i.e. you don't want the text to wrap after an opening double chevron or before a closing one. So maybe the [thsp] node should also really emit a word-joiner before and after the thin space character, to prevent wrapping.

I do realize that the Unicode escapes could be used instead, but since it's a basic punctuation in French it might deserve a native node, just like there's [sp] for non-breaking spaces. Also, a dedicated node would allow controlling which extra characters are injected around the thin space (as mentioned above, word-joiners).

tajmone commented 3 years ago

Inconsistent Parser Behaviour

I've started implementing in Sublime PML the tag boundary pattern as you suggested ((?=[ \t\n\]])), but now some syntax tests fail for documents which are valid PML, e.g.:

    L1[nl[- comment -]]L2

    L1[b[nl[- comment -] bold]]L2

    L1[b[- comment -] bold ]L2

The above nodes [nl[- and [b[nl[- do build correctly without errors, and with the previous boundary RegEx \b they were all highlighted correctly; but with the suggested mandatory whitespace + ] the syntax breaks because these valid cases are no longer covered.
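The failure can be reproduced outside Sublime Text. The Python sketch below runs both boundary styles over the three sample lines; the two-tag alternation `(nl|b)` and the helper name are simplifications made for this example, not the actual Sublime PML patterns.

```python
import re

# Suggested boundary: the tag must be followed by space, tab, new line or ']'.
strict = re.compile(r"\[(nl|b)(?=[ \t\n\]])")

# Previous boundary: a plain word boundary after the tag.
relaxed = re.compile(r"\[(nl|b)\b")

samples = [
    "L1[nl[- comment -]]L2",
    "L1[b[nl[- comment -] bold]]L2",
    "L1[b[- comment -] bold ]L2",
]


def tags_found(pattern, source):
    """List the tag names the pattern recognizes in source."""
    return [m.group(1) for m in pattern.finditer(source)]


for sample in samples:
    print(sample, tags_found(relaxed, sample), tags_found(strict, sample))
```

In all three lines the tag is followed directly by `[`, so the strict lookahead finds nothing while the word boundary finds every tag.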

Now, I'm quite unsure what I should do. Should I keep a more relaxed syntax that correctly highlights nodes which might cause errors at build time, or should I instead enforce a stricter approach which doesn't highlight nodes which are valid PML?

IMO, it's better to keep the syntax as it was, and revert to using just a \b boundary, at the risk of highlighting as correct nodes which will cause an error (after all, the error is not the node tag, but the lack of the separator that follows). The problem here is that it's impossible to mimic PMLC's real behaviour if it's inconsistent across tags, and lacking a BNF reference.

A syntax should be a good approximation of how the converter sees a document, but the focus is on the author who's editing the document, not on mimicking the parser 100%. It doesn't need to cover all edge cases, especially if this introduces unnecessary complexity.

Bear in mind that before implementing any syntax element in Sub.PML I always do extensive local tests, trying to figure out each tag's behaviour in various contexts (line breaks, tabs, etc.). So far tests have been a better guide than documentation, because many aspects are not documented and not all tags behave the same.

The point here is that we're looking at the PML syntax from different angles, due to different needs: you're interested in proper parsing, whereas I'm more interested in a realistic highlighting of the syntax which needs to be real-time performant and easy to maintain.

So I'd better highlight [c\[admon] correctly in Sublime PML, even though I know it will raise an error; but I can compensate for this with snippets, auto-completions and keyboard shortcuts, by always inserting a space after the tag.

The above considerations hold true also for a PML Lang Server, where accuracy needs to be sacrificed for performance. Unlike PMLC, which acts on a source file which is immutable, syntax highlighters have to deal with authors editing the code all the time, each keystroke forcing a highlighter update.

pml-lang commented 3 years ago

A space must be inserted.

I'm very sorry, Tristano, because I wasn't precise enough in my previous comment. The space (i.e. [ \t\r\n]) is only required if the name is followed by text. It is not required if the name is followed by another node (e.g. [a[b]]), or a comment (e.g. [a[-c-]]), or (only with the new parser) an attribute list (e.g. [a(p=v)]).

Before continuing the discussion I would like to ask you: Couldn't you simply ignore the character after the name, and just apply the following regex for a valid name (as specified in the pXML BNF)?

[a-zA-Z_][a-zA-Z0-9_-.]*

That's also the rule applied in the new parser.

Applying this regex works correctly for your initial example: [c\[admon], because \ cannot be part of a name.

This rule is less likely to change in a future version, compared to the rule of what can follow a name. And it would probably be easier to change in the future (for example if later we really want to stick to the more complex rule for XML names).

If you use a \b boundary, it's ok now because PML currently uses only letters for tag names. But that could also change in the future.

tajmone commented 3 years ago

The space (i.e. [ \t\r\n]) is only required if the name is followed by text. It is not required if the name is followed by another node (e.g. [a[b]]), or a comment (e.g. [a[-c-]],

That's why the \b is currently the safest way to determine a tag end, it works with anything following the tag identifier.

or (only with the new parser) an attribute list (e.g. [a(p=v)]).

These attribute lists will be enclosed within round brackets?

Couldn't you simply ignore the character after the name,

The problem is that without the \b if a node shares the same starting substring of another it will lead to false positives — e.g. [span could be captured as [sp. So having a boundary defined is a safety check to prevent similar cases.
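A quick Python check of this false positive, using the two tags mentioned above (the pattern names are illustrative):

```python
import re

# Pattern for the [sp] node without and with a trailing word boundary.
no_boundary = re.compile(r"\[sp")
with_boundary = re.compile(r"\[sp\b")

for source in ["[sp]", "[span some text]"]:
    print(
        source,
        bool(no_boundary.match(source)),
        bool(with_boundary.match(source)),
    )
```

Without the boundary, `[span some text]` is reported as an `[sp` node; with it, only the real `[sp]` matches.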

and just apply the following regex for a valid name (as specified in the pXML BNF)?
[a-zA-Z_][a-zA-Z0-9_-.]*

If I've understood the above RegEx correctly, it introduces two new chars to ID names: - and . (the _-. might be interpreted as a char range in many engines, unless escaped: _\-\.). I need to test that carefully in Sublime PML, because ST uses two RegEx engines: a custom one, and Oniguruma as a fallback engine in case of unsupported RegEx features, which makes RegExs somewhat unpredictable in edge cases (but syntax tests do inform you if the fallback engine is being used, because it's slower).

The problem with the new pattern is that the introduction of the chars - and . in ID names will most likely make the \b unusable (if one of these chars occurs last in the ID), so it will become necessary to actually match against the possible following chars. Again, different editors might handle RegExs differently in this respect, depending on the engine they use and/or its settings.
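Both concerns can be checked in Python, with the caveat that this shows only how Python's `re` module behaves; other engines may accept or interpret the unescaped class differently. The tag names `item` and `item-` and the helper names are invented for this example.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


unescaped = r"[a-zA-Z_][a-zA-Z0-9_-.]*"   # as written: '_-.' reads as a range
escaped = r"[a-zA-Z_][a-zA-Z0-9_\-.]*"    # hyphen escaped, '.' is literal here

print(compiles(unescaped), compiles(escaped))


def matches_with_word_boundary(name, source):
    """Match '[' + name followed by a \\b word boundary at the start of source."""
    return re.match(r"\[" + re.escape(name) + r"\b", source) is not None


# A name ending in '-' is followed by a space: no word boundary there.
print(matches_with_word_boundary("item-", "[item- text]"))
# The shorter name 'item' matches the same source, a false positive.
print(matches_with_word_boundary("item", "[item- text]"))
```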

Are the - and . already in use with PMLC v1.4.0 or are they a planned feature?

This rule is less likely to change in a future version, compared to the rule of what can follow a name. And it would probably be easier to change in the future (for example if later we really want to stick to the more complex rule for XML names).

Indeed, conformance to XML naming convention for identifiers is desirable, especially when converting from HTML/XML-based formats to PML, so IDs don't undergo lossy renaming.

I guess eventually I'll have to adapt all the node tags RegEx to the new scheme.

pml-lang commented 3 years ago

These attribute lists will be enclosed within round brackets?

That's the pXML syntax. It is required in strict pXML, as explained here

Because PML version 2.0 will be based on pXML, attributes must be enclosed in parentheses if the PML code is written in strict pXML. However, the PML parser also applies lenient parsing, which allows (among other simplifications) the parentheses to be omitted. I am currently not sure yet if the new parser's lenient mode will work exactly as in the current parser. I will try to keep it compatible (because it makes the PML syntax more succinct), but it's one of the more challenging features to implement in PML 2.0. Lenient parsing will also need to be well documented, because it's important for editor plugin developers.

Here is an example of how lenient parsing will probably work, and make PML more succinct:

[ch ( title = "Chapter Title" id = myId ) // strict pXML syntax
[ch title = "Chapter Title" id = myId     // parenthesis can be omitted
[ch title = Chapter Title id = myId       // quotes can be omitted (not yet sure about this in version 2.0)
[ch Chapter Title id = myId               // the attribute name can be omitted for default attributes

it introduces two new chars to ID names: - and .

Yes, because pXML names must be compatible with XML names.

the introduction of the chars - and . in ID names will most likely make the \b unusable

Yes, that's why I suggested to just use the regex for names.

Are the - and . already in use with PMLC v1.4.0 or are they a planned feature?

No and no. Hence using \b would still work in version 2.0, but it is not guaranteed to work in later versions.

If you need to detect the end of a name, maybe you can still use \b or \W (a non-word character), but exclude _, -, and .. Or maybe you can use the regex [ \t\r\n\[\]\(\\], and adapt it later if the rule changes.
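One way to combine the name regex with the suggested terminator set is a lookahead, so the terminating character is checked but not consumed. This Python sketch is only an approximation for highlighting purposes: the hyphen in the name class is escaped, the `(?<!\\)` guard against escaped brackets is borrowed from the Sublime PML pattern quoted at the top of the thread, and `tag_names` is a hypothetical helper.

```python
import re

NAME = r"[a-zA-Z_][a-zA-Z0-9_\-.]*"
TERMINATOR = r"[ \t\r\n\[\]\(\\]"

# An opening bracket that is not escaped, a name, then a terminator (not consumed).
OPEN_TAG = re.compile(r"(?<!\\)\[(" + NAME + r")(?=" + TERMINATOR + r")")


def tag_names(source):
    """Return the names of all opening tags recognized in source."""
    return [m.group(1) for m in OPEN_TAG.finditer(source)]


for sample in [
    "An [c\\[admon] block",
    "[span some text]",
    "L1[b[nl[- comment -] bold]]L2",
    "[a(p=v)]",
    "[name*foo bar]",
]:
    print(sample, tag_names(sample))
```

If the rule about what may follow a name changes, only `TERMINATOR` needs to be adapted.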

pml-lang commented 3 years ago

adding a thin-space node

That would be a nice addition for some people. However, I am a bit hesitant to add this as a standard node, for the following reasons:

If we start adding country-specific nodes that are useful in some regions, then other people will probably submit similar requests, which might lead to a proliferation of nodes not used by the majority of users, and add complexity to PML. There is of course always a grey-zone, and if many people request the same feature, it should certainly be added.
Version 2.0 will allow users to define constants with Unicode escape sequences. Hence, a user who needs thsp (thin space with word-joiner before and after) could define a constant:
```
[const thsp = "\u2060\u2009\u2060"]    // 2060: Unicode word joiner; 2009: thin space
```
Even simpler, he/she could use the Unicode NARROW NO-BREAK SPACE:
```
[const thsp = "\u202F"]
```
Then guillemets could be used like this:
```
«[=thsp]Bonjour mon ami.[=thsp]»
```
Note: The final syntax might not be [=thsp]

If the constant is defined as a parameter in a config file, then the config file could be stored in a shared directory, so that thsp is available in all PML documents on the user's machine or network.
A future PML version will allow users to define their own tags (i.e. the tag name, attributes, and how to render it in HTML). A new tag definition will be located in a single file stored in a dedicated directory that can contain any number of additional customized tags. As for config files, the parser will read a chained set of directories, for maximum flexibility.
Another idea (for later) would be to provide optional country-specific PML extensions that can easily be enabled/disabled, depending on the user's needs.
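For reference, the code points used in the two constant definitions suggested earlier in this comment can be checked against the Unicode character database. This Python sketch only inspects the characters; it says nothing about how PML will render them, and the variable and function names are made up for the example.

```python
import unicodedata

# Values taken from the two constant definitions above.
thsp_joined = "\u2060\u2009\u2060"
thsp_narrow = "\u202f"


def describe(text):
    """Return 'U+XXXX NAME' for every character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(describe(thsp_joined))
print(describe(thsp_narrow))

# What the guillemets example would contain after substitution.
sample = "\u00ab" + thsp_narrow + "Bonjour mon ami." + thsp_narrow + "\u00bb"
print(len(sample))
```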

Note: To keep discussions focused and limited in size, I suggest that in the future we create new discussions/issues for subjects that deserve a new, separate entry.

pml-lang commented 3 years ago

Version 2.0 will allow users to define constants with Unicode escape sequences.

Unicode escape sequences are now available in version 2.0.0.

A future PML version will allow users to define their own tags

Info: I'm currently working on that feature.