tajmone / polygen-docs

PolyGen Documentation

Implicit / missing details about label selection #29

Open RBastianini opened 4 years ago

RBastianini commented 4 years ago

In both the English and Italian versions of the PML documentation, there are some missing / implicit details regarding a few behaviours around label selection. They might not be of importance, but since I was surprised when I discovered them, I thought they were worth bringing to your attention.

A consequence of some of these findings is that the formal language definition (§5.1) is incorrect / incomplete.

Label group concatenation

§2.5.2 shows how multiple label selection groups can be concatenated to reduce the verbosity of the grammar definition, as in S ::= Conjug.(S|P).(sp|pp) ;. However, I believe this contradicts the concrete syntax in §5.1, where an atom is defined as follows:

ATOM   ::= Term
        |  "^"
        |  "_"
        |  "\"
        |  UNFOLDABLE
        |  ">" UNFOLDABLE
        |  "<" UNFOLDABLE
        |  ATOM "."
        |  ATOM DotLabel
        |  ATOM ".(" LABELS ")"

and thus I think the correct definition should instead be

ATOM   ::= Term
        |  "^"
        |  "_"
        |  "\"
        |  UNFOLDABLE
        |  ">" UNFOLDABLE
        |  "<" UNFOLDABLE
        |  ATOM "."
        |  ATOM DotLabel
        |  ATOM (".(" LABELS ")")+
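
For reference, here is a minimal sketch of the behaviour in question (non-terminal and label names are hypothetical, modelled on the Conjug example from §2.5.2):

S ::= Word.(a|b).(x|y) ;
Word ::= a: W1 | b: W2 ;
W1 ::= x: alpha | y: beta ;
W2 ::= x: gamma | y: delta ;

Each group activates one of its labels, so the first group drives the choice between W1 and W2, and the second drives the choice inside whichever was picked, yielding exactly one of alpha, beta, gamma or delta.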

In order to implement compatibility with this syntax in Polygen-PHP (which I wrote according to the abstract and concrete language definitions from the readme on the official site and thus, did not support it at first), I resorted to adding this extra abstract-to-concrete-syntax conversion step (I think you might at most be interested in the docblock at the beginning of the file, where the conversion step is described through an example). Depending on how this was originally implemented in Polygen, it might also mean that there exists an extra undocumented conversion step, and thus that §5.5 needs amending as well. However I might be completely wrong on this part, as support for this syntax might be embedded in the parser for the concrete definition. I remember close to nothing about my OCaml days at the university, so I couldn't figure this out on my own.
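
For what it's worth, my own reading of that conversion (which may well differ from what the linked file actually does) is that each additional selection group can be treated as applying to a parenthesized sub-expression wrapping everything to its left:

S ::= Conjug.(S|P).(sp|pp) ;
(* can be read as *)
S ::= (Conjug.(S|P)).(sp|pp) ;

Since a parenthesized sub-expression is itself an UNFOLDABLE, and hence an ATOM, the wrapped form is already covered by the original ATOM ".(" LABELS ")" production.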

Single label concatenation

Although it might be considered somewhat implicit in §2.5.2, multiple label selections can be concatenated not only in groups (S ::= Something.(l1|l2).(l3|l4);) but also taken singly (S ::= Something.l1.l2.l3.l4;). Both declarations are correctly accepted by Polygen and influence label selection as expected. I believe this means that the concrete syntax for ATOM is again partially incorrect and should be amended as follows:

ATOM   ::= Term
        |  "^"
        |  "_"
        |  "\"
        |  UNFOLDABLE
        |  ">" UNFOLDABLE
        |  "<" UNFOLDABLE
        |  ATOM "."
        |  ATOM (DotLabel)+
        |  ATOM (".(" LABELS ")")+
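
A minimal sketch of a grammar where chained single labels drive the selection (hypothetical names; in effect equivalent to Word.(a).(x)):

S ::= Word.a.x ;
Word ::= a: W1 | b: W2 ;
W1 ::= x: alpha | y: beta ;
W2 ::= x: gamma | y: delta ;

Only alpha can ever be produced here, since both a and x end up active.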

Dot concatenation

For completeness' sake, although not particularly useful, it is possible to concatenate multiple dots: S ::= Something.....;. This does not change the behaviour of the dot operator, and would still result in the same output whether one or one thousand dots were employed. Again, this requires updating the ATOM concrete definition:

ATOM   ::= Term
        |  "^"
        |  "_"
        |  "\"
        |  UNFOLDABLE
        |  ">" UNFOLDABLE
        |  "<" UNFOLDABLE
        |  ATOM "."+
        |  ATOM DotLabel+
        |  ATOM (".(" LABELS ")")+

Resetting on non-(non-terminals)

Another oddity I discovered is that the label reset token can be employed not only on non-terminals but also on terminals, although there it does not seem to affect the selection in any way. In order to prove this for Polygen-PHP I wrote this test, where the following grammar demonstrating this property can be found.

S ::= A.a;
A ::= a: a B and. C;
B ::= a: b | c: nope;
(* When the time comes to generate C, no suitable productions will be found given the currently selected labels. *)
C ::= g: c;

This grammar is correctly parsed by Polygen, but can only produce one result, which is and b and.

To be fair, although I don't think this is explicitly stated in the documentation, it is correctly represented by the ATOM definition (which also tells us that we can use the selection reset token on ^, \ and _, which similarly does not affect the label scope).

Selecting on non-(non-terminals)

Everything in the previous section also applies when using labels (as correctly reported by the TERM definition). So both S ::= Something and.also.this; and S ::= Something and.(this|that); are acceptable declarations, but the selected labels don't affect the generation in either of the examples.

Mixing labels, groups and dots

Another interesting discovery I made is that dot labels, label groups and label reset tokens can be mixed when selecting, as in the following declaration: S ::= Something.and..(notice|the|double|dots).before.the.round.braces.(and|at|the).end..;. This once again means that the ATOM definition is incorrect. I'm unsure about how to fix this, but I believe this could work:

ATOM   ::= Term
        |  "^"
        |  "_"
        |  "\"
        |  UNFOLDABLE
        |  ">" UNFOLDABLE
        |  "<" UNFOLDABLE
        |  ATOM SELCTN+

SELCTN ::= "."
        |  DotLabel
        |  ".(" LABELS ")"

Label precedence

A consequence of the previous discovery is that we can mix label selection and label reset, so I got curious about the precedence of these label operations during parsing. It turns out that in Polygen, label operations are processed from right to left, as demonstrated by this test in Polygen-PHP. The test uses the following grammar

S ::= Generate.one.a..two.b Generate.two.a..one.b Generate.one.b..two.a Generate.two.b..one.a;
Generate ::= a: A | b: B;
A ::= one: a | two: aa;
B ::= one: b | two: bb;

that can only produce the following output: a aa b bb, basically ignoring every selection that comes to the right of the label reset token.
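
To make the right-to-left reading concrete, this is how the first atom can be traced (my own annotation of the grammar above):

(* Generate.one.a..two.b, processed right to left: *)
(*   .b, .two -> activate b and two                *)
(*   ..       -> reset: b and two are discarded    *)
(*   .a, .one -> activate a and one                *)
(* Only {one, a} remain active, so Generate picks  *)
(* a: A, and A picks one: a, printing "a".         *)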

tajmone commented 4 years ago

I also agree that labels in general deserve deeper documentation, because they're a powerful feature of Polygen which I struggled to master due to the lack of examples in the docs — I never realized you could do some of the things in the examples you've brought forth.

Adding more and better examples could reduce the learning curve significantly. Although formal definitions of PML features are important, IMO they are easier to grasp through examples — i.e. examples allow the reader to make sense of the rules, much more than the other way around.

I suggest that if we amend the PML concrete syntax in §5.1 then we should also bump the MINOR version number in the next release (i.e. from 1.1.0 to 1.2.0 instead of 1.1.1). Although the repository doesn't currently provide a formal definition of how the versioning scheme is applied, I try to follow the SemVer guidelines — except that instead of code we're focusing on the PML Spec document here. I should add some notes about how the repo versioning scheme and the PML Spec version and Edition schemes work (see #30).

alvisespano commented 4 years ago

Labels are an advanced feature: most Polygen authors in the past struggled to understand labels and to use them proficiently. A different approach to how the documentation discusses the label system is probably needed. If @tajmone has already put some work into this, he could start improving the documentation and I may help.

A brief historical digression: I have never been fully satisfied with the label system: the main feature of Polygen 2.0 (which has been under development for years - lazy me :D) is indeed a totally redesigned label system -- together with the import primitive, for enabling libraries of non-terminal symbols. The new label system, though, has a major disadvantage - which is also a major reason for not having released it yet: it is syntactically incompatible with the current one. This means that all current grammars would be rejected by the new parser and need a few modifications; some might even need a big refactoring, because the new system is stricter. This strictness is due to a full-featured type system that statically checks labels, hence some old grammars may exploit pitfalls of the current label system which happen to work one way or another but are actually wrong; the label checker would reject those.

As for @tajmone's discovery regarding the supposedly wrong parsing rules, that is a wanted feature. I know it's unsound -- and that's partly why I have always felt the label system needed an improvement -- but that way selection becomes powerful, albeit hardly predictable. The problem is: the syntax of labels is ugly. It resembles the flavour of some imperative languages out there, where language items are designed more for practical use than for elegance. And this often turns out to be a choice that enables unwanted behaviours and misuses.

Anyway, appending mixed label atoms (where a label atom is basically the SELCTN non-terminal) is an AND operation over label activation, while the | stands for OR. Groups occur only for the OR operation, which is a very poorly designed syntax, as it doesn't allow free parenthesization. Labels should have an expression-like term, like a sub-language capable of expressing predicates such as (L1|(L2&L3)|L4&L5) with full-featured binary operators and associativity. And clearing the label environment should be a keyword of its own, rather than the dot. Also, label activation should not resemble the common selection syntax (a.k.a. the "dotted" syntax): it should be a stand-alone notation that does not imply right recursion. There are advantages in the dotted syntax, though: it is concise and simple; plus, activated labels catch the eye pretty easily when reviewing or writing code.
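
To sketch the distinction (non-terminal and label names here are just placeholders):

Chained ::= X.l1.l2 ;    (* chained atoms: l1 AND l2 both become active *)
Grouped ::= X.(l1|l2) ;  (* group: one of l1 OR l2 gets activated *)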

For the time being, however, I'm not sure the grammar fix @tajmone suggests is totally desirable. One easy way to find out whether it is or not is to apply the fix and compile: if the parser semantic actions (the code annotated on the right for each production) still compiles, then the grammar fix is compatible with the current AST data structures, meaning that the program will work.

tajmone commented 4 years ago

I'm not sure the grammar fix @tajmone suggests is totally desirable.

Just a clarification: it's @RBastianini who spotted the problem and suggested the fix (credit where it's due). Also, I've never quite grasped the full extent of label usage (although I did use them in some simple ways), so I'm not sure about their concrete syntax.

But @RBastianini's proposed fix (ATOM (".(" LABELS ")")+) seems to make sense to me, since selection groups are not constrained to a single occurrence — unless the LABELS definition is already handling that elsewhere in the grammar.

One easy way to find out whether it is or not is to apply the fix and compile: if the parser semantic actions (the code annotated on the right for each production) still compiles, then the grammar fix is compatible with the current AST data structures, meaning that the program will work.

Mhh... this seems a rather complex way to approach it (especially with all the trouble of compiling OCaml code under Windows).

Also, I think that the case at hand has more to do with Polygen's BNF meta-grammar, as presented to the reader as a clarifying tool with which he/she might double-check the learned notions. This BNF doesn't necessarily match an actual grammar used by a parser generator (which might have to handle some subtleties that are irrelevant in this context).

Externalizing Examples' Code and Result via Live Polygen Executions

Another approach I've actually thought of, although it tackles the problem from a slightly different angle, would be to externalize all of the examples' code and productions into real sources and transcripts, execute them via Polygen, and then include them selectively in the source documents.

Lately, I've actually been using a similar approach for a project documenting a text adventure (IF) library, where I was facing problems with the constant updates to the library code, as well as to the IF language itself, which would make examples and their output obsolete from time to time. What I did was to move all code examples into real text-adventure sources, compile them, run them against automated command scripts, and capture the game session transcripts, which are then imported into the source documents at build time.

This now allows me to catch any broken example (due to library updates) from the compile error reports monitored by the build toolchain, and also to ensure that game session transcripts match the output of a real use-case scenario.

The project in question doesn't use pandoc but Asciidoctor, which makes the whole process simpler thanks to selective text inclusion via the include:: preprocessor directive and tags for marking specific regions of text. But thanks to PP (which we're already using) it should also be doable with pandoc.

I have toyed with the idea of switching to this system at some point in the future, even if it would mean moving to Asciidoctor, but have refrained due to some considerations:

  1. The productions of many examples were manually designed to show all (or most of) the possible outcomes, something which is hard to achieve via Polygen due to randomness and the need to use specific seeds to ensure the same results each time (i.e. one would have to try dozens of seeds to generate all the desired results). In such cases, the production text might be kept internal to the document, and only the source externalized for the sake of checking that the example really compiles without errors.
  2. All this work might be too much overhead if Polygen isn't actively developed (feature-wise), so it might just be simpler to manually check all the examples for the time being.
  3. Switching to Asciidoctor adds more burden on end users who wish to build the docs themselves, due to having to install the Ruby language, Asciidoctor and all its dependencies; whereas the current system depends entirely on stand-alone binary tools, which are downloaded by our custom script (and won't conflict with newer versions).

Anyhow, this is an idea that has been on my mind for a while and that I wanted to share, so I just grabbed this opportunity to put it forward. Keep it in mind, in case it might be helpful in the future (for the PML Spec or any other documents that might be added, e.g. tutorials, etc.).

RBastianini commented 4 years ago

Also, I think that the case at hand has more to do with Polygen's BNF meta-grammar, as presented to the reader as a clarifying tool with which he/she might double-check the learned notions. This BNF doesn't necessarily match an actual grammar used by a parser generator (which might have to handle some subtleties that are irrelevant in this context).

Yes, you are correct; I was actually referring to the concrete and abstract notation paragraphs of the documentation.

There were a few changes I suggested in addition to the one you mention (ATOM (".(" LABELS ")")+), and maybe @alvisespano has some reservations about the SELCTN symbol I added. I don't remember very much about BNF, so it might not be formally correct. If we are unsure about SELCTN, we can review its inclusion at a later time: two of the other three changes are more useful to the reader, since they document label selection behaviours that can actually be used to influence the production, while the remaining one is more an oddity than anything of actual practical use. Although the addition of SELCTN makes (at least in my intention) the concrete / abstract syntax closer to what is actually parsed by Polygen, I believe that the other changes alone are more meaningful to the reader. Just for clarity, these are the two I'm referring to: ATOM DotLabel to ATOM DotLabel+ and ATOM ".(" LABELS ")" to ATOM (".(" LABELS ")")+.

About externalising the documentation's code examples to run them through the Polygen parser, I believe it's actually a great idea; but since, as you noted, using Polygen to generate the example productions would be impractical, it could just be used to check the correctness of the example sources, so I don't know whether this alone would justify switching to a different documentation tool...

tajmone commented 4 years ago

@RBastianini, I'm trying to figure out the syntax fixes you propose by studying the current EBNF grammars in App. 5 (§5.1 and §5.2). I tend to get lost when following the entanglement of all the possible ways the grammar definitions might branch out ...

I'm still not quite sure whether these proposed fixes are already implicit in the current EBNF grammars or not. I really need to find a few hours' window in which I can look into this with due time, a relaxed mind and no distractions — but these are surely improvements that will have to go into a future update.

Some suggestions might be syntactically correct but bring little benefit to the reader — e.g. changing | ATOM "." to | ATOM "."+ might just add confusion. Ultimately, these grammars are there merely as references for the benefit of the reader, who is not a parser and therefore might not really care about edge cases like S ::= Something.....; (which, as you said, don't have any detrimental effect on Polygen execution either). It's a bit like the case of multiple slashes in URLs, which are just ignored: usually there's no need to mention this when explaining how URLs are formatted, unlike in a tutorial on how to parse URLs.

Editing "§5.5 Translation rules" would require careful consideration and extreme caution, for that section is rather complex (not to mention how many times I had to double-check it to ensure that the correct styles and colors were being reproduced).

Externalizing Examples' Code...

About externalising the documentation's code examples to run them through the Polygen parser, I believe it's actually a great idea; but since, as you noted, using Polygen to generate the example productions would be impractical, it could just be used to check the correctness of the example sources, so I don't know whether this alone would justify switching to a different documentation tool...

The question of whether it's worthwhile (or justified) to switch to Asciidoctor is a bit complex, for various reasons. The main problem is that currently Asciidoctor only supports inclusion of external code from UTF-8 encoded files, which means that Polygen sources (and output) would first have to be converted from ISO-8859-1 to UTF-8 via tools like iconv — over a year ago I submitted a request to extend Asciidoctor's include:: directive to allow specifying other file encodings, and the feature was implemented right away and will be available with the next Asciidoctor release; the problem is that there hasn't been a new release since then, because the upcoming update is going to be a MAJOR revision, so I'm still waiting for it.

Another problem is the need to be able to include selective snippets from a source file, because many of the examples do not define the S start symbol, but are snippets of a larger "virtual" example — of course, one might invoke Polygen specifying a different start symbol, but some examples might still need a broader context to be usable.

Asciidoctor allows marking regions of text to be imported by using comment lines that set tags at the beginning and end of each region, which makes it very easy to pack multiple examples from the Spec into a single large source, and then extract single non-terminal definitions as required.
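
As a sketch of what this could look like for a Polygen source (file and tag names below are hypothetical), using Asciidoctor's circumfix-comment tags:

(* tag::conjug[] *)
S ::= Conjug.(S|P).(sp|pp) ;
(* end::conjug[] *)

and then, in the AsciiDoc document, only that region would be pulled in via:

include::examples.grm[tag=conjug]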

With pandoc, on the other hand, we need to rely on PP for a similar feature; PP does provide a native include-like macro, and also allows defining custom macros that interface with the Shell/CMD and/or invoke custom binaries or scripts. But this also means that in order to exploit Asciidoctor-style tags for marking regions of text for inclusion we'd have to create our own dedicated tool, which would probably not be as sophisticated as Asciidoctor's tag system (which is very powerful).

So, on the one hand I'm tempted to switch from pandoc to Asciidoctor, but I would definitely wait for the next Asciidoctor release, which will allow including contents from non-UTF-8 files.

As for externalizing examples in order to check their correctness using Polygen, I think it would be a worthwhile effort only if:

  1. Polygen was being actively developed (i.e. new features added), or
  2. we start adding more documents, like tutorials (this repo isn't limited to PML Spec docs, although they are the only ones currently available).

Surely, both the switch to Asciidoctor and the externalization of code are (and will remain) on my mind; I'm just waiting to see how things evolve.

As a general rule I always tend to use Asciidoctor for my documentation projects, but for this repo I picked pandoc for various reasons: simplicity of toolchain setup, ease of extending pandoc's limits via PP, higher control over document templates (compared to Asciidoctor), and because it would make it easier to create a GitHub Pages website in the future (i.e. adding navigation menus to the documents via templates).

Weighing the pros and cons of pandoc vs Asciidoctor is not easy; they are both exceptional tools which shine in their own right for the tasks they were designed for.