Open tajmone opened 2 years ago
This bug appears whenever a node has more than one attribute starting with `html_`, e.g. `[div (html_a=a html_b=b)]`.
The bug was located in pp-libs (not in PMLC), and has been fixed in the `develop` branch.
The fix will be included in the next public PMLC version.
Using `html_start=3` doesn't currently work, because PMLC doesn't use an HTML `ol` tag. It always uses a `ul` tag. However, the `start` attribute is only supported in the HTML `ol` tag (ordered list with numbers).
I suggest adding a `bullet` attribute to the PML `list` node. If it is set to a number, an HTML `ol` tag will be used and the numbering will start at the specified number (e.g. `bullet=1` creates a numbered list starting at 1).
Then your example code:
[list (html_style="list-style-type:decimal" html_start=3)
... would be written like this:
[list (bullet=3)
... and the list would indeed start counting at 3.
Other values could also be supported (e.g. `none`, `text x`, `file images/star.png`, etc.).
Besides being more convenient, a dedicated PML attribute is also more portable than using `html_xxx` attributes.
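The mapping suggested above can be illustrated with a minimal Python sketch. It is not PMLC code: `render_list` and its `bullet` parameter are hypothetical names that only mirror the proposal, under the assumption that a numeric value selects an `ol` tag with a `start` attribute and that anything else falls back to the current `ul` output.

```python
from html import escape


def render_list(items, bullet=None):
    """Render items as an HTML list (hypothetical helper, not part of PMLC).

    bullet=None   -> bulleted list, rendered as <ul> (the current default)
    bullet="3"    -> ordered list, rendered as <ol start="3">
    """
    body = "".join(f"<li>{escape(item)}</li>" for item in items)
    # Assumption: only plain decimal values switch the list to <ol>.
    if bullet is not None and str(bullet).isdecimal():
        return f'<ol start="{int(bullet)}">{body}</ol>'
    return f"<ul>{body}</ul>"
```

With this sketch, `render_list(["a", "b"], bullet=3)` produces an `ol` that starts counting at 3, while `render_list(["a", "b"])` still produces a `ul`.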
The bug was located in pp-libs (not in PMLC), and has been fixed in the `develop` branch.
Are the pp-libs a left-over from PPL? or just some custom helper libraries?
The fix will be included in the next public PMLC version.
Looking forward to it. During the weekend I've been updating the pandoc to PML converter, but had to suppress some new features due to the bug.
This bug appears whenever a node has more than one attribute starting with `html_`
Then I'll need to wait for the next version to keep working on the pandoc converter, since the conversion relies a lot on `html_*` attributes (especially for freshly implemented elements, where it's easier to just pass all attributes that way, rather than sifting through them one by one and converting them to their dedicated PML counterparts).
Using `html_start=3` doesn't work currently, because PMLC doesn't use an HTML `ol` tag. It always uses a `ul` tag. However, the `start` attribute is only supported in the HTML `ol` tag (ordered list with numbers).
That's quite inconvenient. But I thought that PML would simply convert the `html_` attributes blindly, as they are, which in this case, for the pandoc converter, is just fine, even if it doesn't work (i.e. at least the end user gets an idea of the original doc context).
I suggest adding a `bullet` attribute to the PML `list` node. If it is set to a number, an HTML `ol` tag will be used and the numbering will start at the specified number (e.g. `bullet=1` creates a numbered list starting at 1).
I'm not quite convinced. Lists are my least favorite feature in PML, because they are impractical to use and very verbose; adding another attribute is going to make them less appealing, IMO.
Lists are one of the most frequently used features in documentation, so from an editor perspective they need to be concise, easy to use and (most important) easy to read without eye strain.
Since the above change would already be a backward breaking change (default lists would switch from being ordered to bulleted) you might just as well take the opportunity and revise the lists syntax entirely, to make them friendlier.
IMO, lists are so important and ubiquitous that each type deserves its own node, just like in HTML, where you have `<ol>` vs `<ul>`, and in most lightweight markup syntaxes, which provide a different formatting syntax for each list type (ordered, bulleted, definitions, Q&A, example, and task lists, each with its own syntax to accommodate specific needs).
Bear in mind that there are many other types of lists that still need to be implemented, and I'm not sure you'll be able to preserve a single `[list` node to cover them all via attributes. But even if you did manage, the problem remains that, from the editor's perspective, it's also important to grasp immediately, just by looking at the source document, what the list will look like in the final document, in order to be able to edit the text smoothly.
... would be written like this:
[list (bullet=3)
I would expect the `bullet` attribute to describe the bullet type (disc, square, etc.), not the start number. This might be quite confusing, and it deviates from well-established editorial conventions and argot.
There's also a semantic problem here: if the attribute indicates the bullet, then how would you handle Roman numeral start numbers? Would it be `bullet=iii`? In HTML you don't get this confusion, since the bullet type, the numeral system and the start number are each defined separately, whereas the generic `bullet` attribute you're proposing seems to suggest that all those different aspects converge into a single definition here: the list is either bulleted or numbered, and in the latter case the attribute can also establish the starting number.
I can see the benefits of `bullet=3` capturing at once the definition of the list being (1) ordered, and (2) starting from number three; but this would only make sense if you were to support all the numeral systems via autodetection. These also include the Greek (lower and upper) letters, which would mean that they would have to be literally represented by their Greek Unicode glyphs (as opposed to `greek-lower`), which are usually not available in monospace fonts, hence resulting in gibberish in the terminal, affecting version control and other tools.
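To make the autodetection concern concrete, here is a small Python sketch of what guessing the numeral system from a single value might involve. Everything in it is an assumption made for illustration: the function names are hypothetical, and only decimal, lowercase Roman and single-letter alphabetic values are considered.

```python
import re

# Lowercase Roman numerals up to the thousands; the lookahead rejects "".
ROMAN = re.compile(
    r"^(?=[ivxlcdm])m*(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$"
)
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(text):
    """Convert a valid lowercase Roman numeral to an integer."""
    total = 0
    for char, following in zip(text, text[1:] + " "):
        value = ROMAN_VALUES[char]
        # A smaller symbol placed before a larger one is subtracted (iv = 4).
        total += -value if ROMAN_VALUES.get(following, 0) > value else value
    return total


def possible_readings(value):
    """Return every (numeral system, start number) a bullet value could mean."""
    text = value.lower()
    readings = []
    if text.isdecimal():
        readings.append(("decimal", int(text)))
    if ROMAN.match(text):
        readings.append(("roman", roman_to_int(text)))
    if len(text) == 1 and "a" <= text <= "z":
        readings.append(("alpha", ord(text) - ord("a") + 1))
    return readings
```

A value such as `3` or `iii` has a single reading, but `c` can be read either as Roman 100 or as the third letter of the alphabet, so a single attribute would need an extra rule to break the tie.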
I think it's worth considering a different approach altogether for lists, bearing in mind their central role in documentation, and the different types of lists that might need to be supported in the future.
Are the pp-libs a left-over from PPL?
No
or just some custom helper libraries?
pp-libs are general purpose libraries used in PDML, PML, and other projects. They were created at the same time the PDML Java parser was written.
I thought that PML would simply convert the html_ attributes blindly, as they are
Yes, that's how it works.
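As a rough illustration of such a blind pass-through (a sketch only, not the actual pp-libs or PMLC implementation), the Python function below strips the `html_` prefix and emits everything else unchanged. The function name and the dictionary-based attribute model are assumptions.

```python
from html import escape

HTML_PREFIX = "html_"


def passthrough_html_attributes(attributes):
    """Copy every `html_`-prefixed attribute to the output, without validation.

    Hypothetical helper: `attributes` maps attribute names to string values.
    """
    parts = []
    for name, value in attributes.items():
        if name.startswith(HTML_PREFIX):
            html_name = name[len(HTML_PREFIX):]
            parts.append(f'{html_name}="{escape(value, quote=True)}"')
    return " ".join(parts)
```

Under this model, `html_start=3` is copied onto whatever tag the node produces. Since lists currently produce a `ul` tag, the `start` attribute is present in the output but has no visible effect.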
Lists are my least favorite feature in PML, because they are impractical to use ...
Could you please give an example of what you mean by "impractical". And maybe also provide concrete suggestions for improvement.
... and very verbose
Yes, PML lists are more verbose than lists in Markdown and AsciiDoc, but they are also less verbose than in HTML. Moreover, as discussed already, PML lists can have content of arbitrary complexity (like HTML lists), which is IMO a big advantage (that comes at the price of a more verbose syntax).
The verbosity could be mitigated for simple lists if we provide a 'simple list' variant (similar to `table` and `table_data`). Then, instead of writing
[list
[el item 1]
[el item 2]
[el item 3]
]
... you could simply write:
[slist
item 1
item 2
item 3
]
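The relationship between the two forms can be sketched in Python as a plain text expansion. This is only an illustration of the idea: `slist` is a proposed node, `expand_slist` is a hypothetical name, and the sketch assumes one item per line with no nested nodes or brackets inside the items.

```python
def expand_slist(source):
    """Expand a `[slist ... ]` block into the equivalent `[list [el ...] ]` block.

    Assumption: each non-empty line between `[slist` and `]` is one plain-text item.
    """
    lines = [line.strip() for line in source.strip().splitlines()]
    if not lines or lines[0] != "[slist" or lines[-1] != "]":
        raise ValueError("expected a block starting with '[slist' and ending with ']'")
    items = [line for line in lines[1:-1] if line]
    return "\n".join(["[list"] + [f"[el {item}]" for item in items] + ["]"])
```

Items that themselves contain `[` or `]` would need escaping or real parsing, which this sketch deliberately leaves out.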
Since the above change would already be a backward breaking change (default lists would switch from being ordered to bulleted)
No, default lists would still be bulleted.
IMO, lists are so important and ubiquitous that each type deserves its own node, just like in HTML, where you have `<ol>` vs `<ul>`, and in most lightweight markup syntaxes, which provide a different formatting syntax for each list type (ordered, bulleted, definitions, Q&A, example, and task lists, each with its own syntax to accommodate specific needs).
More specific list nodes can easily be added. The challenge is to specify them (node names and their attributes) in a way that makes it convenient to read and write them. I suggest creating a dedicated discussion for concrete suggestions.
Lists are my least favorite feature in PML, because they are impractical to use ...
Could you please give an example of what you mean by "impractical". And maybe also provide concrete suggestions for improvement.
Anything other than a bullet-list requires additional attributes, which are verbose and affect the visual alignment of list elements, making them less WYSIWYG. E.g. if I need to control the start number of an ordered list, or its numeral symbols, I need to specify these via attributes:
[list (html_style="list-style-type:lower-roman" html_start=3)
[el Lorem one.]
[el Lorem two.]
]
If you compare it to pandoc markdown you can clearly see how the latter is more visually direct:
3. Lorem one.
@. Lorem two.
While these examples seem trivial, you have to consider that some documents deal with very long lists (software licenses are an example) where numbered lists carry on for many pages and contain multiple nested lists (ordered or bulleted). In such cases, with PML it becomes hard to track which element number you're dealing with, since the `[el` tag doesn't carry such implicit information, whereas pandoc markdown and AsciiDoc both optionally allow you to do so:
748. Lorem.
* Lorem ipsum.
* Lorem ipsum.
i. Ipsum.
ii. Ipsum.
iii. Ipsum.
iv. Ipsum.
749. Lorem.
For an editor, being able to quickly find a specific list element (by its number) is crucial when revising and proofreading text.
But the same principle will apply to other list types, so in the case of task lists we'll need attributes to handle the cross/tick status icon too, which will probably look something like this:
[list (type=task)
[el (status=done) Buy cheese.]
[el (status=tbd) Buy onions.]
]
which is not as visually intuitive as its markdown counterpart:
- [x] Buy cheese.
- [ ] Buy onions.
Visual considerations aside, the PML lists in the above examples are also less practical to edit due to the presence of attribute groups on the elements, which make multi-cursor editing harder due to alignment differences, and also hamper keyboard navigation to the start of a list element's contents. When editing lists, it's quite common to apply document-wide changes via regexes, and the presence (or absence) of attribute groups between the opening list element tag and its contents adds complexity to the task.
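The regex point can be illustrated with a short Python sketch. The pattern below is an assumption about what an editor might write by hand to reach the contents of each `[el` node, whether or not an attribute group precedes them; it is not a PML parser.

```python
import re

# `[el`, then an optional `( ... )` attribute group, then the element contents.
# Simplifications: no `)` inside attribute values, no nested nodes in the contents.
EL_WITH_OPTIONAL_ATTRS = re.compile(
    r"\[el(?:\s*\((?P<attrs>[^)]*)\))?\s+(?P<content>[^\]]*)"
)


def element_contents(source):
    """Return the text content of every `[el` node found in `source`."""
    return [
        match.group("content").strip()
        for match in EL_WITH_OPTIONAL_ATTRS.finditer(source)
    ]
```

Even this simplified pattern has to treat the attribute group as optional, and it misreads contents that legitimately begin with a parenthesis, such as `[el (see above) text]`. The markdown equivalent only needs to match a fixed marker at the start of a line.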
These are the reasons why (in another post) I mentioned that it would make sense to deviate from the standard syntax notation for some elements, like lists. Having already established that this is not possible, all we can do then is look for either some syntactic sugar, or alternative list syntaxes that are slimmer and mitigate these problems.
Yes, PML lists are more verbose than lists in Markdown and AsciiDoc, but they are also less verbose than in HTML.
But it's more than just verbosity, it's about the nature and goals of the syntaxes. Markdown, AsciiDoc and other lightweight markup syntaxes are designed to be human-readable, whereas HTML and the whole XML/SGML family are oriented toward machine-readability and serialization. So, for the former group reduced verbosity is just one aspect of "human friendliness", along with other considerations which cover ease-of-editing, intuitive representation, and (very often) the ability to embed the syntax in source code comments without "uglifying" the code.
Whereas Markdown and AsciiDoc are designed to look and feel good in plain text editors, HTML and XML are designed to work best with WYSIWYG GUI abstractions which hide away all the source tags and attributes from view. PML seems to fall somewhere between these two paradigms, aiming to preserve the rigorousness of XML serialization but at the same time reduce verbosity and move toward plain-text human-friendliness by pruning away the unnecessary verbosity from the data-serialization scaffolding (closing tag, etc.).
But once it is established that the PML syntax is formally bound to its parsing/serialization model (i.e. that no exceptions are allowed, e.g. to have lists represented via visual bullets or digits, as in markdown), it's clear that PML is closer to data serialization standards like XML, JSON, YAML, etc., than it is to lightweight markup syntaxes. This being the case, one might then argue that verbosity is not really an issue, since serialization makes it possible to offer WYSIWYG editors for PML, just as for HTML, so end users won't have to deal with attributes when editing, unless they bring up the dedicated interface that exposes them.
That is to say that the argument "is more verbose than markdown but less than HTML" will most likely always apply to PML, given the current syntax constraints. But from an editor's point of view, what really matters is what constitutes "practicality". Verbosity is just one aspect of the problem, but as we've seen above there are other very practical considerations when it comes to editing.
People who work in editing professionally do that all day, possibly eight hours a day, often more. When you find yourself in that situation, every little thing matters. In word processing, WYSIWYG has been "the norm" for ages, whereas lightweight markup syntaxes are more of a "programmers' trend" that entered the scene because of in-code documentation and the need to version-control documentation for collaborative editing. I'm convinced that the new trend for all word processing will be to move away from proprietary binary document formats toward plain-text open-standard formats. Even if many editors will keep working with WYSIWYG editors, the need for plain-text sources in order to allow version-controlled collaborative editing via the Internet will prevail in the long term. And having a human-readable plain-text format is better than having a hostile format, because there's always some tech guy who needs to look at commit diffs and track changes manually; in that respect, PML is much more practical than XML or JSON.
Moreover, as discussed already, PML lists can have content of arbitrary complexity (like HTML lists), which is IMO a big advantage (that comes at the price of a more verbose syntax).
Right now we have a basic list syntax, which defaults to a bullet list unless otherwise constrained via HTML attributes. Bear in mind that AsciiDoc covers an extensive variety of list types, giving end users fine-grained control over the minute details of their representation:
Pandoc covers most of the above too, and also provides numbered example lists. But pandoc offers extremely fine-grained control over list markers, since some output formats (e.g. Docx, LaTeX, etc.) support them, so you can specify list numbers as `1.`, `i)` or `(a)` and pandoc will try to preserve the `.` or parentheses accordingly.
Both AsciiDoc and pandoc markdown lists have been shaped over time around the real-world needs of those who work daily with editing. Although they are not perfect, and in hindsight one could easily suggest improvements (which of course come too late for any syntax that has been standardized), they are used by millions of people in their daily work, as they have been for over a decade, and this is something we have to face and come to terms with.
I've been working in the field of documentation and editing for over three decades, and I know how it feels to work ten hours a day with MS Word (and other WYSIWYG editors) vs a plain-text editor and markdown or AsciiDoc, and I have experienced first-hand both the pros and cons of each. To me, the AsciiDoc way of handling lists has been nothing but a blessing, and I really fail to understand how someone could deem it impractical in any way (and I've worked on the digital reprint of some old manuals with fairly nasty complex lists).
The verbosity could be mitigated for simple lists if we provide a 'simple list' variant (similar to `table` and `table_data`). Then [...] you could simply write:
[slist
item 1
item 2
item 3
]
Yes, that's a good example of how to abstract away complexity. Although it doesn't break the "formal parser model", this type of abstraction does mitigate the "serialization rigidity" problem to quite a degree.
More specific list nodes can easily be added. The challenge is to specify them (node names and their attributes) in a way that makes it convenient to read and write them. I suggest creating a dedicated discussion for concrete suggestions.
What I was trying to point out here is that in the world of lightweight markup syntaxes end users tend to like to see syntax elements represented either by specific symbols (special characters) or "space collocation". The former can be exemplified by AsciiDoc definition lists, which are delimited via `::`, whereas the latter is more apparent in pandoc markdown, where line-beginnings and indentation play an important role in the syntax of specific elements. In both cases, though, the whole idea is that the way the syntax impacts the "electronic page" matters greatly.
Let's not forget that there's a strong connection between ASCII art and the genesis of lightweight markup syntaxes (going back to the early shell command help docs and software documentation in the terminal era). A lot of this has to do with how "the page" (intended as a monospace source file or terminal screen) looks, which is why markdown lists are loved so much. They represent in monospace how a list should look:
1. Lorem ipsum dolor sit amet, consectetur adipisicing elit,
   sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
2. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
   nisi ut aliquip ex ea commodo consequat.
and this is what makes markdown (reST, etc.) ideal for in-source documentation, because it's not invasive even when used inside code comments, just like `*italic*` and `**bold**`.
I suggest creating a dedicated discussion for concrete suggestions.
Agreed. But I hope the above helped define the "general problem", which has more to do with the heritage of lightweight markup syntaxes, and how this might affect end users' expectations in terms of syntax lightness and abstracting away complexity.
I hope the above helped define the "general problem", which has more to do with the heritage of lightweight markup syntaxes, and how this might affect end users' expectations in terms of syntax lightness and abstracting away complexity.
Yes, thank you. I can now see clearly what you mean. We will improve PML lists, until one day you will hopefully say: "Lists are my most favorite feature in PML!". ;-)
Bug fixed in version 4.0.0.
Not closed, but label changed from `Bug` to `Enhancement` because of the suggested list enhancements.
I changed the title from "HTML Attribute Crash Error" to "List Enhancements". We should open a new issue "List Enhancements", copy/paste relevant comments from this issue, and then close this issue.
The PML source line:
caused the following `PARSER_EVENT_HANDLER_ERROR` crash error:
See also my recent bug report about other problems with HTML attributes handling: #90