pepr / asciidoc

Text based document generation. AsciiDoc is a text document format for writing notes, documentation, articles, books, ebooks, slideshows, web pages, man pages and blogs. AsciiDoc files can be translated to many formats including HTML, PDF, EPUB, man page.
http://asciidoc.org/
GNU General Public License v2.0

Braindump #7

Open elextr opened 9 years ago

elextr commented 9 years ago

This is just a braindump of things that you may (or may not) find useful for porting and re-design.

The (rough) outline design of the current implementation is:

read input until markup is recognised
if markup is start of an element
    output start of element based on configuration template
    process input recursively until end element found
    output end of element based on configuration template
else markup must be end of an element
    return
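
As a rough illustration only, the outline above could be sketched in Python as follows. The `<<name` / `>>` markers, the `TEMPLATES` table and the function names are all made up for this sketch; they are not real AsciiDoc syntax or the actual asciidoc code.

```python
import re

# Hypothetical toy markup (NOT real AsciiDoc): "<<name" opens an element, ">>" closes it.
START = re.compile(r"^<<(\w+)$")
END = ">>"

# Stand-in for the configuration templates: (start text, end text) per element.
TEMPLATES = {
    "section": ("<div>", "</div>"),
    "quote": ("<blockquote>", "</blockquote>"),
}


def translate(lines, out):
    """Consume lines from a shared iterator, emitting output as soon as markup is seen."""
    for line in lines:
        match = START.match(line)
        if match:
            start, end = TEMPLATES[match.group(1)]
            out.append(start)        # output start of element from its template
            translate(lines, out)    # process input recursively until the end marker
            out.append(end)          # output end of element from its template
        elif line == END:
            return                   # end of the element the caller is processing
        else:
            out.append(line)         # plain text is passed straight through


def render(text):
    out = []
    translate(iter(text.splitlines()), out)
    return "\n".join(out)
```

Note that the only state is the call stack (the open parent elements) and the output written so far, which is the limitation described next.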

The problem with this design is that it remembers nothing of what it has seen except the direct parent elements of the current location, and it knows nothing about what comes later. It therefore cannot generate things like a table of contents: when the TOC is output near the front of the document, none of the contents it lists have been read yet. That is why it uses the JavaScript TOC generator, which can see the whole DOM at display time. Likewise, forward links cannot use any information from their target, because the target has not been processed when the link is written.

A redesign (as distinct from just porting from Python 2 to 3) should address this by parsing the whole document and manipulating the resulting tree before translating it for output. That would allow static HTML TOCs and links like those Asciidoctor generates, as well as several other features that need the whole document to be visible.
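
To make the idea concrete, here is a minimal sketch of a parse / manipulate / render pipeline. It is an assumption-laden toy, not a proposed design: the `Node` class, the `parse`, `add_toc`, `slug` and `render` names are invented for illustration, only `== Title` lines and plain paragraphs are handled, and no HTML escaping is done.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # "doc", "toc", "entry", "section" or "para"
    text: str = ""
    children: list = field(default_factory=list)


def parse(text):
    """Toy parser: '== Title' opens a section, other non-blank lines are paragraphs."""
    doc = Node("doc")
    current = doc
    for line in text.splitlines():
        if line.startswith("== "):
            current = Node("section", line[3:])
            doc.children.append(current)
        elif line.strip():
            current.children.append(Node("para", line))
    return doc


def slug(title):
    """Hypothetical anchor id derived from a section title."""
    return title.lower().replace(" ", "-")


def add_toc(doc):
    """Tree manipulation: every section title is known before any output is written."""
    titles = [n.text for n in doc.children if n.kind == "section"]
    doc.children.insert(0, Node("toc", children=[Node("entry", t) for t in titles]))


def render(node):
    inner = "\n".join(render(child) for child in node.children)
    if node.kind == "toc":
        return "<ul>\n" + inner + "\n</ul>"
    if node.kind == "entry":
        return f'<li><a href="#{slug(node.text)}">{node.text}</a></li>'
    if node.kind == "section":
        head = f'<h2 id="{slug(node.text)}">{node.text}</h2>'
        return head + ("\n" + inner if inner else "")
    if node.kind == "para":
        return f"<p>{node.text}</p>"
    return inner                   # "doc": just its children
```

Calling `parse`, then `add_toc`, then `render` produces a static TOC at the top whose links point at headings that appear later in the output, which the single-pass design cannot do.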

As mentioned on the Asciidoc thread, the Python design uses a lot of regular expressions, both in the code and especially in the config files, to recognise markup during parsing. Python 3 changed the semantics of regular expressions so that str patterns are Unicode by default, so for the Python 3 port every regex needs to be checked to confirm that it still matches the right thing as a Unicode pattern rather than an ASCII one (or it needs to be explicitly marked ASCII).
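
A small example of the kind of difference to look for, using a made-up sample string rather than any pattern from the asciidoc config files. In Python 3, `\w` in a str pattern matches Unicode word characters unless the `re.ASCII` flag (or the inline `(?a)` flag, which may be the more practical form inside a config file) is given:

```python
import re

# Sample text with two non-ASCII letters (i-diaeresis and e-acute).
text = "na\u00efve caf\u00e9 123"

# Python 3 default for str patterns: \w follows the Unicode database.
unicode_words = re.findall(r"\w+", text)

# Explicitly ASCII-only, via the flag argument...
ascii_words = re.findall(r"\w+", text, re.ASCII)

# ...or via the inline flag, which needs no change to the calling code.
inline_ascii_words = re.findall(r"(?a)\w+", text)

print(unicode_words)       # ['naïve', 'café', '123']
print(ascii_words)         # ['na', 've', 'caf', '123']
print(inline_ascii_words)  # ['na', 've', 'caf', '123']
```

The same applies to `\d`, `\s`, `\b` and case-insensitive matching, so patterns using those are the ones most worth reviewing.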

There are some pieces of embedded Python in the config files and in the filters; these also need to be checked under Python 3. The code also needs to be checked to make sure that the right version of Python is used when they are run.
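
One possible way to pin the interpreter, sketched under the assumption that a filter is a standalone Python script that reads stdin and writes stdout. The `run_filter` name is hypothetical and this is not how asciidoc currently launches filters; it only shows that `sys.executable` can be used instead of relying on whatever `python` is found on the PATH.

```python
import subprocess
import sys


def run_filter(script_path, data):
    """Run a Python filter script with the same interpreter that runs the caller."""
    result = subprocess.run(
        [sys.executable, script_path],  # not a bare "python", which may be Python 2
        input=data,
        capture_output=True,            # requires Python 3.7 or later
        text=True,
        check=True,                     # raise CalledProcessError if the filter fails
    )
    return result.stdout
```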

Both of these issues may make it hard to run the same code and config files on both Python 2 and 3, since the semantics differ between the two versions.

pepr commented 9 years ago

Thanks, @elextr. I have updated the Plan to reference your ideas.

As for Unicode, shouldn't it work as is? I have not checked, but I would expect it to. All the strings are now Unicode internally, and all the patterns are Unicode as well.