ocsigen / html_of_wiki

Is the wikicreole parser slightly broken, or is it tyxml, or both? #133

Open hhugo opened 1 year ago

hhugo commented 1 year ago

The wikicreole parser currently emits too many B.phrasing elements, sometimes splitting a single word into several pieces.

For example, with the input "the", the parser emits one B.phrasing element for "t" and another for "he". The reason seems to be the rule for parsing http:... links: the parser stops after "t" in case the "h" is the start of http:....
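
To make the splitting concrete, here is a toy sketch of the behaviour described above. It is not the html_of_wiki lexer: the token type, split_chunks and merge_text are hypothetical names made up for illustration, and the "flush on every h" rule is only an assumption about why the split happens. merge_text shows one possible way to coalesce adjacent text fragments afterwards; it is not a fix proposed in this issue.

```ocaml
(* Hypothetical token type, for illustration only. *)
type token = Text of string | Link of string

(* Toy scanner, NOT the real wikicreole lexer: it flushes the pending
   text chunk whenever it meets an 'h', because that 'h' might be the
   start of "http:". Every chunk comes out as its own Text token. *)
let split_chunks (s : string) : token list =
  let len = String.length s in
  let rec go start i acc =
    if i >= len then
      (* End of input: flush whatever text is still pending. *)
      List.rev
        (if i > start then Text (String.sub s start (i - start)) :: acc
         else acc)
    else if s.[i] = 'h' && i > start then
      (* Possible start of "http:": flush the text seen so far. *)
      go i (i + 1) (Text (String.sub s start (i - start)) :: acc)
    else go start (i + 1) acc
  in
  go 0 0 []

(* One possible post-processing step (an assumption, not the project's
   code): glue adjacent Text tokens back together, leave others alone. *)
let rec merge_text = function
  | Text a :: Text b :: rest -> merge_text (Text (a ^ b) :: rest)
  | tok :: rest -> tok :: merge_text rest
  | [] -> []

let () =
  (* "the" is split at the 'h', then merged back into one fragment. *)
  assert (split_chunks "the" = [ Text "t"; Text "he" ]);
  assert (merge_text (split_chunks "the") = [ Text "the" ])
```

With a merge step like this, a word would reach the printer as a single text node, so no break opportunity would be introduced in the middle of it.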

The other aspect that seems weird to me is that tyxml can generate files that browsers render differently depending on whether indentation is enabled. I think it boils down to the following: printing [ pcdata "a"; pcdata "b"] inserts a cut hint between 'a' and 'b'. Format can then decide to insert a newline there if the text is too long, and the browser renders "a\nb" and "ab" differently.
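
The Format side of this can be reproduced with the standard library alone. The sketch below does not use tyxml's printer; render is a hypothetical helper that prints two strings separated by a cut hint ("@,") at a given margin, which is only an assumption about what the indenting printer does between adjacent text nodes.

```ocaml
(* Print [a] and [b] separated by a Format cut hint, using the given
   margin, and return the resulting string. This imitates the situation
   described above; it is not tyxml's actual printer. *)
let render ~margin a b =
  let buf = Buffer.create 32 in
  let ppf = Format.formatter_of_buffer buf in
  Format.pp_set_margin ppf margin;
  (* "@," is a cut: it prints nothing if the box fits on the line,
     and a newline otherwise. "@?" flushes the formatter. *)
  Format.fprintf ppf "@[<hov>%s@,%s@]@?" a b;
  Buffer.contents buf

let () =
  (* Wide margin: the cut prints nothing, the text stays joined. *)
  assert (render ~margin:80 "aaaaaaaa" "bbbbbbbb" = "aaaaaaaabbbbbbbb");
  (* Narrow margin: the same cut becomes a newline. *)
  assert (render ~margin:10 "aaaaaaaa" "bbbbbbbb" = "aaaaaaaa\nbbbbbbbb")
```

In HTML, a newline inside inline text is treated as whitespace, so the second output would typically be displayed with a space between the two halves while the first would not.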

hhugo commented 1 year ago

https://github.com/ocsigen/tyxml/issues/288