[doc] html_unescape: Create html.unescape extension and use it for no-break space

JJRcop commented 10 months ago

Fixes no-break space by adding a new html_unescape extension to the docs, which runs Python's html.unescape on source files before Sphinx renders them.

This lets us use HTML character references like &nbsp; in the docs, which are converted to the real characters as Sphinx reads the files to render them. The source files are not modified; the conversion happens only at render time.

I have also published this extension in my own name under a different license (the same one Sphinx uses) for others to use: https://github.com/JJRcop/sphinxcontrib-html_unescape

gnif commented 10 months ago

Is there a better way to do this, such as &nbsp;?

JJRcop commented 10 months ago

Is there a better way to do this, such as &nbsp;?

The Docutils FAQ (Sphinx is built on top of Docutils) recommends using the literal character rather than escaping it.

How can I represent esoteric characters (e.g. character entities) in a document?

For example, say you want an em-dash (XML character entity &mdash;, Unicode character U+2014) in your document: use a real em-dash. Insert literal characters (e.g. type a real em-dash) into your input file, using whatever encoding suits your application, and tell Docutils the input encoding. Docutils uses Unicode internally, so the em-dash character is U+2014 internally. […] ReStructuredText has no character entity subsystem; it doesn't know anything about XML character entities. To Docutils, "&mdash;" in input text is 7 discrete characters; no interpretation happens. When writing HTML, the "&" is converted to "&amp;", so in the raw output you'd see "&amp;mdash;". There's no difference in interpretation for text inside or outside inline literals or literal blocks -- there's no character entity interpretation in either case.
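The behaviour the FAQ describes can be reproduced with the Python standard library. This is only an illustration: `html.escape` stands in for the escaping done by the Docutils HTML writer (an assumption, it is not the writer's actual code), and `html.unescape` is the function this PR proposes to run on the sources.

```python
import html

entity = "&mdash;"

# To a parser with no entity support this is just 7 ordinary characters.
length = len(entity)

# Escaping for HTML output turns the "&" into "&amp;", so the entity
# shows up literally in the rendered page instead of as a dash.
escaped = html.escape(entity)

# Unescaping before parsing yields the real character, U+2014.
unescaped = html.unescape(entity)

print(length, escaped, unescaped == "\u2014")
```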

The FAQ goes on to describe a workaround using |substitution|, but that would require nested inline markup, which rST doesn't support (the text in question is already inside the "literal" markup of ``).

Is nested inline markup possible?

Not currently, no. It's on the to-do list (details here), and hopefully will be part of the reStructuredText parser soon. [...] There are workarounds, but they are either convoluted or ugly or both. They are not recommended.

I did some further research and found that we could run html.unescape() from the Python standard library on the contents of each file, which would enable entities like &nbsp;. This is probably best done as a Sphinx extension.

JJRcop commented 10 months ago

I did some further research and found that we could run html.unescape() from the Python standard library on the contents of each file, which would enable entities like &nbsp;. This is probably best done as a Sphinx extension.

Done

gnif / LookingGlass

[doc] html_unescape: Create html.unescape extension and use it for no-break space #1095
