simonw / datasette-render-markdown

Datasette plugin for rendering Markdown
Apache License 2.0

"Q&A against documentation" on the datasette.io homepage #13

Closed simonw closed 1 year ago

simonw commented 1 year ago
[Screenshot 2023-01-27 at 2 54 55 PM]
simonw commented 1 year ago

Here's the YAML:

https://github.com/simonw/datasette.io/blob/b821fb19eda08b4942183db507bb6f986f8134bf/news.yaml#L3

The markdown is stored in the DB:

https://datasette.io/content/news?_sort=rowid&date__exact=2023-01-13

And rendered here:

https://github.com/simonw/datasette.io/blob/b821fb19eda08b4942183db507bb6f986f8134bf/templates/index.html#L112

https://github.com/simonw/datasette.io/blob/b821fb19eda08b4942183db507bb6f986f8134bf/templates/pages/news.html#L38

simonw commented 1 year ago

So it looks like datasette-render-markdown is the thing that renders & as &amp; in this context.

It uses https://python-markdown.github.io/

simonw commented 1 year ago

This tool compares different markdown implementations:

https://babelmark.github.io/?text=%5BSemantic+search+answers%3A+Q%26A+against+documentation+with+GPT3+%2B+OpenAI+embeddings%5D(https%3A%2F%2Fsimonwillison.net%2F2023%2FJan%2F13%2Fsemantic-search-answers%2F)+shows+how+Datasette+can+be+used+to+implement+semantic+search+and+build+a+system+for+answering+questions+against+an+existing+corpus+of+text%2C+using+two+new+plugins%3A+%5Bdatasette-openai%5D(https%3A%2F%2Fdatasette.io%2Fplugins%2Fdatasette-openai)+and+%5Bdatasette-faiss%5D(https%3A%2F%2Fdatasette.io%2Fplugins%2Fdatasette-faiss)%2C+and+a+new+tool%3A+%5Bopenai-to-sqlite%5D(https%3A%2F%2Fdatasette.io%2Ftools%2Fopenai-to-sqlite).+

It suggests that python-markdown renders this just fine:

[Screenshot: CleanShot 2023-01-27 at 15 13 53@2x]

simonw commented 1 year ago

I think this is likely a bug in the interaction between the markdown rendering and Bleach in this plugin:

https://github.com/simonw/datasette-render-markdown/blob/c04b0b604093fa3caa69d2fe8a1fb46247f70af6/datasette_render_markdown/__init__.py#L75-L80

I can recreate it locally like this:

>>> from datasette_render_markdown import render_markdown
>>> render_markdown('[this & that](https://www.example.com/)')
Markup('<div style="white-space: normal"><p><a href="https://www.example.com/" rel="nofollow">this &amp;amp; that</a></p></div>')

Note this &amp;amp; that in the output.
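
That &amp;amp; is what double escaping looks like: an ampersand that was already escaped to &amp; gets escaped a second time. A minimal stdlib-only sketch of the effect, using html.escape as a stand-in (an assumption for illustration; this is not the code path that the plugin or Bleach actually runs):

```python
import html

link_text = "this & that"

# First pass: the escaping a Markdown renderer is expected to apply.
escaped_once = html.escape(link_text)

# Second pass: escaping already-escaped text turns "&amp;" into "&amp;amp;".
escaped_twice = html.escape(escaped_once)

print(escaped_once)   # this &amp; that
print(escaped_twice)  # this &amp;amp; that

# A browser decodes only one level, so the page shows a literal "&amp;".
print(html.unescape(escaped_twice))  # this &amp; that
```

The last line is why the homepage showed "Q&amp;A" rather than "Q&A".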

simonw commented 1 year ago

Confirmed: I removed the calls to bleach and got this:

<div style="white-space: normal"><p><a href="https://www.example.com/">this &amp; that</a></p></div>

simonw commented 1 year ago

After more exploration, it turns out it's the call to bleach.linkify(...) that causes the double escaping of the ampersand.

simonw commented 1 year ago

https://bleach.readthedocs.io/en/latest/linkify.html says:

If you plan to sanitize/clean the text and linkify it, you should do that in a single pass using LinkifyFilter. This is faster and it'll use the list of allowed tags from clean.
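
A rough sketch of what that single-pass approach can look like, using Cleaner with LinkifyFilter from Bleach. The allow-lists here are hypothetical and chosen only for this example; the plugin's real lists, and the exact fix that was deployed, are not shown in this thread:

```python
from bleach.linkifier import LinkifyFilter
from bleach.sanitizer import Cleaner

# Hypothetical allow-lists for this sketch; the plugin's own lists may differ.
ALLOWED_TAGS = {"p", "a"}
ALLOWED_ATTRIBUTES = {"a": ["href", "rel"]}

# Sanitize and linkify in one pass instead of calling bleach.linkify()
# on the output of bleach.clean().
cleaner = Cleaner(
    tags=ALLOWED_TAGS,
    attributes=ALLOWED_ATTRIBUTES,
    filters=[LinkifyFilter],
)

# HTML as produced by the Markdown renderer for the test case above.
rendered = '<p><a href="https://www.example.com/">this &amp; that</a></p>'

cleaned = cleaner.clean(rendered)
print(cleaned)
```

If this behaves as the Bleach documentation describes, the link text should keep a single &amp; rather than &amp;amp;.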

simonw commented 1 year ago

Deployed that fix to https://datasette.io/
