html5lib / html5lib-python

Standards-compliant library for parsing and serializing HTML documents and fragments in Python
MIT License

supporting markdown links #514

Open jvanasco opened 3 years ago

jvanasco commented 3 years ago

I maintain a package that uses html5lib to translate Markdown into HTML (https://github.com/jvanasco/html5lib_to_markdown) alongside our Bleach usage for dealing with user-submitted text.

I thought I had a workaround for some odd behavior between Python2 and Python3, but after encountering some issues migrating the CI tests to tox, I dug into my library and this library... and I realized there was a bigger problem.

The problem is that while almost all of Markdown is valid HTML, it also supports a quick "link" format, which is written as a URL inside an unnamed tag:

<https://example.com/path/to>

While my first reaction was to handle this in a pre-processor, I remembered that context matters and I need to know if I encounter this in a code-formatting block or not -- so I need to integrate this with a tokenizer.
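To illustrate why context matters, here is a minimal sketch of the pre-processor idea using only the standard library. It is not the approach taken in html5lib_to_markdown or html5lib; the name `rewrite_autolinks`, the choice of `<a>` as the replacement, and the simplified handling of single-backtick code spans are all assumptions made for the example.

```python
import html
import re

# Matches either a backtick code span (group "code") or an autolink (group "url").
# Simplification (assumption): only single-backtick spans and the three schemes
# mentioned in this issue are handled; fenced blocks are not.
TOKEN = re.compile(
    r"(?P<code>`[^`]*`)"
    r"|<(?P<url>(?:https?://|mailto:)[^\s<>]+)>"
)


def rewrite_autolinks(text):
    """Turn <scheme:...> autolinks into <a> elements, leaving code spans alone."""
    def repl(match):
        if match.group("code") is not None:
            # Inside backticks: keep the text verbatim.
            return match.group("code")
        url = html.escape(match.group("url"), quote=True)
        return '<a href="{0}">{0}</a>'.format(url)

    return TOKEN.sub(repl, text)


sample = "see <https://example.com/path/to> and `<https://example.com/x>`"
print(rewrite_autolinks(sample))
```

The first link is rewritten while the one inside backticks is left untouched. A real implementation would also have to handle fenced and indented code blocks, which is why doing this at the tokenizer level looks more attractive.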

When these links are handled by this library's tokenizer's emitCurrentToken, the current logic creates a token name of "http:", "https:", or "mailto:". This is great.

However, the token's raw data is cast into an ordered dict, which discards any duplicate values and with them any chance of recreating the tag, and some other characters trip up the delimiting. For example:

<https://example.com/a/aa/b/bb/c/d/e/f/g?foo=bar&bar=foo;#biz>
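The loss of information can be shown without html5lib at all. The sketch below assumes the tokenizer has split the URL's path segments into attribute-like (name, value) pairs; the pairs are hypothetical and use a URL with a repeated segment (`example.com/a/b/a`) rather than the one above, purely to make the collapse visible.

```python
from collections import OrderedDict

# Hypothetical raw (name, value) pairs for <https://example.com/a/b/a>,
# assuming each "/"-separated segment was read as an attribute name.
raw_pairs = [("example.com", ""), ("a", ""), ("b", ""), ("a", "")]

# Casting to an ordered dict keeps one entry per name.
attrs = OrderedDict(raw_pairs)

# The repeated "a" segment is gone, so the URL cannot be rebuilt.
rebuilt = "/".join(attrs)
print(len(raw_pairs), len(attrs))
print(rebuilt)
```

Four pairs go in and three keys come out, so the rebuilt path is `example.com/a/b` instead of `example.com/a/b/a`. Keeping the raw list of pairs alongside the dict would be one way to preserve enough data to recreate the tag.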

Is there any chance of html5lib supporting a use case of keeping the full data of these unnamed URL tags somehow? I don't expect them to be serialized by this library, as this is a weird HTMLish format that is not real HTML - but Markdown is a popular and widespread format that is mostly valid HTML, except for this one _____ tag.

There are a few ideas I had that are 70% towards a PR for this - but if this use-case is too outside the scope of this library, I need to spend my time looking for alternatives.

Thanks, J

theRealProHacker commented 1 year ago

Are you converting HTML to Markdown or Markdown to HTML? Because if you actually want to convert Markdown to HTML using an HTML parser, there are definitely more problems than just "this one _____ tag".

For example consider:

```html
<pre>
```

As you mentioned, the HTML parser doesn't know about the quotes around the pre tag and will parse it as an HTML element, which is obviously not what you want.