supporting markdown links

I maintain a package that uses html5lib to translate Markdown into HTML (https://github.com/jvanasco/html5lib_to_markdown) alongside our Bleach usage for dealing with user-submitted text.

I thought I had a workaround for some odd behavior between Python2 and Python3, but after encountering some issues migrating the CI tests to tox, I dug into my library and this library... and I realized there was a bigger problem.

The problem is that while almost all of Markdown is valid HTML, it also supports a quick "link" format, which is written as a URL inside an unnamed tag:

<https://example.com/path/to>

While my first reaction was to handle this in a pre-processor, I remembered that context matters and I need to know if I encounter this in a code-formatting block or not -- so I need to integrate this with a tokenizer.
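
To illustrate why context matters, here is a minimal sketch of a naive pre-processor. The `AUTOLINK` regex and `naive_preprocess` are hypothetical and cover only a rough approximation of the quick-link format, not Markdown's full grammar. The point is that a plain regex pass cannot tell a real link from one sitting inside a code span:

```python
import re

# Hypothetical pattern: a rough approximation of the quick-link format,
# limited to the http:, https:, and mailto: schemes mentioned in this issue.
AUTOLINK = re.compile(r"<((?:https?://|mailto:)[^<>\s]+)>")


def naive_preprocess(text):
    """Rewrite every quick link into an anchor tag, with no context awareness."""
    return AUTOLINK.sub(r'<a href="\1">\1</a>', text)


# The second link sits inside a backtick code span and should be left alone,
# but the regex has no notion of code formatting and rewrites it too.
sample = "See <https://example.com/path/to> and `<https://example.com/path/to>`"
print(naive_preprocess(sample))
```

Both occurrences get rewritten, including the one inside the backticks, which is the case a tokenizer-level integration would be able to avoid.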

When these links are handled by this library's tokenizer's emitCurrentToken, the current logic creates a token name of "http:", "https:", or "mailto:". This is great.

However, the token's raw data is cast into an ordered dict - which blows away any duplicate values, and with them any chance to recreate the tag -- and some other characters trip up the delimiting. For example:

<https://example.com/a/aa/b/bb/c/d/e/f/g?foo=bar&bar=foo;#biz>
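
A minimal sketch of the duplicate-loss problem, using only the standard library. The pairs below are hand-written and only imitate how the path segments of a hypothetical `<https://example.com/a/b/a>` might be collected as attribute names; they are an assumption, not this library's actual output, and `rebuild` is a hypothetical helper:

```python
from collections import OrderedDict

# Assumption: each path segment ends up as an attribute name with an empty
# value. The segment "a" appears twice in the hypothetical URL.
raw_pairs = [("example.com", ""), ("a", ""), ("b", ""), ("a", "")]


def rebuild(name, pairs):
    """Hypothetical helper: reassemble the tag text from a name and pairs."""
    return "<" + name + "//" + "/".join(key for key, _ in pairs) + ">"


# With the raw list of pairs, the original tag can be recreated.
print(rebuild("https:", raw_pairs))

# After casting to an ordered dict, the repeated "a" collapses into one key,
# so the trailing path segment is gone for good.
as_dict = OrderedDict(raw_pairs)
print(rebuild("https:", list(as_dict.items())))
```

The first call recreates the full tag; the second loses the final `/a`, which is why keeping the raw pairs around (or the raw text) would be needed to round-trip these tags.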

Is there any chance of html5lib supporting a use case of keeping the full data of these unnamed URL tags somehow? I don't expect them to be serialized by this library, as this is a weird HTMLish format that is not real HTML - but Markdown is a popular and widespread format that is mostly valid HTML, except for this one _____ tag.

There are a few ideas I had that are 70% towards a PR for this - but if this use-case is too outside the scope of this library, I need to spend my time looking for alternatives.

Thanks, J
