snarfed / bridgy

📣 Connects your web site to social media. Likes, retweets, mentions, cross-posting, and more...
https://brid.gy
Creative Commons Zero v1.0 Universal

Publish: HTML5 entities not supported due to lxml/libxml2. Should we switch to a different parser? #1535

Open jpcaruana opened 1 year ago

jpcaruana commented 1 year ago

Hi,

I posted https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/ and Bridgy posted it to Mastodon: https://indieweb.social/@jpcaruana/110864216016868215.

In the process, the &mldr; char is displayed as-is instead of … (Unicode U+2026, Horizontal Ellipsis) in the Mastodon post.

Emojis work fine though, see https://indieweb.social/@jpcaruana/110849548714309169 for instance.

How could I help here?

snarfed commented 1 year ago

Interesting!

Looking at the Bridgy log, the mf2 parser ended up with two content values from your post: one plain text, with the &mldr; HTML entity decoded to …, and one html + text, with the entity still raw:

{
      "type": [
        "h-entry"
      ],
      "properties": {
        "url": [
          "https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/"
        ],
        "content": [
          "Not sure. This post https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/ is 501 chars long and https://indieweb.social/@jpcaruana/110836774445084366 does not contain the last sentence.\nWhen was the production deploy for this update? Maybe I just posted this before? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn\u2019t mess up with time zones).\nI haven\u2019t posted anything longer than 500 chars since… I\u2019ll give it another try.",
          {
            "html": "<h1></h1><p>Not sure. This post <a href=\"https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/\">[https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/</a>](https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/%3C/a%3E); is 501 chars long and <a href=\"https://indieweb.social/@jpcaruana/110836774445084366\">[https://indieweb.social/@jpcaruana/110836774445084366</a>](https://indieweb.social/@jpcaruana/110836774445084366%3C/a%3E); does not contain the last sentence.</p><p>When was the production deploy for this update? Maybe I just posted this <em>before</em>? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn\u2019t mess up with time zones).</p><p>I haven\u2019t posted anything longer than 500 chars since&amp;mldr; I\u2019ll give it another try.</p>",
            "value": "Not sure. This post https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/ is 501 chars long and https://indieweb.social/@jpcaruana/110836774445084366 does not contain the last sentence.\nWhen was the production deploy for this update? Maybe I just posted this before? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn\u2019t mess up with time zones).\nI haven\u2019t posted anything longer than 500 chars since&mldr; I\u2019ll give it another try."
          },
          "..."
        ]
      }
}

snarfed commented 1 year ago

A few more data points:

snarfed commented 1 year ago

Oops, I was wrong. I looked at https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/ in browser dev tools, and evidently that shows me the contents after decoding HTML entities. curl and Python requests both show that the response actually has multiple HTML entities, notably both &rsquo; and &mldr; inside *-content. Interestingly we decode &rsquo; ok, but not &mldr;.
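
A quick way to double-check the raw response (plain requests here, same result as curl):

>>> import requests
>>> resp = requests.get('https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/')
>>> '&mldr;' in resp.text, '&rsquo;' in resp.text
(True, True)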

(Btw @jpcaruana it looks like the reason you have multiple values for content is that you're using both p-content and e-content: <div class="p-name p-content e-content">. You probably only want one of those.)

snarfed commented 1 year ago

Looks like this is a BeautifulSoup thing. Bridgy and granary fetch posts with requests:

https://github.com/snarfed/bridgy/blob/b26c8859c6db927a7a30fe1fa1a8f7d4123aa06b/webmention.py#L66-L71

...then parse it manually with BeautifulSoup, then pass that to mf2py:

https://github.com/snarfed/webutil/blob/63be8a763a618d43e957c6d414c0f6de8f298184/util.py#L1917-L1985

def parse_html(input, **kwargs):
  ...
  if isinstance(input, requests.Response):
    content_type = input.headers.get('content-type') or ''
    # if the response headers declare a charset, trust requests' decoding;
    # otherwise pass the raw bytes so BeautifulSoup can sniff the encoding
    input = input.text if 'charset' in content_type else input.content

  return bs4.BeautifulSoup(input, **kwargs)

def parse_mf2(input, url=None, id=None):
  ...
  return mf2py.parse(url=url, doc=input, img_with_alt=True)

When I do this outside of Bridgy/granary, BeautifulSoup converts &rsquo; to ’ but doesn't recognize &mldr;:

>>> resp = util.requests_get('https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/')
>>> print(resp.text)
...
<div class="p-name p-content e-content"><h1></h1><p>Not sure. This post <a href=https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/>https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/</a> is 501 chars long and <a href=https://indieweb.social/@jpcaruana/110836774445084366>https://indieweb.social/@jpcaruana/110836774445084366</a> does not contain the last sentence.</p><p>When was the production deploy for this update? Maybe I just posted this <em>before</em>? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn&rsquo;t mess up with time zones).</p><p>I haven&rsquo;t posted anything longer than 500 chars since&mldr; I&rsquo;ll give it another try.</p></div>
...
>>> soup = bs4.BeautifulSoup(resp.text)
>>> print(soup)
...
<div class="p-name p-content e-content"><h1></h1><p>Not sure. This post <a href="https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/">https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/</a> is 501 chars long and <a href="https://indieweb.social/@jpcaruana/110836774445084366">https://indieweb.social/@jpcaruana/110836774445084366</a> does not contain the last sentence.</p><p>When was the production deploy for this update? Maybe I just posted this <em>before</em>? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn’t mess up with time zones).</p><p>I haven’t posted anything longer than 500 chars since&amp;mldr; I’ll give it another try.</p></div>
...
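
(I didn't pass a parser above; in that case BeautifulSoup picks the "best" parser it finds installed, preferring lxml over html5lib and html.parser, so this default run presumably went through lxml. One way to confirm which builder got picked, at least on recent bs4:)

>>> soup.builder.NAME
'lxml'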

@capjamesg @angelogladding any thoughts here?

snarfed commented 1 year ago

Hmm, maybe it's an lxml thing?

>>> from bs4.diagnose import diagnose
>>> diagnose(resp.text)
Diagnostic running on Beautiful Soup 4.12.2
Python version 3.9.16 (main, Dec  7 2022, 10:06:04)
[Clang 14.0.0 (clang-1400.0.29.202)]
Found lxml version 4.9.3.0
Found html5lib version 1.1

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
...
        I haven’t posted anything longer than 500 chars since… I’ll give it another try...
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
...
        I haven’t posted anything longer than 500 chars since… I’ll give it another try...
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
...
        I haven’t posted anything longer than 500 chars since&amp;mldr; I’ll give it another try.
...

snarfed commented 1 year ago

Looks like maybe yes.


>>> text = resp.text
>>> print(bs4.BeautifulSoup(text, 'html.parser'))
...
<p>I haven’t posted anything longer than 500 chars since… I’ll give it another try.</p>
...
>>> print(bs4.BeautifulSoup(text, 'html5lib'))
...
<p>I haven’t posted anything longer than 500 chars since… I’ll give it another try.</p>
...
>>> print(bs4.BeautifulSoup(text, 'lxml'))
...
<p>I haven’t posted anything longer than 500 chars since&amp;mldr; I’ll give it another try.</p>
...
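
The same divergence reproduces without the network fetch, on a one-line document (assuming lxml and html5lib are both installed):

>>> for parser in ('html.parser', 'html5lib', 'lxml'):
...     print(parser, '->', bs4.BeautifulSoup('<p>since&mldr; done</p>', parser))

html.parser and html5lib print the decoded …; lxml leaves the literal &mldr; (serialized as &amp;mldr;), modulo the <html>/<body> wrappers each parser adds around the fragment.
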
snarfed commented 1 year ago

I see mldr in WHATWG's list of entities and in the 2011 HTML spec. I don't see anything obvious searching lxml's bug tracker for mldr or for entities. So I guess the next step is to file a bug with them?

capjamesg commented 1 year ago

@snarfed I was thinking about this and my first intuition was to try another parser like html5lib. It seems like an lxml issue.
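
For what it's worth, mf2py itself takes an html_parser argument (if I remember its API right), so callers that let mf2py do the parsing could force html5lib without touching anything else; hypothetical sketch:

import mf2py

html = '<div class="h-entry"><div class="e-content">since&mldr; done</div></div>'
# html_parser selects the BeautifulSoup backend; html5lib implements the full
# WHATWG entity list, so &mldr; should decode to … here
parsed = mf2py.parse(doc=html, html_parser='html5lib')

Bridgy pre-parses with BeautifulSoup in webutil, though, so the equivalent fix there would be in that bs4.BeautifulSoup(...) call.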

snarfed commented 1 year ago

Filed https://bugs.launchpad.net/lxml/+bug/2031045

jpcaruana commented 1 year ago

(Btw @jpcaruana it looks like the reason you have multiple values for content is that you're using both p-content and e-content: <div class="p-name p-content e-content">. You probably only want one of those.)

Thank you, I know I had strange issues back in the day, so I ended up duplicating IndieWeb class names. I'll take a look at it.

Which one is, in your opinion, the "one to keep"?

snarfed commented 1 year ago

Generally e-content, which preserves the inner HTML tags. Only use p-content if you want to collapse the value to plain text.
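
A quick way to see the difference with mf2py (made-up markup, just for illustration):

import mf2py

html = '<article class="h-entry"><div class="e-content"><p>Hello <em>world</em></p></div></article>'
print(mf2py.parse(doc=html)['items'][0]['properties']['content'])
# e-content keeps the markup, roughly: [{'html': '<p>Hello <em>world</em></p>', 'value': 'Hello world'}]
# with p-content instead you'd get just the plain string: ['Hello world']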

jpcaruana commented 1 year ago

Crystal clear, thank you. Fixed on my side :)

snarfed commented 1 year ago

...aaand filed https://gitlab.gnome.org/GNOME/libxml2/-/issues/580

snarfed commented 1 year ago

Evidently the root cause is that libxml2 only supports HTML4, not HTML5, even 15y later 🤦. They have a 2y old issue tracking adding HTML5 support, with some discussion, but no obvious progress. Sigh. https://gitlab.gnome.org/GNOME/libxml2/-/issues/211

snarfed commented 1 year ago

Added this to the Bridgy docs: b37f45cb47c9ebcbe35eabcdce29943b137c0518, https://brid.gy/about#html-entities

capjamesg commented 1 year ago

Wow. That is surprising. I think switching to a different parser sounds wise; the user shouldn't have to bear the burden of seeing malformed markup.

jpcaruana commented 1 year ago

Wow. That is surprising. I think switching to a different parser sounds wise; the user shouldn't have to bear the burden of seeing malformed markup.

Do you know of any alternatives out there?

capjamesg commented 1 year ago

You can use html5lib with BeautifulSoup for HTML5 parsing, but the BeautifulSoup documentation says this parser is "very slow."

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

jpcaruana commented 1 year ago

I found html5-parser but have not tested it; they claim it is fast:

A fast implementation of the HTML 5 parsing spec for Python

https://html5-parser.readthedocs.io/en/latest/
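
Untested sketch of how it might plug in, going by their docs:

from html5_parser import parse

# treebuilder='soup' asks for a BeautifulSoup tree (per the html5-parser docs),
# which might drop into the existing webutil/mf2py pipeline unchanged
soup = parse('<p>since&mldr; done</p>', treebuilder='soup')
print(soup)  # an HTML5-conformant parser should emit the decoded … here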

capjamesg commented 1 year ago

@jpcaruana Looks interesting!

There is a benchmark script named benchmark.py that compares the parse times for parsing a large (~ 5.7MB) HTML document in html5lib and html5-parser. The results on my system (using python 3) show a speedup of 37x.

snarfed commented 1 year ago

Thanks for the sleuthing, guys! Also I think BeautifulSoup's claims are something like 10y old or more. So they may still be true, but they may well not be. 🤷

kevinmarks commented 1 year ago

I remember sknebel doing a bunch of performance work a while back, around 5 years ago (https://github.com/microformats/mf2py/issues?q=label%3Aperformance+), and some BeautifulSoup refactoring before that by Kyle.

It may be time to look at backing out of BeautifulSoup in favour of a modern parser like html5-parser, which would be more like the way the Go parser works, but that would be a big refactor.

snarfed commented 1 year ago

Thanks @kevinmarks!

Not a top priority for me personally, but moving away from BeautifulSoup might only be a medium-sized refactor, spread across a few packages, and we could do them independently.