j0k3r / graby

Graby helps you extract article content from web pages
MIT License
368 stars 73 forks source link

Removing empty nodes early breaks medium.com images #337

Open Kdecherf opened 1 year ago

Kdecherf commented 1 year ago

Considering the following snippet from a medium.com article:

<figure class="pd pe pf pg ph md mo mp paragraph-image">
   <div role="button" tabindex="0" class="pi pj ff pk bg pl">
      <div class="mo mp pc">
         <picture>
            <source srcset="[…]" type="image/webp"></source>
            <source data-testid="og" srcset="[…]"></source>
            <img alt="" class="bg mj mk c" width="651" height="478" loading="lazy" role="presentation">
         </picture>
   </div>
</div>
<figcaption class="ml mm mn mo mp mq mr be b bf z dt">Dow</figcaption></figure>

Currently graby removes these source tags because of this routine in Graby.php:

        // Remove empty lines to avoid runaway evaluation of following regex on badly coded websites
        $re = '/^[ \t]*[\r\n]+/m';
        $htmlCleaned = preg_replace($re, '', $html);

However, not keeping these tags actually breaks images because img does not define any src path (thanks medium).

As this routine is run before ContentExtractor, we can't use find/replace in site-config to prevent that.

Thus, I see two ways to deal with it, whether:

The former feels like a infinite pain as we may see other exceptions over time, I would go for the latter imo.

Any thoughts @j0k3r @jtojnar?

On a side note, no, medium.com is not compliant with HTML specification as the source tag is a "void element", see https://developer.mozilla.org/en-US/docs/Web/HTML/Element/source#try_it

jtojnar commented 1 year ago

I think the tidy step is somewhat valuable (maybe we could even extend it to remove img without src) so I would add source and other void elements to the list. It is not like new HTML elements appear that often.


Perhaps the medium page is XHTML5 page? [Edit: Nevermind, they do not close the img.] For XML parser, self-termination should be equivalent to no content followed by a closing tag. I verified it with validator on the following document:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test void element with closing tag</title>
</head>
<body>
 <picture>
  <source srcset="[…]" type="image/webp"></source>
  <source data-testid="og" srcset="[…]"></source>
  <img alt="" class="bg mj mk c" width="651" height="478" loading="lazy" role="presentation">
 </picture>
</body>
</html>

Needs to be uploaded as xhtml file, direct input does not detect XML correctly.

Kdecherf commented 1 year ago

maybe we could even extend it to remove img without src

Removing img without src attribute would break picture tags containing it, even if source tags are present, see https://github.com/wallabag/wallabag/issues/6414#issuecomment-1676086996

j0k3r commented 1 year ago

I think the fastest is to add source to the list. About adding more elements to the list, is that one enough? https://developer.mozilla.org/en-US/docs/Glossary/Void_element

Funny I didn't see the iframe element in the list..

jtojnar commented 1 year ago

Oh, I see what Kevin means now. The filter cleans up all elements without content, even when they are meaningful. This includes void elements but is not exclusive to them. For example, empty tds are needed for empty table cells because removing them would jumble the table. And while we could enumerate void elements quite easily, the non-void ones are much harder since there is probably no official list of elements that are meaningful without content.

Kevin mentioned continuing to extend the blacklist, and removing the filter completely as the options. Alternately, we could also switch to a whitelist of elements to remove when empty (e.g. p|span|font). That would give us at least some tidiness. We could also log all other empty elements not in the whitelist to allow us to grow coverage over time.

But the choice of action depends on the goals of Graby – do we want to preserve content even when it might be a mess, or do we want a clean content model at the cost of it being potentially incomplete?

j0k3r commented 1 year ago

I'm 👍 for the whitelist

Kdecherf commented 1 year ago

Noted @j0k3r, I'll take care of that