Open Kdecherf opened 1 year ago
I think the tidy step is somewhat valuable (maybe we could even extend it to remove img
without src
) so I would add source
and other void elements to the list. It is not like new HTML elements appear that often.
Perhaps the medium page is XHTML5 page? [Edit: Nevermind, they do not close the img
.] For XML parser, self-termination should be equivalent to no content followed by a closing tag. I verified it with validator on the following document:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test void element with closing tag</title>
</head>
<body>
<picture>
<source srcset="[…]" type="image/webp"></source>
<source data-testid="og" srcset="[…]"></source>
<img alt="" class="bg mj mk c" width="651" height="478" loading="lazy" role="presentation">
</picture>
</body>
</html>
Needs to be uploaded as xhtml file, direct input does not detect XML correctly.
maybe we could even extend it to remove
img
withoutsrc
Removing img
without src
attribute would break picture
tags containing it, even if source
tags are present, see https://github.com/wallabag/wallabag/issues/6414#issuecomment-1676086996
I think the fastest is to add source
to the list.
About adding more elements to the list, is that one enough? https://developer.mozilla.org/en-US/docs/Glossary/Void_element
Funny I didn't see the iframe
element in the list..
Oh, I see what Kevin means now. The filter cleans up all elements without content, even when they are meaningful. This includes void elements but is not exclusive to them. For example, empty td
s are needed for empty table cells because removing them would jumble the table. And while we could enumerate void elements quite easily, the non-void ones are much harder since there is probably no official list of elements that are meaningful without content.
Kevin mentioned continuing to extend the blacklist, and removing the filter completely as the options. Alternately, we could also switch to a whitelist of elements to remove when empty (e.g. p|span|font
). That would give us at least some tidiness. We could also log all other empty elements not in the whitelist to allow us to grow coverage over time.
But the choice of action depends on the goals of Graby – do we want to preserve content even when it might be a mess, or do we want a clean content model at the cost of it being potentially incomplete?
I'm 👍 for the whitelist
Noted @j0k3r, I'll take care of that
Considering the following snippet from a medium.com article:
Currently graby removes these
source
tags because of this routine inGraby.php
:However, not keeping these tags actually breaks images because
img
does not define any src path (thanks medium).As this routine is run before ContentExtractor, we can't use
find/replace
in site-config to prevent that.Thus, I see two ways to deal with it, whether:
The former feels like a infinite pain as we may see other exceptions over time, I would go for the latter imo.
Any thoughts @j0k3r @jtojnar?
On a side note, no, medium.com is not compliant with HTML specification as the
source
tag is a "void element", see https://developer.mozilla.org/en-US/docs/Web/HTML/Element/source#try_it