j0k3r / graby

Graby helps you extract article content from web pages
MIT License
363 stars 73 forks source link

ContentExtractor: remove logic for wp img lazy loading #251

Open Kdecherf opened 3 years ago

Kdecherf commented 3 years ago

I triggered a bug on a page with a wordpress image lazy loading system based on noscript tags. In the case of this page, two noscript tags were nested for the same image. The page fetch returned two different results depending on the libxml version installed on the Debian and Alpine systems I used for the test.

As far as we already have a logic to move data-lazy-* values in src and srcset, I consider that the whole logic around the noscript tag is not needed anymore.

The page used to trigger the bug: https://petapixel.com/2021/01/20/a-face-mask-can-double-as-a-flash-diffuser-for-better-portraits/

A HTML snippet of this page:

<p>
   <img loading="lazy" src="https://petapixel.com/assets/uploads/2021/01/maskback-800x701.jpg" alt width="800" height="701" class="aligncenter size-large wp-image-505005 jetpack-lazy-image" data-lazy-srcset="https://petapixel.com/assets/uploads/2021/01/maskback-800x701.jpg 800w, https://petapixel.com/assets/uploads/2021/01/maskback-320x281.jpg 320w, https://petapixel.com/assets/uploads/2021/01/maskback.jpg 1080w" data-lazy-sizes="(max-width: 800px) 100vw, 800px" data-lazy-src="http://petapixel.com/assets/uploads/2021/01/maskback-800x701.jpg?is-pending-load=1" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7">
   <noscript>
      <img loading="lazy"  alt="" width="800" height="701"  data-srcset="https://petapixel.com/assets/uploads/2021/01/maskback-800x701.jpg 800w, https://petapixel.com/assets/uploads/2021/01/maskback-320x281.jpg 320w, https://petapixel.com/assets/uploads/2021/01/maskback.jpg 1080w"  data-src="http://petapixel.com/assets/uploads/2021/01/maskback-800x701.jpg" data-sizes="(max-width: 800px) 100vw, 800px" class="aligncenter size-large wp-image-505005 lazyload" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" />
      <noscript>
         <img loading="lazy" src="http://petapixel.com/assets/uploads/2021/01/maskback-800x701.jpg" alt="" width="800" height="701" class="aligncenter size-large wp-image-505005" srcset="https://petapixel.com/assets/uploads/2021/01/maskback-800x701.jpg 800w, https://petapixel.com/assets/uploads/2021/01/maskback-320x281.jpg 320w, https://petapixel.com/assets/uploads/2021/01/maskback.jpg 1080w" sizes="(max-width: 800px) 100vw, 800px" />
      </noscript>
   </noscript>
</p>

This change was enough to fix the issue for this page. However I'm not fully aware of all use cases for this lazy loading/noscript mechanism so maybe more discussion is needed to ensure that we don't break anything else.

coveralls commented 3 years ago

Coverage Status

Coverage decreased (-0.02%) to 96.479% when pulling a0c464da360285f9fc5b3b84cf389757e1a5e757 on Kdecherf:fix/lazy-wp-noscript into e1154098851dd347002314445c18847655f5a289 on j0k3r:master.

jtojnar commented 3 years ago

See also https://github.com/j0k3r/graby/issues/240

jtojnar commented 3 years ago

One issue is that this preserves the noscript tag. Ideally we would remove it:

<p><a href="https://uxmovement.com/category/buttons/" title="Buttons" class="st-term-108 st-accent-bg">Buttons</a></p><div class="st-post-content st-entry st-clr" itemprop="text"><p>A user made the mistake of posting their private information publicly. They had accidentally set their account to public when they wanted it set to private. What inherently caused this serious error wasn’t the user. It was a bad toggle button design.</p><p>The user misperceived the toggle button that controls the public and private modes. The cue for the active state looked inactive to them. Their misperception was caused by the inverted color cues.</p><p>Inverted color cues give the active state a dark or light color, while the inactive one gets the inverse. The problem is that users can interpret either state as active due to their equal color contrast.</p><h2>Equal Color Contrast Ratios</h2><p>Color contrast indicates visual prominence, and prominence indicates an option selection. The active state should have a much higher color contrast than the inactive one. But when you use inverted color cues, it isn’t clear which button is more prominent.</p><p><img data-lazyloaded="1" src="https://uxmovement.com/wp-content/uploads/2021/01/icc-equalcontrastratio.png" class="aligncenter size-full wp-image-32206" alt="" width="510" height="222" /></p><noscript><img class="aligncenter size-full wp-image-32206" src="https://uxmovement.com/wp-content/uploads/2021/01/icc-equalcontrastratio.png" alt="" width="510" height="222" /></noscript><p>When you calculate the color contrast ratio of the toggle buttons, you get the same value. The same value means both buttons are competing for attention, resulting in a tie.</p><p>Some think the solution is to apply a color hue to the active selection. However, this would only cause users to confuse <a href="https://uxmovement.com/buttons/why-toggle-buttons-should-never-look-like-action-buttons/">toggle buttons with action buttons</a>. This confusion occurs because they’re sharing the same cues.<img data-lazyloaded="1" src="https://uxmovement.com/wp-content/uploads/2021/01/icc-cueconfusion.png" class="aligncenter size-full wp-image-32197" alt="" width="510" height="222" /></p><noscript><img class="aligncenter size-full wp-image-32197" src="https://uxmovement.com/wp-content/uploads/2021/01/icc-cueconfusion.png" alt="" width="510" height="222" /></noscript><p>They’re both buttons, but they have different affordances. A toggle button represents selected options and states. An action button performs a process or command on the system. Different affordances need different cues.</p><p>Instead of using inverted colors or color hues, the solution is to use visual depth. Depth cues use spatial contrast to help users distinguish the active state.</p><p><img data-lazyloaded="1" src="https://uxmovement.com/wp-content/uploads/2021/01/icc-cuetouse.png" class="aligncenter size-full wp-image-32198" alt="" width="600" height="150" /></p><noscript><img class="aligncenter size-full wp-image-32198" src="https://uxmovement.com/wp-content/uploads/2021/01/icc-cuetouse.png" alt="" width="600" height="150" /></noscript><p><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5526975/" target="_blank" rel="noopener">Research shows</a> that elements closer in space have a significant impact on visual working memory. Users place more attention on closer elements than ones further away. Therefore, active toggles should appear closer than inactive ones. To achieve this visual effect, you have to apply surface shading and occlusion to your buttons.</p><h2>Access Full Article</h2><p>The full article goes into detail on how to apply surface shading and occlusion on different background colors. You’ll see more examples and learn design techniques to enhance your depth cues. Subscribe to access the full article.</p><p class="c1"><a class="st-theme-button" href="https://uxmovement.substack.com/subscribe?coupon=8f63b414">Access Full Article</a></p></div><section class="st-related-posts-wrap st-clr"></section>
Kdecherf commented 3 years ago

@jtojnar I don't see it as an issue as in this way it preserves the actual page content.

I'm not sure we should strip this kind of content, any though @j0k3r?

j0k3r commented 3 years ago

If we didn't remove that noscript tag, we might have duplicated image in the content. Am I right?

Kdecherf commented 3 years ago

If we didn't remove that noscript tag, we might have duplicated image in the content. Am I right?

It depends. Browsers should respect it, thus not showing duplicates. However we do have duplicates inside the DOM (even if they are not actively shown).

j0k3r commented 3 years ago

What do you think @jtojnar ? I'm ok to keep the noscript

jtojnar commented 3 years ago

I think it should be removed since otherwise the image might be duplicated (think webview in an app with javascript disabled for security). Could we just add a htmlawed rule removing all noscript elements at the end of processing?