mozilla / readability

A standalone version of the readability lib

Hero image removed on some (but not all) bbc.com articles #900

Open maxpatiiuk opened 3 months ago

maxpatiiuk commented 3 months ago

Some pages on bbc.com are missing the hero image.

bbc.com articles where Readability removes the hero image:

bbc.com articles where the hero image is preserved:

Example markup of removed image:

<figure><div data-component="image-block" class="sc-18fde0d6-0 EXUng"><div data-testid="hero-image" class="sc-814e9212-1 fcEyBx"><img sizes="(min-width: 1280px) 50vw, (min-width: 1008px) 66vw, 96vw" srcset="https://ichef.bbci.co.uk/news/240/cpsprodpb/538b/live/d57eadd0-514f-11ef-986a-a10b7a6886df.jpg.webp 240w,https://ichef.bbci.co.uk/news/320/cpsprodpb/538b/live/d57eadd0-514f-11ef-986a-a10b7a6886df.jpg.webp 320w,https://ichef.bbci.co.uk/news/480/cpsprodpb/538b/live/d57eadd0-514f-11ef-986a-a10b7a6886df.jpg.webp 480w,https://ichef.bbci.co.uk/news/640/cpsprodpb/538b/live/d57eadd0-514f-11ef-986a-a10b7a6886df.jpg.webp 640w,https://ichef.bbci.co.uk/news/800/cpsprodpb/538b/live/d57eadd0-514f-11ef-986a-a10b7a6886df.jpg.webp 800w,https://ichef.bbci.co.uk/news/1024/cpsprodpb/538b/live/d57eadd0-514f-11ef-986a-a10b7a6886df.jpg.webp 1024w,https://ichef.bbci.co.uk/news/1536/cpsprodpb/538b/live/d57eadd0-514f-11ef-986a-a10b7a6886df.jpg.webp 1536w" src="https://ichef.bbci.co.uk/news/480/cpsprodpb/538b/live/d57eadd0-514f-11ef-986a-a10b7a6886df.jpg.webp" loading="eager" alt="Reuters Fire damage is shown in the Wahikuli Terrace neighborhood in the fire ravaged town of Lahaina" class="sc-814e9212-0 hIXOPW"><span class="sc-814e9212-2 jesyMJ">Reuters</span></div></div></figure>

Example markup of preserved image:

<figure><div data-component="image-block" class="sc-18fde0d6-0 EXUng"><div data-testid="hero-image" class="sc-814e9212-1 fcEyBx"><img sizes="(min-width: 1280px) 50vw, (min-width: 1008px) 66vw, 96vw" srcset="https://ichef.bbci.co.uk/images/ic/160xn/p0jds5n7.jpg.webp 160w,https://ichef.bbci.co.uk/images/ic/240xn/p0jds5n7.jpg.webp 240w,https://ichef.bbci.co.uk/images/ic/320xn/p0jds5n7.jpg.webp 320w,https://ichef.bbci.co.uk/images/ic/480xn/p0jds5n7.jpg.webp 480w,https://ichef.bbci.co.uk/images/ic/640xn/p0jds5n7.jpg.webp 640w,https://ichef.bbci.co.uk/images/ic/800xn/p0jds5n7.jpg.webp 800w,https://ichef.bbci.co.uk/images/ic/1024xn/p0jds5n7.jpg.webp 1024w,https://ichef.bbci.co.uk/images/ic/1376xn/p0jds5n7.jpg.webp 1376w,https://ichef.bbci.co.uk/images/ic/1920xn/p0jds5n7.jpg.webp 1920w" src="https://ichef.bbci.co.uk/images/ic/480xn/p0jds5n7.jpg.webp" loading="eager" alt="Netflix Jeff Goldblum in Netflix's Kaos (Credit: Netflix)" class="sc-814e9212-0 hIXOPW"><span class="sc-814e9212-2 jesyMJ">Netflix</span></div></div></figure>

Nothing obvious stands out in the markup: the two snippets look nearly identical apart from the image URLs, the alt text, and the credit caption.

I am no expert in the Readability.js source code, but from my debugging it looks like the images are removed during scoring: in some articles the image scores above the threshold, in others it does not.
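For anyone who wants to reproduce this outside the browser, here is a minimal sketch of how one might check whether a hero image survives extraction. It assumes Node.js with the `jsdom` and `@mozilla/readability` packages installed; the helper names (`imageSources`, `heroImageSurvived`, `checkPage`) are hypothetical and not part of Readability. It only reports whether the image is present in the output and does not explain why the scoring dropped it.

```javascript
// Collect the src attribute of every <img> in an HTML string.
// A regex is enough for a quick diagnostic; it is not a general HTML parser.
function imageSources(html) {
  const sources = [];
  const pattern = /<img\b[^>]*?\ssrc="([^"]+)"/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    sources.push(match[1]);
  }
  return sources;
}

// True if the hero image URL still appears among the <img> tags
// of the extracted article HTML.
function heroImageSurvived(articleHtml, heroSrc) {
  return imageSources(articleHtml).includes(heroSrc);
}

// Run Readability on a saved page and report whether the hero image was kept.
// The packages are required lazily so the helpers above work without them.
function checkPage(pageHtml, pageUrl, heroSrc) {
  const { JSDOM } = require("jsdom");
  const { Readability } = require("@mozilla/readability");
  const dom = new JSDOM(pageHtml, { url: pageUrl });
  // debug: true turns on Readability's own logging.
  const article = new Readability(dom.window.document, { debug: true }).parse();
  // parse() returns null when no article could be extracted.
  return article !== null && heroImageSurvived(article.content, heroSrc);
}
```

Calling `checkPage` with the saved HTML of an affected article, its URL, and the `src` value of the hero `<img>` should return `false` if the image was dropped. The comparison is an exact string match on `src`, so it would need adjusting for URLs that contain characters the serializer escapes, such as `&`.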

These may or may not be related:

fchasen commented 2 months ago

Thanks for all the examples; they will be very useful in debugging this. I also don't see any immediate difference, so I will have to look at what is up with the scoring for these.