Missing images in full content mode

jocmp commented 2 weeks ago

Background

Capy Reader uses a library called Readability4J that has a few rules to parse the article's full content.

Sometimes those rules fail, leading to missing images in Capy's full content mode. This is an annoying issue without a single fix-all solution. Every website is different and changes over time, which is part of the beauty and chaos of the web.

If you run into this issue with a feed, please post a link to the feed, along with an example article, in this thread. I'll track these and fix them at some point in the future. Thanks!

Feeds

PhilC813 commented 1 day ago

The articles' main image isn't shown in Capy's full content mode for the following feed: https://mobilesyrup.com/feed/

Article example: https://mobilesyrup.com/2024/11/28/google-releases-ai-generated-pieces-chess-game/

(I only noticed this today so maybe it used to work?)

HTML of the image: <img fetchpriority="high" width="1867" height="1046" src="https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess.jpg" class="attachment-full size-full wp-post-image" alt="" decoding="async" srcset="https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess.jpg 1867w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-300x168.jpg 300w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-1024x574.jpg 1024w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-768x430.jpg 768w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-1536x861.jpg 1536w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-417x235.jpg 417w" sizes="(max-width: 1867px) 100vw, 1867px" />
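
For anyone who wants to poke at markup like this outside the app, here is a minimal Python sketch that reads the `src` and the widest `srcset` candidate from an `<img>` tag using only the standard library. It is purely illustrative: `ImgCollector` and `widest_candidate` are hypothetical names, the `srcset` is shortened to two entries, and nothing here reflects how Readability4J or Capy Reader actually handle images.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the attributes of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<img ... />) are routed here by the default
        # handle_startendtag implementation.
        if tag == "img":
            self.images.append(dict(attrs))


def widest_candidate(srcset):
    """Return the URL with the largest width descriptor (e.g. '1867w'), or None.

    Naive by design: splits on commas, so URLs containing commas are not handled.
    """
    best_url, best_width = None, -1
    for candidate in srcset.split(","):
        parts = candidate.split()
        if len(parts) == 2 and parts[1].endswith("w") and parts[1][:-1].isdigit():
            width = int(parts[1][:-1])
            if width > best_width:
                best_url, best_width = parts[0], width
    return best_url


# Shortened version of the markup quoted above.
html = (
    '<img src="https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess.jpg" '
    'srcset="https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess.jpg 1867w, '
    'https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-300x168.jpg 300w" />'
)

collector = ImgCollector()
collector.feed(html)
image = collector.images[0]
print(image["src"])
print(widest_candidate(image["srcset"]))
```

In this example both lines print the same full-size URL, since the `src` and the 1867w candidate point at the same file.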

jocmp commented 18 hours ago

@PhilC813 an update. I'm toying around with Mercury Parser again and seeing some potential upsides. Here's a comparison of a Les Versants article.

[Before/after screenshots of the Les Versants article]

Mobile Syrup

[Before/after screenshots of a Mobile Syrup article]

PhilC813 commented 18 hours ago

Wow, this seems very promising.

Do you mind checking with this article? https://mobilesyrup.com/2024/11/28/here-are-the-2024-staples-black-friday-deals/

It's an article with Black Friday deals, and the current parser basically removes all the bullet points in which the deals are listed 😅

jocmp commented 17 hours ago

The new parser skips over lists by default, but with a little bit of code it works: https://github.com/jocmp/capyreader/pull/569/files#diff-a5310ab57bf17835286b2a012ceca522b0f9af190ceeea2dcf80c52f82c6479dR41-R49
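
To illustrate the general idea of "dropped by default, kept with a small override", here is a toy Python sketch using only the standard library. It is not Mercury Parser's API and not the code in the linked PR; `DROPPED_BY_DEFAULT`, `ToyCleaner`, and `clean` are hypothetical names, and the default tag set is an assumption made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical defaults for a toy cleaner; NOT Mercury's real configuration.
DROPPED_BY_DEFAULT = {"ul", "ol", "aside"}


class ToyCleaner(HTMLParser):
    """Re-serializes HTML, skipping any element whose tag is in `drop`."""

    def __init__(self, drop):
        super().__init__(convert_charrefs=True)
        self.drop = drop
        self.skip_tag = None  # tag of the dropped element we are inside, if any
        self.skip_depth = 0   # nesting depth of that same tag
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == self.skip_tag:
                self.skip_depth += 1  # nested element with the same tag
            return
        if tag in self.drop:
            self.skip_tag, self.skip_depth = tag, 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unless skipped.
        if not self.skip_depth and tag not in self.drop:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == self.skip_tag:
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is copied as-is; entities are not re-escaped in this toy.
        if not self.skip_depth:
            self.out.append(data)


def clean(html, keep=()):
    """Drop the default tags, except any listed in `keep`."""
    cleaner = ToyCleaner(DROPPED_BY_DEFAULT - set(keep))
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


article = "<p>Deals:</p><ul><li>Laptop</li><li>Monitor</li></ul>"
print(clean(article))               # the list is removed
print(clean(article, keep={"ul"}))  # the list is preserved
```

The point of the `keep` parameter is that the default behavior stays untouched while a caller opts specific tags back in, which is roughly the shape of the override described above.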

PhilC813 commented 17 hours ago

So you can easily specify the <ul> tag as an exception, sweet. Frankly I don't really see a reason why they would be excluded by default. They are more likely to be content than ads.

Also, is there any parser that is still actively maintained? Mercury seems abandoned, like Readability4J. It's not necessarily a problem, but having an active project is always a plus.

jocmp commented 17 hours ago

Couldn't agree more. I think Mercury is the more extensible and maintainable of the two. I forked it and I'm working on bringing its dependencies up to date here: https://github.com/jocmp/mercury-parser.

jocmp / capyreader

Missing images in full content mode #506
