miniflux / v2

Minimalist and opinionated feed reader
https://miniflux.app
Apache License 2.0

Some feeds not loading images #2461

Open j0ra0 opened 6 months ago

j0ra0 commented 6 months ago

Hi,

I have searched for a solution but have not found one, so I'm sorry if this has come up before. Some of my news feeds (e.g. www.adressa.no) do not load images. The HTML from the original web page looks like this:

<figure class="media image ratio-wide original">
<div class="ratio zoomable" style="padding-top: 66.67%;">
<img src="https://vcdn.polarismedia.no/9095794f-7425-4e20-b76f-e3389c06f60c?fit=clip&amp;h=600&amp;q=80&amp;tight=false&amp;w=800" alt="" class="loaded">
</div> 
</figure>

and Miniflux changes it to the following:

<figure>
<img src="" alt="" loading="lazy">
</figure>

Is there a fix? I really love having all my news in Miniflux.

fguillot commented 5 months ago

Did you try to use the add_dynamic_image rewrite rule?

www.adressa.no has a paywall, so I can't access the articles.

j0ra0 commented 5 months ago

Thank you for your suggestion. I did try that already without any luck. Some articles are free, like these:

https://www.adressa.no/forbruker/i/P4Wm2R/de-har-laget-paaskequiz-fra-troendelag-folk-har-virkelig-bydd-paa-seg-sjoel

https://www.adressa.no/kultur/i/ondnV7/veien-til-ol-blir-dokumentarserie-det-er-mye-mer-enn-bare-slaassing

Btw, I can usually load the first picture in the article, but no more.

j0ra0 commented 3 months ago

Anyone?😇

hyyyjinx commented 3 months ago

similar issue here for the following feed (not paywalled)

https://api.quantamagazine.org/feed/

example article: https://www.quantamagazine.org/mathematicians-attempt-to-glimpse-past-the-big-bang-20240531/

in miniflux (with add_dynamic_image rule applied):

[image: screenshot not preserved]

Note: the first image of the article is added. Additional images of the authors are missing.

ztec commented 3 months ago

About the case of @hyyyjinx: the image is not displayed in Miniflux despite the rewrite rule because it is inserted by site-specific JavaScript code.

The image actually appears in this format in the page source (from https://www.quantamagazine.org/mathematicians-attempt-to-glimpse-past-the-big-bang-20240531/):

<div id='component-6669b8ad24623'>
<script type="text/template">{"type":"Image","id":"component-6669b8ad24623","data":{"id":138269,"src":"https:\/\/d2r55xnwy6nx47.cloudfront.net\/uploads\/2024\/05\/GhazalGeshnizjani-crEvanPappas_PerimeterInstitute-02.webp","alt":"","class":"","width":1350,"height":1409,"mobileSrc":false,"zoomSrc":false,"mobileZoomSrc":false,"align":"align=\"inline\"","wrapper_width":"","caption":"
<p>Working together with Jerome Quintin and Eric Ling,
Ghazal Geshnizjani of the Perimeter Institute examined
ways in which space-time might be extended beyond the
Big Bang.<\/p>\n","attribution":"<p>Evan Pappas,
Perimeter
Institute<\/p>\n","variant":"shortcode","size":"default","disableZoom":false,"disableMobileZoom":false,"srcImage":{"ID":138269,"id":138269,"title":"GhazalGeshnizjani-crEvanPappas_PerimeterInstitute-02","filename":"GhazalGeshnizjani-crEvanPappas_PerimeterInstitute-02.webp","filesize":850028,"url":"https:\/\/d2r55xnwy6nx47.cloudfront.net\/uploads\/2024\/05\/GhazalGeshnizjani-crEvanPappas_PerimeterInstitute-02.webp","link":"https:\/\/www.quantamagazine.org\/mathematicians-attempt-to-glimpse-past-the-big-bang-20240531\/ghazalgeshnizjani-crevanpappas_perimeterinstitute-02\/","alt":"","author":"52094","description":"","caption":"","name":"ghazalgeshnizjani-crevanpappas_perimeterinstitute-02","status":"inherit","uploaded_to":138260,"date":"2024-05-30
18:52:07","modified":"2024-05-30
18:52:07","menu_order":0,"mime_type":"image\/webp","type":"image","subtype":"webp","icon":"https:\/\/api.quantamagazine.org\/wp-includes\/images\/media\/default.png","width":1350,"height":1409,"sizes":{"thumbnail":"https:\/\/d2r55xnwy6nx47.cloudfront.net\/uploads\/2024\/05\/GhazalGeshnizjani-crEvanPappas_PerimeterInstitute-02-498x520.webp","thumbnail-width":498,"thumbnail-height":520,"medium":"https:\/\/d2r55xnwy6nx47.cloudfront.net\/uploads\/2024\/05\/GhazalGeshnizjani-crEvanPappas_PerimeterInstitute-02.webp","medium-width":1350,"medium-height":1409,"medium_large":"https:\/\/d2r55xnwy6nx47.cloudfront.net\/uploads\/2024\/05\/GhazalGeshnizjani-crEvanPappas_PerimeterInstitute-02-768x802.webp","medium_large-width":768,"medium_large-height":802,"large":"https:\/\/d2r55xnwy6nx47.cloudfront.net\/uploads\/2024\/05\/GhazalGeshnizjani-crEvanPappas_PerimeterInstitute-02.webp","large-width":1350,"large-height":1409,"1536x1536":"https:\/\/d2r55xnwy6nx47.cloudfront.net\/uploads\/2024\/05\/GhazalGeshnizjani-crEvanPappas_PerimeterInstitute-02.webp","1536x1536-width":1350,"1536x1536-height":1409,"2048x2048":"https:\/\/d2r55xnwy6nx47.cloudfront.net\/uploads\/2024\/05\/GhazalGeshnizjani-crEvanPappas_PerimeterInstitute-02.webp","2048x2048-width":1350,"2048x2048-height":1409,"square_small":"https:\/\/d2r55xnwy6nx47.cloudfront.net\/uploads\/2024\/05\/GhazalGeshnizjani-crEvanPappas_PerimeterInstitute-02-160x160.webp","square_small-width":160,"square_small-height":160,"square_large":"https:\/\/d2r55xnwy6nx47.cloudfront.net\/uploads\/2024\/05\/GhazalGeshnizjani-crEvanPappas_PerimeterInstitute-02-520x520.webp","square_large-width":520,"square_large-height":520}},"largeForPrint":false,"externalLink":"","original_resolution":false}}</script>
</div>

I suspect there is a script that finds every <script type="text/template"> element and parses the JSON inside. It feels like something highly specific to this website.

Hope this can help someone find a rewrite rule that works.
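
For illustration, the lookup suspected above could be sketched in Python roughly as follows. This is neither Miniflux code nor the site's actual script: the class name `TemplateImageExtractor` is hypothetical, and the payload layout (`"type": "Image"` with the URL under `data.src`) is assumed from the single example quoted above.

```python
import json
from html.parser import HTMLParser


class TemplateImageExtractor(HTMLParser):
    """Collects image URLs from <script type="text/template"> JSON payloads.

    Hypothetical helper; the payload layout is assumed from one example page.
    """

    def __init__(self):
        super().__init__()
        self._in_template = False
        self._buffer = []
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # Start buffering only for script elements of the template type.
        if tag == "script" and dict(attrs).get("type") == "text/template":
            self._in_template = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so collect all of them.
        if self._in_template:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "script" or not self._in_template:
            return
        self._in_template = False
        try:
            # strict=False tolerates raw line breaks inside JSON strings.
            payload = json.loads("".join(self._buffer), strict=False)
        except json.JSONDecodeError:
            return  # Not JSON: ignore this template.
        if isinstance(payload, dict) and payload.get("type") == "Image":
            data = payload.get("data")
            src = data.get("src") if isinstance(data, dict) else None
            if src:
                self.image_urls.append(src)
```

Feeding the page source to an instance with `feed()` fills `image_urls` with the decoded `src` values, which a pre-processing step outside Miniflux could then turn into ordinary `<img>` tags.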

ztec commented 3 months ago

Regarding the original @j0ra0 case: Using https://www.adressa.no/forbruker/i/P4Wm2R/de-har-laget-paaskequiz-fra-troendelag-folk-har-virkelig-bydd-paa-seg-sjoel as analysis base:

The actual source content is the one Miniflux sees:

<figure>
<img src="" alt="" loading="lazy">
</figure>

The final HTML is generated by JavaScript and uses a data structure in the page that looks like this:

{
    caption: {value: "Dette vil du se mer av i påska!"},
    byline: {title: M},
    imageAsset: {id: "3ad9bd86-3263-4391-85e6-d6f014335811", size: {width: E, height: F}},
    type: H
},

to generate:

<figure class="media image ratio-wide original"><div class="ratio zoomable" style="padding-top: 69.03%;"><img src="https://vcdn.polarismedia.no/3ad9bd86-3263-4391-85e6-d6f014335811?fit=clip&amp;h=500&amp;q=80&amp;tight=false&amp;w=700" alt="" class="loaded"></div> <figcaption>
  Dette vil du se mer av i påska!
  <span><strong>Foto:</strong> Adresseavisen</span></figcaption></figure>

This feels highly specific to this website, and I don't see any way to solve it generically.
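
As a rough sketch of what the site's script appears to do, the image URL can be rebuilt from the `imageAsset` id. The host and query parameters below are copied from the generated markup quoted above; the function name `build_image_url` and its defaults are hypothetical, and the site may use other sizes (the first example in this thread has `h=600` and `w=800`).

```python
from urllib.parse import urlencode

# Host taken from the generated <img> tag quoted in this thread.
CDN_BASE = "https://vcdn.polarismedia.no"


def build_image_url(asset_id, width=700, height=500, quality=80):
    """Rebuild a CDN image URL from an imageAsset id (hypothetical helper)."""
    # Parameter names and order mirror the URL seen in the rendered page.
    query = urlencode({
        "fit": "clip",
        "h": height,
        "q": quality,
        "tight": "false",
        "w": width,
    })
    return f"{CDN_BASE}/{asset_id}?{query}"
```

The catch remains the one described above: the asset id lives in a JavaScript data structure, not in the static HTML, so a generic scraper has nothing to pass to such a function.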

hyyyjinx commented 3 months ago

For the time being, I have resorted to opening the external link for every feed entry instead of using the reader view, as there are too many issues across sites.

There seem to be a fair number of site-specific gotchas that cause rendering problems in Miniflux: missing images, videos not loading, duplicate images, superfluous text ("Related Links" etc.). Solving all of these feels like an immense undertaking and a game of whack-a-mole. That would not fit well in a project that values 'simplicity' in design.

It does, however, impact the user experience pretty severely and potentially deters new users from Miniflux.

Maybe the minimized/compact reader view is a problem too complex for Miniflux to solve and should instead be 'outsourced'. The Firefox Readability library is open source and would potentially be a more robust solution: https://github.com/mozilla/readability

There might be other libraries available as well to solve this issue; the Firefox library was just the first I stumbled upon that seems well maintained.

@fguillot Do you have any thoughts on this? Do you see offloading site rendering to an external library as fitting within the Miniflux project scope and roadmap in the long run?

ztec commented 3 months ago

I've also seen people put an RSS translator between the original feed and Miniflux, delegating this kind of analysis and behaviour to external software specifically tailored for the purpose. This could be a more robust solution that avoids playing whack-a-mole inside Miniflux itself.

fguillot commented 4 days ago

All the problems mentioned in this issue are related to JavaScript. Miniflux's web scraper does not interpret JavaScript. It fetches the HTML web page and sanitizes the content. That's it. Optionally, you can define your own rules to fetch the relevant content if Miniflux's Readability implementation does not work well.
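
A quick way to check whether a page falls into this category is to count `<img>` tags with an empty `src` in the static HTML, which is what a scraper without JavaScript receives. The sketch below is only a diagnostic and not Miniflux's code; the names `EmptyImageFinder` and `count_empty_images` are hypothetical.

```python
from html.parser import HTMLParser


class EmptyImageFinder(HTMLParser):
    """Counts <img> tags and how many of them lack a usable src attribute."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.empty = 0

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        self.total += 1
        # A missing, empty, or whitespace-only src all count as "empty".
        src = dict(attrs).get("src") or ""
        if not src.strip():
            self.empty += 1


def count_empty_images(html):
    """Return (empty, total) image counts for a static HTML document."""
    finder = EmptyImageFinder()
    finder.feed(html)
    finder.close()
    return finder.empty, finder.total
```

If most images come back empty for HTML fetched without a browser, the page is probably filling them in with JavaScript, and no rewrite rule working on that static HTML alone can recover them.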

These websites are not designed to fall back gracefully if JavaScript is disabled or not supported by the user agent.

The default HTML content provided in the RSS feed works fine. The web scraper is an optional feature, and it won't work all the time for websites that rely solely on JavaScript.

If you are looking for something more advanced that can interpret JavaScript, then the best solution is to use a headless browser to scrape websites, but that complicates everything.
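
For anyone who wants to try that route outside Miniflux, a minimal sketch using the third-party Playwright library for Python could look like this. It assumes Playwright and a Chromium build are installed separately; the function name `fetch_rendered_html` and the timeout default are hypothetical, and nothing here is part of Miniflux.

```python
def fetch_rendered_html(url, timeout_ms=30000):
    """Load a page in headless Chromium and return the HTML after scripts ran.

    Hypothetical helper; requires the third-party `playwright` package.
    """
    # Imported lazily so the module still loads where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles so injected images exist.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

The returned HTML would still have to be cleaned up and served to Miniflux as a feed by some intermediate tool, which is the extra complexity mentioned above.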

Or... just click on the article link and view it in your web browser. After all, there must be a reason why these websites show only an excerpt of articles in their RSS feeds.