miniflux / v2

Minimalist and opinionated feed reader
https://miniflux.app
Apache License 2.0

[Feature Request] Fetch original content and not shown images... #473

Open ctschach opened 4 years ago

ctschach commented 4 years ago

Again, this is more of a feature request:

The "Fetch original content" option, together with the scraper, is a life-saving tool. However, some pages use lazy-loading functions to include images, or place the image markup inside noscript tags. As a result, the images are not shown in the reader.

Having a search-and-replace function (probably based on regex) would allow us to adjust the tags, so that images are included properly in the article.

In the example below, the image is not shown because the page uses an "a-img" tag instead of a plain "img" tag.

<figure class="article-image">
<a-img width="16" height="9" layout="responsive" src="/imgs/18/2/7/6/0/7/1/8/Hassrede-4b031a0dc42fc046.jpeg" alt="Hassrede: Renate Künast geht gegen Gerichtsbeschluss zu Beschimpfungen im Netz vor" quality="85" high-dpi-quality="50" instant="" style="height: 0; padding-top: 56.25%;"><img class="a-size-defined" alt="Hassrede: Renate Künast geht gegen Gerichtsbeschluss zu Beschimpfungen im Netz vor" src="https://heise.cloudimg.io/width/610/q85.png-lossy-85.webp-lossy-85.foil1/_www-heise-de_/imgs/18/2/7/6/0/7/1/8/Hassrede-4b031a0dc42fc046.jpeg" style="display: block;"><div class="a-sizer" style="padding-top: 56.25%; display: block;"></div></a-img>
<noscript>
&lt;img
  src="https://heise.cloudimg.io/width/200/q50.png-lossy-50.webp-lossy-50.foil1/_www-heise-de_/imgs/18/2/7/6/0/7/1/8/Hassrede-4b031a0dc42fc046.jpeg"
  srcset="https://heise.cloudimg.io/width/200/q30.png-lossy-30.webp-lossy-30.foil1/_www-heise-de_/imgs/18/2/7/6/0/7/1/8/Hassrede-4b031a0dc42fc046.jpeg 2x"
  alt="Hassrede: Renate Künast geht gegen Gerichtsbeschluss zu Beschimpfungen im Netz vor"class=""
  style="width:100%;"
&gt;
</noscript>
</figure>
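
The regex-based search-and-replace requested above could look roughly like the following Go sketch. It is only an illustration under assumptions, not Miniflux's actual code: the helper name renameAImg and the shortened input are hypothetical, and the real markup also contains a nested img element that this sketch does not handle.

```go
package main

import (
	"fmt"
	"regexp"
)

// aImgTag matches the start of an opening or closing <a-img> tag.
// The capture group keeps the optional slash of a closing tag.
var aImgTag = regexp.MustCompile(`<(/?)a-img\b`)

// renameAImg is a hypothetical helper: it rewrites <a-img ...> and </a-img>
// to <img ...> and </img>, leaving all attributes untouched.
func renameAImg(content string) string {
	return aImgTag.ReplaceAllString(content, "<${1}img")
}

func main() {
	// Shortened, made-up input modeled on the markup above.
	in := `<a-img src="/imgs/example.jpeg" alt="demo"></a-img>`
	fmt.Println(renameAImg(in))
}
```

The word boundary after "a-img" keeps the pattern from touching longer tag names that merely start with the same text.
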

somini commented 4 years ago

Sounds like an addition to https://github.com/miniflux/miniflux/blob/b6f3160dbc3efe7a86d39d526a1780eb320eefd4/reader/rewrite/rewrite_functions.go#L79

Kunsi commented 3 years ago

Somewhat related to this: some sites put a static base64-encoded image in the srcset attribute to show a loading placeholder, which means the browser never loads the actual image inside Miniflux.

Can be seen "in the wild" here: https://bahnblogstelle.net/2021/04/04/hamburger-entwickelt-suchmaschine-fuer-nachtzugreisen/ (the big image below the heading)

m0nhawk commented 3 years ago

I managed to work around the srcset problem with the following rewrite rule:

replace("srcset"|"")

After adding this rewrite rule, all the images are finally shown.
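
To illustrate what this rule does to the markup, here is a small Go sketch. It assumes the rule behaves like a plain search-and-replace over the article HTML; the function names, the truncated base64 value, and the example.org URL are made up for the illustration. The second function is a narrower, hypothetical alternative that only drops srcset attributes holding a data: URI.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// dropSrcsetLiteral mimics the literal effect of replace("srcset"|""):
// every occurrence of the text "srcset" is deleted, so the attribute
// name disappears and only the src attribute stays usable.
func dropSrcsetLiteral(content string) string {
	return strings.ReplaceAll(content, "srcset", "")
}

// dataSrcset matches a srcset attribute whose value starts with "data:".
var dataSrcset = regexp.MustCompile(`\s+srcset="data:[^"]*"`)

// dropPlaceholderSrcset removes only srcset attributes that carry an
// inline placeholder and leaves real srcset values alone.
func dropPlaceholderSrcset(content string) string {
	return dataSrcset.ReplaceAllString(content, "")
}

func main() {
	// Made-up input: placeholder in srcset, real image in src.
	in := `<img srcset="data:image/gif;base64,R0lGOD" src="https://example.org/photo.jpg">`
	fmt.Println(dropSrcsetLiteral(in))
	fmt.Println(dropPlaceholderSrcset(in))
}
```

Note that the literal variant also deletes the word "srcset" wherever it appears in the article text, which may matter for technical articles.
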

decke commented 2 years ago

Thanks for the hint! That rewrite rule did the trick for me:

replace("<img "|"<ignore "),replace("a-img"|"img")
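
The effect of chaining these two rules can be sketched in Go as follows. This assumes the rules are applied left to right as plain text replacements; applyRules and the shortened input are hypothetical and only model the heise markup quoted earlier in the thread.

```go
package main

import (
	"fmt"
	"strings"
)

// applyRules is a hypothetical stand-in for the two chained replace
// rules, applied in the order they are written.
func applyRules(content string) string {
	// Step 1: neutralize the inner <img> by renaming it to an unknown tag.
	// "<a-img " is not affected because "<" is followed by "a", not "img".
	content = strings.ReplaceAll(content, "<img ", "<ignore ")
	// Step 2: rename the custom tag (opening and closing) to img.
	return strings.ReplaceAll(content, "a-img", "img")
}

func main() {
	// Shortened, made-up input: outer a-img wrapping an inner img.
	in := `<a-img src="/imgs/full.jpeg"><img class="a-size-defined" src="https://cdn.example/small.jpeg"></a-img>`
	fmt.Println(applyRules(in))
}
```

The order matters: if the a-img rule ran first, the second rule would rename both tags to "ignore" and no image would remain.
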

cryptoluks commented 2 years ago

Regarding decke's rule above:

replace("<img "|"<ignore "),replace("a-img"|"img")

For me, this loads the full-sized Shutterstock images, which are sometimes as large as 10 MB.

Instead, I used use_noscript_figure_images to pick up the smaller images from the noscript part. Moreover, removing .branding and footer led to a very clean article.

My complete rewrite rules for heise:

use_noscript_figure_images,remove(".branding,footer")
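
For readers wondering what using the noscript images involves, here is a rough Go sketch of the idea. It is an assumption-laden illustration, not the implementation behind use_noscript_figure_images: the helper noscriptImage, the regex approach, and the cdn.example URL are all hypothetical. The remove(".branding,footer") part needs a CSS selector engine and is not covered.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// noscriptBody captures everything between <noscript> and </noscript>.
// (?s) lets "." match newlines, since the markup spans several lines.
var noscriptBody = regexp.MustCompile(`(?s)<noscript>(.*?)</noscript>`)

// noscriptImage is a hypothetical helper: it returns the unescaped markup
// of the first <noscript> block, or "" if there is none.
func noscriptImage(figure string) string {
	m := noscriptBody.FindStringSubmatch(figure)
	if m == nil {
		return ""
	}
	// The fallback is stored escaped (&lt;img ... &gt;), so unescape it.
	return strings.TrimSpace(html.UnescapeString(m[1]))
}

func main() {
	// Made-up figure modeled on the escaped noscript fallback in this thread.
	figure := "<figure><noscript>\n&lt;img src=\"https://cdn.example/small.jpeg\" alt=\"demo\"&gt;\n</noscript></figure>"
	fmt.Println(noscriptImage(figure))
}
```

A real implementation would more likely walk the parsed HTML tree than use a regular expression, which is only good enough here because the input is small and well-formed.
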