miniflux / v2

Minimalist and opinionated feed reader
https://miniflux.app
Apache License 2.0

srcset replacement on scraping resulting in missing images #2759

Closed gudvinr closed 3 months ago

gudvinr commented 4 months ago

Here's the original element from the page:

<figure class="medium-image" style="position: relative; padding-bottom: 51.0%; margin: 0" itemprop="associatedMedia" itemscope="" itemtype="http://schema.org/ImageObject">
      <img itemprop="thumbnail" class="medium-image" src="https://example.com/images/21365/500px.jpg" alt="Alt description with whatever content" srcset="images/21365/500px.jpg 500w, images/21365/1000px.jpg 1000w, images/21365/1500px.jpg 1500w" sizes="100vw" style="position: absolute" width="500" height="255">
      <meta itemprop="caption" content="Alt description with whatever content">
</figure>

I've added the scraper rule [itemprop="associatedMedia"], and Miniflux renders this element as follows:

<figure>
    <img src="https://example.com/images/21365/500px.jpg" alt="Alt description with whatever content" srcset="https://example.com/path/to/page/images/21365/500px.jpg 500w, https://example.com/path/to/page/images/21365/1000px.jpg 1000w, https://example.com/path/to/page/images/21365/1500px.jpg 1500w" sizes="100vw" width="500" height="255" loading="lazy">
</figure>

The srcset URLs here are prepended with https://example.com/path/to/page/, which is the root of the "External link" that looks like /https://example.com/path/to/page/(\d+)/. Requests to the resulting URLs return a 404.

However, the browser downloads these images from https://example.com/images/21365/1500px.jpg (note the missing /path/to/page).
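
For reference, the prefixed URLs match what ordinary relative-URL resolution gives when the entry URL alone is used as the base. This is a minimal Python sketch of that behavior, not Miniflux's actual Go code, and the entry ID 123 is made up to fit the "External link" pattern:

```python
from urllib.parse import urljoin

# Hypothetical entry URL matching the "External link" pattern above.
page_url = "https://example.com/path/to/page/123"

# Resolved against the entry URL alone, the last path segment ("123") is
# dropped and the relative path is appended to what remains.
resolved = urljoin(page_url, "images/21365/1500px.jpg")
print(resolved)
# https://example.com/path/to/page/images/21365/1500px.jpg
```

The result has the same /path/to/page/ prefix as the rendered srcset, which suggests the browser is resolving against a different base URL.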

fguillot commented 4 months ago

Can you provide the feed URL and/or the link to this particular webpage?

gudvinr commented 4 months ago

I believe this is from this page, but pretty much every article in this feed behaves like this.

gudvinr commented 4 months ago

@fguillot I think the issue is that Miniflux doesn't account for the <base> tag, which is present on the page.
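
To illustrate what accounting for `<base>` could look like, here is a rough Python sketch. It is not Miniflux's code: the helper names (`BaseHrefFinder`, `effective_base`, `absolutize_srcset`) are hypothetical, the entry ID 123 is made up, and `href="/"` is an assumed value, since the page's actual `<base>` is not quoted in this thread. The srcset split on commas is also deliberately naive and would break on URLs that contain commas.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Records the href of the first <base> element that has one."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def effective_base(page_url, html):
    """Return the URL that relative links in `html` should resolve against."""
    finder = BaseHrefFinder()
    finder.feed(html)
    if finder.base_href:
        # The <base> href may itself be relative to the page URL.
        return urljoin(page_url, finder.base_href)
    return page_url


def absolutize_srcset(srcset, base_url):
    """Resolve each srcset candidate URL against base_url (naive comma split)."""
    candidates = []
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue
        parts[0] = urljoin(base_url, parts[0])
        candidates.append(" ".join(parts))
    return ", ".join(candidates)


# Hypothetical page: the entry URL and the <base href="/"> value are assumptions.
page_url = "https://example.com/path/to/page/123"
html = '<html><head><base href="/"></head><body></body></html>'

base = effective_base(page_url, html)
print(absolutize_srcset("images/21365/500px.jpg 500w, images/21365/1000px.jpg 1000w", base))
# https://example.com/images/21365/500px.jpg 500w, https://example.com/images/21365/1000px.jpg 1000w
```

Under these assumptions, the srcset candidates resolve to the same URLs the browser requests, without the /path/to/page/ prefix.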