miniflux / v2

Minimalist and opinionated feed reader
https://miniflux.app
Apache License 2.0
6.84k stars 721 forks source link

Weird figure without caption #200

Closed somini closed 6 years ago

somini commented 6 years ago

The following URL: https://www.bbc.co.uk/news/technology-44482380

Produces a figure with the following HTML:

<figure>
  <img alt="Cartoon of Tony the IT admin receiving an email from a hacker" src="/proxy/aHR0cHM6Ly9pY2hlZi5iYmNpLmNvLnVrL25ld3MvMzIwL2Nwc3Byb2RwYi8xODdFL3Byb2R1Y3Rpb24vXzEwMjgwNzI2MF9jeWJlcnNlY18xX2hpZ2hyZXMuanBn">
                 Image copyright
                 TOM HUMBERSTONE
  <figcaption>
                Image caption
                    The day starts badly for IT administrator Tony Lewis when he reads an email from a hacker
  </figcaption>
</figure>

That "Image copyright" text is not inside a caption, so it gets rendered as regular text, inline with the image. It should be put inside a figcaption like the rest, so it can be rendered nicely.

I think this comes from this piece of code, but I'm not sure. https://github.com/miniflux/miniflux/blob/6d25e02cb5e3ddd9f3fc8dfb44cf29c98866d587/reader/rewrite/rewrite_functions.go#L34

somini commented 6 years ago

Duh, it's not the rewriter. This is the source HTML.

<figure class="media-landscape has-caption full-width lead">
            <span class="image-and-copyright-container">
                <img class="js-image-replace" alt="Cartoon of Tony the IT admin receiving an email from a hacker" src="https://ichef.bbci.co.uk/news/660/cpsprodpb/187E/production/_102807260_cybersec_1_highres.jpg" data-highest-encountered-width="660" width="976" height="549">
                 <span class="off-screen">Image copyright</span>
                 <span class="story-image-copyright">TOM HUMBERSTONE</span>
            </span>
            <figcaption class="media-caption">
                <span class="off-screen">Image caption</span>
                <span class="media-caption__text">
                    The day starts badly for IT administrator Tony Lewis when he reads an email from a hacker
                </span>
            </figcaption>
        </figure>

Should I just use the rewriter, or is it worth to special case wrapping all free text inside figure with figcaption?

dzaikos commented 6 years ago

I doubt the add_image_title rewriter is doing this. The source HTML doesn't have a title attribute in the img tag, so the rewriter wouldn't even process it.

The problem, I think, is that the markup is invalid. They're using span tags (which are inline) with img tags (which are block). This isn't really the way you're supposed to do it.

If the rewriter finds other img tags with title attributes in the source it does run the source through GoQuery, which affects the markup (behind the scenes it's interpreting the HTML, similar to a web browser). But that doesn't change the fact that their source is a little messy. Such is the web, though.

Do you need the rewriter for a site that's already putting in captions?

Edit: Come to think of it, the entry is shown on the UI after being run through GoQuery* for the proxy, etc. so the markup is being changed as a result of that regardless of the rewriter. I don't think there's a solution other than the source site fixing their markup.

*I haven't gone through GoQuery's code but I'm guessing it's just abstracting GoLang's HTML processor. The end result, regardless, is it trying to "fix" invalid markup (again, just like a browser would).

somini commented 6 years ago

I double checked and the rewriter is not configured for that feed, it's pure Web messiness.

I'll live with it, this is not a "bug" on Miniflux per-se.