seazon / FeedMe

The documents and forum of FeedMe
Allow to customise internal HTML parser/decoder #622

Open jimbobmcgee opened 10 months ago

jimbobmcgee commented 10 months ago

The built-in HTML decoder does not always recognise images as important content, and excludes them from the downloads. It would be nice to be able to configure this in some way.

For instance, it may be useful to allow for specifying selectors that are always included, either at a top level or for specific feeds, e.g. img#main or div.main > img:nth-child(1). These would be patterns of extra content, in addition to that which you already extract.
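
As a rough illustration of the "top level or per feed" idea, the settings could be merged as in the sketch below. The names GLOBAL_SELECTORS, FEED_SELECTORS and selectors_for are hypothetical, as is the pairing of the example selectors with the feed URL; nothing in this thread describes how FeedMe actually stores its settings.

```python
# Hypothetical settings layout, for illustration only.
# Selectors applied to every feed (the "top level").
GLOBAL_SELECTORS = ["img#main"]

# Extra selectors keyed by feed URL (the "specific feeds" level).
FEED_SELECTORS = {
    "https://www.girlswithslingshots.com/comic/rss": ["div.main > img:nth-child(1)"],
}


def selectors_for(feed_url):
    """Return the top-level selectors followed by any feed-specific ones."""
    merged = list(GLOBAL_SELECTORS)
    for selector in FEED_SELECTORS.get(feed_url, []):
        if selector not in merged:  # avoid listing the same selector twice
            merged.append(selector)
    return merged
```

A feed without its own entry simply falls back to the top-level list, so the per-feed setting stays optional.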

It may also be useful to allow overriding the User-Agent string used when fetching a particular feed's content, in case the server sends different image qualities in response to UA.

(I note that the other parsers also sometimes ignore images, or download lower-resolution images - I figured being able to configure the internal parser might be more achievable than changing the behaviour of those external services.)

Again, thanks for continuing to maintain this app!

seazon commented 10 months ago

An example would help me understand. I believe you ran into this problem on one of your feeds.

jimbobmcgee commented 10 months ago

Sure. Consider https://www.girlswithslingshots.com/comic/rss. The feed view that I see is:

[Screenshot of the feed view]

I can see that this comes from the content of the page but, since the site is a webcomic, that text content isn't the most relevant detail. I don't expect FeedMe to know this automatically, but it would be useful if there were a way for me to tell FeedMe to include additional content on a feed-by-feed basis.

The Google Web Light version is better, but gets low-resolution images, which I guess is because of the target server's reaction to the Google user agent, and not something you can change. If I could select the content I wanted, I might also need to change the User-Agent, in case the server chooses the low-resolution images for FeedMe too.

seazon commented 9 months ago

Oh, I just realised that the HTML parser you mean is the mobilizer in FeedMe. The newest version, 4.0.4, fixed a Feedbin parser issue. Please try it.

jimbobmcgee commented 9 months ago

Perhaps I am not explaining this well enough.

I would like to have more control over content extraction done by the mobilizer. I assume that, from the existing three mobilizer options, the one called FeedMe mobilizer is written by you, so is the one that you have the most control over.

For example, given the following RSS XML item (taken from the example feed above):

<item>
  <title>Girls With Slingshots - GWS Hair of the Dog #226</title>
  <description>
    <a href="https://www.girlswithslingshots.com/comic/gws-hair-of-the-dog-226"><img src="https://www.girlswithslingshots.com/comicsthumbs/1696390192-GWS226.jpg" /><br />New comic!</a><br />Today's News:<br /><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> <p>Based on a true story. (It IS more pleasant to... suck it.)</p><p><a href="https://www.girlswithslingshots.com/comic/gws226" target="_blank">Here's the original strip</a> aaaaand <a href="https://www.girlswithslingshots.com/comic/gws-chaser-226" target="_blank">here is the chaser!</a></p>
  </description>
  <link>https://www.girlswithslingshots.com/comic/gws-hair-of-the-dog-226</link>
  <author>tech@thehiveworks.com</author>
  <pubDate>Sun, 08 Oct 2023 22:00:43 -0400</pubDate>
  <guid>https://www.girlswithslingshots.com/comic/gws-hair-of-the-dog-226</guid>
</item>

This results in one of four possible views:

The feed view:

image

I presume this is taken directly from the <description> element. We can see that this contains badly-formed embedded HTML.

The image is low-quality, because the embedded <img> tag references /comicsthumbs/, which I guess is a deliberate decision by the server operator to serve a low-quality image to feed readers, and what I am eventually trying to overcome.

In this case, FeedMe does the best it can with what it has been given -- there is not much more that can be done to improve it.
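
To make the thumbnail point concrete, here is a minimal sketch that pulls the <img> sources out of the start of the <description> shown above, using only Python's standard library. The names ImageCollector and image_sources are made up for this example, and this is not a claim about how FeedMe renders the feed view.

```python
from html.parser import HTMLParser

# The first part of the <description> from the item above.
DESCRIPTION = (
    '<a href="https://www.girlswithslingshots.com/comic/gws-hair-of-the-dog-226">'
    '<img src="https://www.girlswithslingshots.com/comicsthumbs/1696390192-GWS226.jpg" />'
    '<br />New comic!</a>'
)


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that is parsed."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also end up here, because the
        # default handle_startendtag() delegates to handle_starttag().
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(html):
    """Return the image URLs referenced by an HTML fragment, in order."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.sources
```

Running image_sources(DESCRIPTION) yields a single URL, and it is the one under /comicsthumbs/, which is why the feed view can only ever show the thumbnail.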

The web view using the Feedbin mobilizer:

image

I presume this is Feedbin's parsing of the response it got from the URL in the <link> tag. However, it does not do a very good job of identifying the important content of the page. I don't use it very often, for exactly this reason.

However, since this is a third-party mobilizer, I presume that there is nothing more that you can do to improve this.

The web view using the Google Web Light mobilizer:

image

The Google mobilizer does a better job of extracting content from the page behind the <link> tag, but the image is still low-quality. I presume this is due to the webcomic's server software deliberately serving a lower-quality image to the Google mobilizer, which I also presume it determines by either the source IP address ranges or the User Agent header that Google uses.

Again, since this is a third-party mobilizer, I presume that there is nothing more that you can do to improve this.

The web view using the FeedMe mobilizer:

image

I presume the FeedMe mobilizer uses code that you have written to extract the HTML content from the web page behind the <link> tag. However, in this case, the content that you have picked is some random sidebar content from the web page and completely misses the image altogether.

I don't know how your code tries to identify the relevant part of the content, but I would like to be able to override (or at least influence) your logic. I don't expect your code to understand how every webpage is created, so I would like to be able to configure what is relevant for the feeds where your existing code does not work as well.

As such, I would like to have a feed setting where I can enter selectors for additional content that you should include in the mobilizer result. It would be an advanced setting, and would only affect the FeedMe mobilizer.

Something like (crude mockup):

image

I have suggested using the syntax of the modern browsers' JavaScript querySelector() method, but if there is an easier syntax for you to implement, that would be fine too.
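
For the "easier syntax" end of that range, a selector limited to a single #id or .class can be handled with the standard library alone. The sketch below is only an illustration under that assumption: AreaExtractor and extract_area are hypothetical names, the code is not FeedMe's, and it assumes reasonably well-formed HTML (stray closing tags inside the matched area would confuse the depth counter).

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class AreaExtractor(HTMLParser):
    """Collect the markup of the first element matching '#id' or '.class'."""

    def __init__(self, selector):
        super().__init__()
        self.kind, self.name = selector[0], selector[1:]
        self.depth = 0      # nesting depth inside the matched element
        self.done = False   # True once the matched element has been closed
        self.parts = []

    def _matches(self, attrs):
        attrs = dict(attrs)
        if self.kind == "#":
            return attrs.get("id") == self.name
        return self.name in (attrs.get("class") or "").split()

    def handle_starttag(self, tag, attrs):
        if self.done or (self.depth == 0 and not self._matches(attrs)):
            return
        self.parts.append(self.get_starttag_text())
        if tag in VOID_TAGS:
            self.done = self.depth == 0  # a matched void element is the whole area
        else:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        if self.done or (self.depth == 0 and not self._matches(attrs)):
            return
        self.parts.append(self.get_starttag_text())
        self.done = self.depth == 0

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag in VOID_TAGS:
            return
        self.parts.append(f"</{tag}>")
        self.depth -= 1
        self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth > 0:
            # Character references were decoded by the parser; re-escape them.
            self.parts.append(escape(data, quote=False))


def extract_area(html, selector):
    """Return the HTML of the first element matching selector, or ''."""
    if len(selector) < 2 or selector[0] not in "#.":
        raise ValueError("only '#id' and '.class' selectors are supported")
    extractor = AreaExtractor(selector)
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.parts)
```

For example, extract_area(page, "#cc-comic") would return just the element with that id and skip any sidebar markup around it. Supporting the full querySelector() syntax would need a proper selector engine rather than this kind of hand-rolled matching.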

I would also like to be able to configure the User Agent header that the FeedMe mobilizer uses to request HTML from servers, as a separate setting. This would let me work around cases where the target server uses this header to restrict the content that it sends.
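
In terms of the request itself, the override amounts to sending a different User-Agent header, roughly as in the sketch below. DEFAULT_USER_AGENT is a placeholder value and build_request and fetch_html are hypothetical helpers; I do not know what FeedMe sends today or how it performs its requests.

```python
import urllib.request

# Placeholder only; this is not FeedMe's real User-Agent string.
DEFAULT_USER_AGENT = "FeedMe-sketch/0.0"


def build_request(url, user_agent=None):
    """Build a GET request, using the per-feed User-Agent when one is set."""
    headers = {"User-Agent": user_agent or DEFAULT_USER_AGENT}
    return urllib.request.Request(url, headers=headers)


def fetch_html(url, user_agent=None, timeout=10):
    """Fetch a page and decode it using the charset the server declares."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Keeping the header in its own setting means it can be changed per feed without touching the content selectors.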

I only installed FeedMe 4.0.4 today (and I look forward to that fix for (null) content). The above screenshots were taken from 4.0.4, but the cached content may well be from when it was 4.0.3. I will report back if 4.0.4 significantly improves the content I see, but I expect the Feedbin mobilizer will not have improved its own content recognition significantly.

seazon commented 9 months ago

I'm pretty sure this is the longest comment I've ever seen in FeedMe issues. Give me some time to read.

seazon commented 9 months ago

I have read your comment; this is actually what FeedMe mobilizer 2.0 will do. The difference from your idea is that I would like to provide a preview window to help users pick the area they want, rather than using a technical selector. But this is not easy work, so it is not implemented yet.

I can try to implement your idea first; this won't take much time.

seazon commented 9 months ago

I have added a new input for entering the id or class; this input is only shown when the mobilizer is FeedMe. About the low-quality image: if you load the image from the feed view, the web view will also be low-quality. You can enable "Show Web when reading" to get the high-quality image.

[Screenshot 2023-10-14 at 22 15 25]

Here is my demo: https://github.com/seazon/FeedMe/assets/2791800/6be27b26-5cc7-4c0c-bf33-dede6783cb21

Lastly, this will be supported in 4.1.

seazon commented 9 months ago

@jimbobmcgee Please check the doc first: https://github.com/seazon/FeedMe/blob/master/doc/en/mobilizer.md#feedme-mobilizer-selector-supported-since-v41

jimbobmcgee commented 9 months ago

The documentation and demo both look great. I can imagine how difficult providing an interactive preview would be, so I think using a CSS selector is a reasonable alternative, at least to start with.

  1. Does your current approach let you capture more than one area (e.g. #cc-comic #blog) or would you have to find the common ancestor (#wrapper)?
  2. Does your current approach use a library that supports the full CSS selector syntax (e.g. main > div:nth-child(3)) or do you only support class and ID?

seazon commented 9 months ago

  1. Only one area.
  2. Only id and class.

seazon commented 9 months ago

4.1 released

ranggie4 commented 3 weeks ago

I realise that this is an old thread but, since some good ideas came out of it, I thought I'd share my 2 cents. I know that this is going to be a lot of work, but would it be possible to have a parsing ability using a method similar to the one Feed43 uses? It would be similar to the current system of '#' for id and '.' for class, but with the added ability to tag multiple areas and rearrange them to be parsed into the final output/article. Sorry if this doesn't make sense, as I'm not a developer and my understanding of things might be too simplistic and I'm probably making a fool of myself here.

The bonus to having a similar method/workflow as Feed43 is that FeedMe users will be able to test out parsing 'recipes' for a webpage using Feed43's site (via either a desktop or mobile) before implementing it on the FeedMe app.

Finally, perhaps after this we could have crowdsourced 'parsing recipes', if you will, and post them in the documentation area of this project. Who knows, once a modular approach is implemented, FeedMe users could simply get a recipe for whatever websites they're using, and simply add it to the mobiliser setting, or perhaps a plugin?

I would like to help with the UX/UI as well as the documentation.

seazon commented 2 weeks ago

Feed43 is too technical for normal users. As I mentioned before, tapping to select the text area is what the FeedMe mobilizer aims to do.