shouya / rss-funnel

Self-hosted RSS multi-tool
https://rss-funnel-demo.fly.dev
GNU General Public License v3.0

Ensure `simplify_html` Provides Complete Output #121

Closed: tillcash closed this issue 4 months ago

tillcash commented 4 months ago

I believe the simplify_html function requires the full_text method to work effectively. Mentioning only simplify_html on the URL parameter can cause confusion for end users, as it does not provide the expected output on its own.

If it's not possible to modify the function, we can update the wiki to highlight that for readability purposes, both simplify_html and full_text need to be used together.

shouya commented 4 months ago

The simplify_html filter takes in a feed whose entries may contain excessive HTML tags in the body and strips those tags away. I suppose it's mostly useful if the body was fetched using the full_text filter, but simplify_html and full_text, in my humble opinion, refer to two independent ways of processing the feed.
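To make the behaviour described above concrete, here is a minimal Python sketch of tag stripping. It is not rss-funnel's actual implementation; the allow-lists `KEEP_TAGS` and `KEEP_ATTRS` and the helper names are assumptions chosen for illustration. It also shows why a feed whose body is only a short plain-text summary comes out unchanged.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-lists; the real filter's rules may differ.
KEEP_TAGS = {"p", "a", "img", "br", "ul", "ol", "li", "blockquote", "pre", "code"}
KEEP_ATTRS = {"href", "src", "alt"}
VOID_TAGS = {"br", "img"}  # tags that never get a closing tag


class Simplifier(HTMLParser):
    """Re-emits only allow-listed tags and attributes, keeping all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in KEEP_TAGS:
            return  # drop the tag itself; its text content is still kept
        kept = ""
        for name, value in attrs:
            if name in KEEP_ATTRS:
                kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in KEEP_TAGS and tag not in VOID_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Note: this sketch also keeps the text inside <script>/<style>.
        self.out.append(escape(data, quote=False))


def simplify_html(body):
    parser = Simplifier()
    parser.feed(body)
    parser.close()
    return "".join(parser.out)


messy = '<div class="x"><span style="color:red">Hello</span> <a href="https://example.com" onclick="x()">world</a></div>'
print(simplify_html(messy))
print(simplify_html("Short summary"))  # nothing to strip, so nothing changes
```

With a body that has no markup to begin with, the function returns its input as-is, which matches the "it does nothing" experience when the full text was never fetched.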

Sorry, I may have misunderstood your question here. I'm a bit confused about what you mean by "Mentioning only simplify_html on the URL parameter". Could you elaborate a bit?

Oh, by the way, the wiki should be open for everyone to edit. Please feel free to edit the wiki directly if you see fit.

tillcash commented 4 months ago

> The simplify_html filter takes in a feed whose entries may contain excessive HTML tags in the body and strips those tags away.

I think this information is missing from the wiki, which causes confusion. I tried a couple of feeds that do not include the full content, such as `127.0.0.1:4080/otf?source=https://www.thehindu.com/sci-tech/health/feeder/default.rss&limit=1&simplify_html`, and the filter appeared to do nothing. I didn't know that it only strips HTML tags from the content already present in the feed.
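For reference, a URL that requests both filters could be assembled as in the sketch below. The `/otf` path and the `source`, `limit`, and `simplify_html` parameters come from the URL above; passing `full_text` as a bare flag in the same way is an assumption, and `otf_url` is a hypothetical helper, not part of rss-funnel.

```python
from urllib.parse import urlencode


def otf_url(base, source, filters=(), **params):
    """Builds an on-the-fly endpoint URL with a percent-encoded source feed."""
    query = urlencode({"source": source, **params})
    for name in filters:
        # Assumption: filters without arguments are passed as bare flags.
        query += "&" + name
    return "{}/otf?{}".format(base, query)


url = otf_url(
    "http://127.0.0.1:4080",
    "https://www.thehindu.com/sci-tech/health/feeder/default.rss",
    filters=["full_text", "simplify_html"],
    limit=1,
)
print(url)
```

Percent-encoding the `source` value keeps any `&` or `?` inside the feed URL from being read as part of the outer query string.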

So, I opened this issue to suggest that simplify_html should automatically run full_text first to provide the full content. Would you consider adding a new filter that combines full_text and simplify_html for simplicity?

Additionally, I suggest refining the keep_only / discard filters so that they apply only to the title by default.

tillcash commented 4 months ago

I have updated the wiki entry for simplify_html. Please provide guidance accordingly.

shouya commented 4 months ago

Thank you for your clarification. I have made some changes on top of your update.

Note that different feed formats use different fields for the content. In Atom, the content is usually found in the <content> or <summary> tag, while in RSS the body is more often found in <description>. It is quite a messy thing to deal with. In the code and some of the docs, I use the generic term "body" to avoid confusion.
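As a rough illustration of the field differences described above, the Python sketch below pulls a generic "body" out of either format using only the standard library. It is not how rss-funnel does it; `entry_bodies` is a hypothetical helper, and it ignores cases such as Atom `type="xhtml"` content or RSS `content:encoded`.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 XML namespace


def entry_bodies(feed_xml):
    """Returns the body text of each entry, or None where no body field exists."""
    root = ET.fromstring(feed_xml)
    bodies = []
    if root.tag == ATOM + "feed":
        for entry in root.iter(ATOM + "entry"):
            # Prefer <content>, fall back to <summary>.
            node = entry.find(ATOM + "content")
            if node is None:
                node = entry.find(ATOM + "summary")
            bodies.append(node.text if node is not None else None)
    else:
        # Treat anything else as RSS, where the body lives in <description>.
        for item in root.iter("item"):
            node = item.find("description")
            bodies.append(node.text if node is not None else None)
    return bodies


rss = '<rss version="2.0"><channel><item><title>t</title><description>&lt;p&gt;hi&lt;/p&gt;</description></item></channel></rss>'
atom = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><summary>short</summary></entry>"
    '<entry><content type="html">&lt;b&gt;full&lt;/b&gt;</content><summary>s</summary></entry>'
    "</feed>"
)
print(entry_bodies(rss))
print(entry_bodies(atom))
```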