gugray / rss-parrot

Notifies Mastodon accounts about new posts in the RSS feeds they follow
https://rss-parrot.net
MIT License
109 stars, 7 forks

Birb can't add feed if it has invalid XML characters #12

Closed bannmann closed 2 months ago

bannmann commented 7 months ago

Requested feed: https://www.captain-futura.de/feed/

Birb responds:

Hm, I can't find a feed for this site.

Taking the "for this site" part verbatim, I also tried to enter the site URL https://www.captain-futura.de/ instead of its feed URL, but got the same result.

gugray commented 7 months ago

It appears the feed contains an invalid character. Go's built-in XML parser is pretty finicky about this, so it complains with this error: XML syntax error on line 535: illegal character code U+0003

I've seen this in about 3% of the 10k+ feeds from the Kagi Small Web that I imported before the release. I'll get back to this later and look for a way to sanitize the input when it has this kind of easy-to-fix problem.
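One possible way to do that kind of sanitizing, as a minimal sketch and not the fix that was actually shipped: drop every character that the XML 1.0 `Char` production forbids before handing the text to `encoding/xml`. The names `isValidXMLChar`, `sanitizeXML`, `item` and `parseTitle` are made up for this illustration, and the sample document is invented. The sketch only handles raw characters; it does not look at numeric character references.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// isValidXMLChar reports whether r is allowed by the XML 1.0 Char production.
func isValidXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

// sanitizeXML (hypothetical helper) removes every rune that XML 1.0 forbids.
// Invalid UTF-8 bytes are decoded by strings.Map as U+FFFD, which is kept.
func sanitizeXML(s string) string {
	return strings.Map(func(r rune) rune {
		if isValidXMLChar(r) {
			return r
		}
		return -1 // a negative value tells strings.Map to drop the rune
	}, s)
}

// item is a stand-in for a real feed entry; only the title is decoded.
type item struct {
	Title string `xml:"title"`
}

// parseTitle decodes a single <item> element and returns its title.
func parseTitle(doc string) (string, error) {
	var it item
	err := xml.Unmarshal([]byte(doc), &it)
	return it.Title, err
}

func main() {
	// Invented sample containing the U+0003 control character.
	raw := "<item><title>Hello\x03World</title></item>"

	if _, err := parseTitle(raw); err != nil {
		fmt.Println("raw feed:", err)
	}

	title, err := parseTitle(sanitizeXML(raw))
	fmt.Println("sanitized feed:", title, err)
}
```

Silently dropping characters changes the feed text slightly, so a real implementation might prefer to log when this happens.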

gwater commented 7 months ago

I suspect this is also messing with YouTube's Atom feeds at the moment.

For example, this YouTube feed was set up a while ago and apparently works fine:

https://rss-parrot.net/web/feeds/www.youtube.com.channel.ucb9awk0xfz2p-ny2nsxghra

However, new YouTube feeds get the

Hm, I can't find a feed for this site.

response

FueledxJarritos commented 7 months ago

Also confirming that it's affecting YouTube feeds.

gugray commented 7 months ago

The YouTube problem is unrelated; I'm opening a separate issue for it.

eiland-asbl commented 3 months ago

This feed also isn't accepted: https://fd.nl/?rss

It would be nice if the bot reported that a feed /is/ found but can't be processed.

gugray commented 2 months ago

This is the same issue as the earlier one about YT feeds. The birb trims the query string from feed URLs (everything after the question mark) because it is most often used to produce redundant machine-generated views of the same underlying stream of posts.

In rare cases the only legit feed URL is one that includes a query parameter. Unfortunately the RSS Parrot currently cannot follow these in the general case. I added exceptions only for big sites like YT that do this.
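The behavior described above could look roughly like the following. This is a sketch under assumptions, not the RSS Parrot's actual code: the allowlist `keepQueryHosts`, the function `normalizeFeedURL` and the channel ID in the example are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// keepQueryHosts is a hypothetical allowlist of hosts whose feed URLs
// are only meaningful with their query string intact.
var keepQueryHosts = map[string]bool{
	"www.youtube.com": true,
}

// normalizeFeedURL (hypothetical) drops everything after the question mark
// unless the host is on the allowlist.
func normalizeFeedURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if !keepQueryHosts[strings.ToLower(u.Hostname())] {
		u.RawQuery = ""
		u.ForceQuery = false // also drop a bare trailing "?"
	}
	return u.String(), nil
}

func main() {
	for _, raw := range []string{
		"https://fd.nl/?rss",
		"https://www.youtube.com/feeds/videos.xml?channel_id=UC_example",
	} {
		normalized, err := normalizeFeedURL(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(raw, "->", normalized)
	}
}
```

With this sketch, `https://fd.nl/?rss` collapses to `https://fd.nl/`, which is why the feed reported above is not found, while the allowlisted YouTube URL passes through unchanged.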

For now I'm sticking with this decision, but I might come back to re-evaluate it in the future if it turns out to be a major pain point.