huginn / huginn

Create agents that monitor and act on your behalf. Your agents are standing by!
MIT License

RSSAgent #3215

Open mcanady opened 1 year ago

mcanady commented 1 year ago

Hi. About a week ago, Indeed.com started adding the following line at the top of its RSS feeds:

<?xml version='1.0' encoding='UTF-8'?>

Now RSSAgent will not read the feed, I think because it no longer starts with <rss version="2.0" ... as it used to. No events are output. There are several errors in the log, and I'm not sure which one to report. When I put the feed into a few RSS validators, the output is to the effect of "this is not an RSS feed", although the rest of the file is in RSS format. That makes me think the extra line above is the problem. The Indeed "RSS" link is below:

https://rss.indeed.com/rss?q=biotech&l=San+Diego+County%2C+CA&fromage=1

My question is, is there any way I can get RSSAgent to read this feed? If not, I know that I could use Website Agent to read it, but it will take me some time to figure out how to do this. If someone could show me some example code to get Website Agent to read the above RSS feed, I'd appreciate it.

HumanG33k commented 1 year ago

Hello,

First, you can follow https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks to insert XML tags so that they display correctly.

Do you have any error output somewhere? My guess (without more investigation) was that Indeed might not be following the RSS format, but to me their feed looks valid and follows https://www.rssboard.org/rss-specification
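
To illustrate why the declaration by itself should not be the problem, here is a minimal sketch in Python (not Huginn's Ruby, and not the parser RSSAgent actually uses). The feed content and the helper name item_titles are made up for the example; the point is only that a standard XML parser accepts a document that begins with an XML declaration before the <rss> element.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal feed that starts with an XML declaration,
# in the same way the Indeed feed is reported to. Passed as bytes,
# as it would arrive over HTTP.
FEED = b"""<?xml version='1.0' encoding='UTF-8'?>
<rss version="2.0">
  <channel>
    <title>Example jobs</title>
    <item><title>Lab technician</title></item>
  </channel>
</rss>"""


def item_titles(feed_bytes):
    """Return the <title> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_bytes)
    return [item.findtext("title") for item in root.iter("item")]


print(item_titles(FEED))
```

If a validator rejects the real feed, it is worth checking what the server actually returned, since the body may not be the feed at all.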

mcanady commented 1 year ago

Thanks but to be clear to anyone who's following this, that comment doesn't fix my problem.

HumanG33k commented 1 year ago

As I said previously, can you provide error logs? It will help us to help you. You can get one by running a dry run and providing us with the output.

mcanady commented 1 year ago

Thanks--here are two RSSAgent error logs; I don't know which one indicates the problem. Note that the feeds are definitely not empty, and that they should be producing numerous events. All my RSSAgent jobs fetching Indeed RSS items abruptly stopped on Feb. 6th, after running for years and creating 50+ events a day.

Huginn log 2.txt Huginn log 1.txt

Unending commented 1 year ago

Logs show that the page returns the following:

rss.indeed.com

Checking if the site connection is secure

rss.indeed.com needs to review the security of your connection before proceeding.

Ray ID: 7996afa69af0828c
Performance & security by Cloudflare

You might want to try the suggestions in issue #2658.
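
For anyone wanting to confirm the same thing from their own logs, here is a rough Python sketch of the check. It is only a heuristic based on the text quoted above; the function name looks_like_challenge_page and the marker strings are assumptions for illustration, not something Huginn or Cloudflare provides.

```python
def looks_like_challenge_page(body: str) -> bool:
    """Heuristic: True if the body resembles the interstitial page
    quoted above rather than an RSS document."""
    text = body.lstrip().lower()
    # A real feed starts with an XML declaration or the <rss> root element.
    if text.startswith("<?xml") or text.startswith("<rss"):
        return False
    # Phrases taken from the logged response in this thread (assumed to be
    # stable enough for a quick manual check, nothing more).
    markers = ("checking if the site connection is secure", "ray id")
    return any(marker in text for marker in markers)


blocked = "<html><body>Checking if the site connection is secure</body></html>"
feed = "<?xml version='1.0' encoding='UTF-8'?><rss version=\"2.0\"></rss>"
print(looks_like_challenge_page(blocked), looks_like_challenge_page(feed))
```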

mcanady commented 1 year ago

Thanks! I don't have time to troubleshoot. If someone can let me know some steps to take to resolve this, specific to RSSAgent, I'd appreciate it--that's a long thread and I don't know which parts of it to try.

For now, I fixed the problem by putting the Indeed RSS feeds into Inoreader folders, and inputting the resulting feeds into RSSAgent. Inoreader is only $9 a month.

virtadpt commented 1 year ago

That is Cloudflare getting in your way. You might have to look into something like PhantomJS to handle the bot-blocking for you, or find another way around it.

mcanady commented 1 year ago

Thanks--I am not a developer and don't really know how to implement PhantomJS with Huginn; any advice appreciated.

HumanG33k commented 1 year ago

You can find documentation about it in the wiki :

But for me, a better option is to contact Indeed and ask them to disable their Cloudflare protection. You can share my Twitter message: https://twitter.com/Logan__GA/status/1626359812190511104

mcanady commented 1 year ago

Thanks so much @HumanG33k !

knu commented 1 year ago

One way to deal with a broken web site is to create a WebsiteAgent with the type "text" and the mode "all" that is periodically run, then a LiquidOutputAgent as a receiver with the mode "Last Event In" and the content written in Liquid Template that makes necessary fixes using regex_replace and so on. Then an RssAgent can subscribe to the fixed feed provided by that agent.
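
As a rough illustration of the "necessary fixes" step only, here is the kind of text repair involved, sketched in Python rather than in a Liquid template. The specific fix (dropping anything that precedes the XML declaration or the <rss> element) and the name repair_feed are hypothetical; what actually needs fixing depends on how the feed is broken, and in Huginn the equivalent substitution would be written with regex_replace inside the LiquidOutputAgent content.

```python
import re

# Hypothetical repair: discard any junk (BOM, whitespace, stray text)
# that precedes the XML declaration or the <rss> root element.
_LEADING_JUNK = re.compile(r"\A.*?(?=<\?xml|<rss)", re.DOTALL)


def repair_feed(raw: str) -> str:
    """Return raw with everything before the start of the feed removed.

    Input that contains neither marker is returned unchanged.
    """
    return _LEADING_JUNK.sub("", raw, count=1)


broken = "\ufeff  stray text<rss version=\"2.0\"><channel/></rss>"
print(repair_feed(broken))
```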

mcanady commented 1 year ago

Thanks @knu ! I don't have much time right now, Inoreader is solving my problem, but maybe I'll try that when I have more time to troubleshoot.

clfsoft commented 10 months ago

This should be fixed by this pull request plus force_encoding: https://github.com/huginn/huginn/pull/3336