huginn / huginn

Create agents that monitor and act on your behalf. Your agents are standing by!
MIT License

Alert if anything new is added to the website #2642

Open s1782662 opened 4 years ago

s1782662 commented 4 years ago

I have gone through Website Watcher, which alerts me if there is any update on a particular website. The update can be an article being added, deleted, or modified, etc.

I don't want to extract or scrape the content. Instead, I want to be notified that something has been updated. I have gone through the examples, but I couldn't understand them. I wanted to know if this scenario can be achieved with Huginn.

A small example of the above scenario would be greatly helpful.

Thanks

cantino commented 4 years ago

Yes, definitely. You could also look at diffbot or https://distill.io/

dsander commented 4 years ago

If you just want to know that something changed (without learning what changed), the WebsiteAgent will work:

{
  "expected_update_period_in_days": "2",
  "url": "https://xkcd.com",
  "type": "html",
  "mode": "on_change",
  "extract": {
    "body_html": {
      "css": "body",
      "value": "."
    },
    "text_content": {
      "css": "body",
      "value": "normalize-space(.)"
    }
  }
}

I'd recommend using either body_html or text_content: the former would create an event as soon as the HTML structure changes, the latter only if the text inside the tags changes.
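To make the difference between the two extractors concrete, here is a minimal Python sketch. It is only an illustration, not how Huginn implements `on_change`: the function names and the use of SHA-256 hashes are assumptions made for this example, and the whitespace collapsing only approximates XPath's `normalize-space()`.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of a document, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_fingerprint(html: str) -> str:
    # Analogous to watching "body_html": any change to the markup alters the hash.
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def text_fingerprint(html: str) -> str:
    # Analogous to watching "text_content": drop the tags, then collapse runs of
    # whitespace, similar to what normalize-space(.) does.
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    text = re.sub(r"\s+", " ", "".join(collector.parts)).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical page snapshots for illustration.
old = "<body><p>Hello world</p></body>"
restyled = '<body><p class="intro">Hello   world</p></body>'
reworded = "<body><p>Hello there</p></body>"

print(html_fingerprint(old) != html_fingerprint(restyled))  # markup changed
print(text_fingerprint(old) == text_fingerprint(restyled))  # text unchanged
print(text_fingerprint(old) != text_fingerprint(reworded))  # text changed
```

In this sketch a restyled page counts as changed for the HTML fingerprint but not for the text fingerprint, which is the trade-off to weigh when picking one of the two extractors.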

There is also the third-party readability_agent, which has finer-grained text extraction features; it can be combined with a ChangeDetectorAgent.

s1782662 commented 4 years ago

Thanks for the response. Just to make sure I understand: if a new article is added to the website, the application will notify me that there is a change on the website?

In addition, if a page contains a newly added link to a PDF document, I would be notified of that as well, wouldn't I?

urbanadventurer commented 4 years ago

@s1782662 this example checks only a single webpage for changes. If the change you want to detect is not on this webpage, it will not be detected.

If you want to monitor a website for new articles, does the website you are interested in have an article index to monitor? If so, then the article index would change and @dsander's example above will work for you.

If you want to monitor changes across an entire website, then you will need a different type of solution. One option would be to make Website Agents to monitor each webpage individually. Another option would be to use a third-party service that offers an API, then make a Huginn Website Agent to monitor this API for updates.
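For the one-agent-per-page option, the options JSON can be generated rather than typed by hand. The following is a rough Python sketch under stated assumptions: the helper name `website_agent_options` and the `example.com` URLs are hypothetical, and the fields are copied from the WebsiteAgent example earlier in this thread (text-only variant). Each printed block would still need to be pasted into a separate Website Agent in Huginn.

```python
import json


def website_agent_options(url: str) -> dict:
    # Hypothetical helper: same fields as the example options earlier in the
    # thread, keeping only the text_content extractor.
    return {
        "expected_update_period_in_days": "2",
        "url": url,
        "type": "html",
        "mode": "on_change",
        "extract": {
            "text_content": {"css": "body", "value": "normalize-space(.)"}
        },
    }


# Hypothetical pages to watch; replace with the pages you care about.
pages = ["https://example.com/news", "https://example.com/downloads"]

for page in pages:
    # One options block per page, ready to paste into its own agent.
    print(json.dumps(website_agent_options(page), indent=2))
```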

Here is a list of other tools that might have an API you can monitor with Huginn: https://github.com/edgi-govdata-archiving/awesome-website-change-monitoring