thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
https://thp.io/2008/urlwatch/

Document how changed pages are detected #680

Closed knutwannheden closed 2 years ago

knutwannheden commented 2 years ago

I read the documentation rather carefully, but I might still have overlooked it. I am looking for a description of how urlwatch determines that a web page has changed. Looking into the SQLite database, I notice that there is support for ETags as well as a timestamp column.

The reason I am asking is that there are web pages with and without ETags, and there are also web pages which will serve slightly different HTML for every single request. I would like to understand how urlwatch deals with the different scenarios. Thanks!

thp commented 2 years ago

It uses If-Modified-Since (timestamp) and If-None-Match (ETag) for conditional requests. The conditional requests are just an "optimization": when the server reports that nothing has changed, it doesn't need to send the document, and urlwatch assumes it has not changed. If ETag/timestamp are not available, urlwatch will request the document and compare it to the previous version.
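To make the flow above concrete, here is a minimal Python sketch of the decision logic. It is not urlwatch's actual code: the function names `conditional_headers` and `has_changed` are hypothetical, and it assumes the stored timestamp is a Unix timestamp and that the server answers a satisfied conditional request with HTTP status 304 (Not Modified).

```python
from email.utils import formatdate
from typing import Dict, Optional


def conditional_headers(etag: Optional[str], timestamp: Optional[float]) -> Dict[str, str]:
    """Build conditional request headers from values saved on a previous run.

    Hypothetical helper: `etag` and `timestamp` stand in for whatever
    was stored alongside the previous version of the page.
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if timestamp is not None:
        # HTTP dates are expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
        headers["If-Modified-Since"] = formatdate(timestamp, usegmt=True)
    return headers


def has_changed(status: int, body: Optional[str], previous_body: Optional[str]) -> bool:
    """Decide whether the page counts as changed.

    A 304 response means the server did not send the document, so it is
    assumed to be unchanged. Otherwise the new body is compared to the old one.
    """
    if status == 304:
        return False
    return body != previous_body
```

With neither an ETag nor a timestamp stored, `conditional_headers` returns an empty dict, the request is unconditional, and the result depends only on the comparison with the previous version.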

In cases where the webpage serves slightly different HTML on every request, urlwatch by default will detect this as a change every time. However, since this can be a common issue, urlwatch has "filters" which can be used to filter out the always-changing parts, or alternatively, to keep only the part that's interesting (e.g. a div with a certain ID, depending on the page).
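The idea behind filtering can be illustrated with a small Python sketch. This shows the general technique only, not urlwatch's filter implementation or its configuration syntax: the `generated-at` span, the `VOLATILE` pattern and the `filtered` function are made-up examples of an always-changing part of a page.

```python
import re

# Hypothetical always-changing fragment: a timestamp the server
# regenerates on every request.
VOLATILE = re.compile(r'<span class="generated-at">.*?</span>')


def filtered(html: str) -> str:
    """Remove the volatile fragment so only stable content is compared."""
    return VOLATILE.sub("", html)


first = '<p>Price: 10</p><span class="generated-at">12:00:01</span>'
second = '<p>Price: 10</p><span class="generated-at">12:00:07</span>'
third = '<p>Price: 12</p><span class="generated-at">12:00:09</span>'

# Unfiltered, every fetch looks like a change.
raw_changed = first != second
# Filtered, only the real content change is detected.
noise_only_changed = filtered(first) != filtered(second)
real_change_detected = filtered(first) != filtered(third)
```

Here `raw_changed` is true even though nothing of interest differs, while the filtered comparison reports a change only for `third`, where the price itself is different.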