thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
https://thp.io/2008/urlwatch/

Is it possible to avoid "403 Client Error: Forbidden for url" with the url function? #765

Open Jorman opened 1 year ago

Jorman commented 1 year ago

Hi, I'm trying to configure a very simple URL. It's a simple site where you find used items; no login is required and nothing special is needed. If I try a simple wget with the URL, it downloads the page, but if I use the URL in the urlwatch configuration, it returns the error "403 Client Error: Forbidden for url". The only way I found to avoid the error is to configure urlwatch with "navigate" instead of "url", but of course that is much slower. Is there any way to understand why "url" mode doesn't work with this site?

If you want to try it, this is my configuration:

name: "Test"
navigate: "https://www.subito.it/annunci-italia/vendita/auto/suzuki/?q=suzuki+vitara"
filter:
  - css:
      selector: 'div.ItemListContainer_container__SjEc1 > p'
diff_filter:
  - grep: '^[@+]'

Any ideas?

thp commented 1 year ago

It depends on what the server does, e.g. maybe it checks the User-Agent or some other headers (it's probably not your IP address if wget works). You can override the User-Agent header.
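
As a minimal sketch of that override, assuming the job is switched back to `url` and that the server only objects to the default User-Agent (which may not be the case here), the `headers` key of a `url` job can set the header explicitly. The User-Agent string below is an illustrative placeholder, not a value known to work with this site:

```yaml
name: "Test"
url: "https://www.subito.it/annunci-italia/vendita/auto/suzuki/?q=suzuki+vitara"
# Assumption: the 403 is triggered by the default User-Agent header.
# The value below is a placeholder; try the string your wget or browser sends.
headers:
  User-Agent: "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
filter:
  - css:
      selector: 'div.ItemListContainer_container__SjEc1 > p'
diff_filter:
  - grep: '^[@+]'
```

If the 403 persists with this, the server is probably checking other headers (for example Accept or Accept-Language) or something other than headers, and more entries under `headers` may be worth trying before falling back to `navigate`.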

trevorshannon commented 6 months ago

Sometimes you can also check the Google cache instead of the URL directly. You can't do this very often (a few times per day seems to be OK), and it won't always be as up to date as the direct URL, but it can help.

https://webcache.googleusercontent.com/search?q=cache:https://www.subito.it/annunci-italia/vendita/auto/suzuki/