mborsetti / webchanges

webchanges anonymously checks web content (including images) and commands for changes, delivering instant notifications and AI-powered summaries to your favorite platform.
https://pypi.org/project/webchanges/

Protected sites #17

Closed by powidlt 3 years ago

powidlt commented 3 years ago

Hey!

I'm using your fork for my morning updates, but more and more sites protect themselves with e.g. Cloudflare. I found no way to check them for changes: even with navigate / pyppeteer, access is denied. Sending custom headers did not solve the issue either.

Is there even a way to access these sites, or am I wasting my time?

mborsetti commented 3 years ago

Cloudflare etc. have multiple ways of blocking, some of which are IP-address based and cannot be circumvented.

But assuming that it's not an IP address issue and that you're not being blocked when browsing with Chrome on the same machine, then a combination of headers and the use of local storage for authentication as described here should do the trick.

Please let me know how it works out for you (and, if it does, whether the documentation could be improved; your help would be wonderful).

powidlt commented 3 years ago

I tried a combination of headers and user_data_dir but I still receive errors:

Traceback (most recent call last):
  File "C:\Users\un\AppData\Local\Programs\Python\Python39\lib\site-packages\webchanges\handler.py", line 133, in process
    data, self.new_etag = self.job.retrieve(self)
  File "C:\Users\un\AppData\Local\Programs\Python\Python39\lib\site-packages\webchanges\jobs.py", line 749, in retrieve
    response, etag = asyncio.run(self._retrieve())
  File "C:\Users\un\AppData\Local\Programs\Python\Python39\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Users\un\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 642, in run_until_complete
    return future.result()
  File "C:\Users\un\AppData\Local\Programs\Python\Python39\lib\site-packages\webchanges\jobs.py", line 1030, in _retrieve
    raise BrowserResponseError(('',), response_code)
webchanges.jobs.BrowserResponseError: BrowserResponseError: **Received response HTTP 429 Too Many Requests**

I can see that Chromium is accessing my user data folder but it's not working.

Here are my configs:

jobs.yaml

name: Test1234
url: myurl
use_browser: true
user_data_dir: C:\chromium

config.yaml


display:
  new: true
  error: true
  unchanged: false

job_defaults:
  browser:
    chromium_revision:
      win64: 843846
    switches:
      - --enable-experimental-web-platform-features

  all:
    headers:
      Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
      Accept-Language: de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7
      Upgrade-Insecure-Requests: '1'
      User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.0 Safari/537.36
      Viewport-Width: '1707'
      X-Http-Proto:  HTTP/1.1

The headers are copied from http://myhttpheader.com/ using Chromium.

mborsetti commented 3 years ago

Is that the only job you're running?

I searched for "Cloudflare HTTP response status code 429" and the first hit was https://support.cloudflare.com/hc/en-us/articles/115001635128-Configuring-Cloudflare-Rate-Limiting, which says the following:

Once an individual IPv4 address or IPv6 /64 IP range exceeds a rule threshold, further requests to the origin web server are blocked with an HTTP 429 response that includes a Retry-After header to indicate when the client can resume sending requests.
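As an illustration of the Retry-After header mentioned in that quote, here is a minimal Python sketch that converts the header value into a number of seconds to wait. It is not part of webchanges; the function name `retry_after_seconds` is made up for this example. It assumes the header carries either a whole number of seconds or an HTTP date, the two forms the HTTP specification allows.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header value into seconds to wait.

    Hypothetical helper (not a webchanges API). Returns None if the
    header is missing or cannot be parsed.
    """
    if value is None:
        return None
    value = value.strip()
    # Form 1: a non-negative whole number of seconds, e.g. "120".
    if value.isdigit():
        return float(value)
    # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without a usable zone as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means there is nothing left to wait for.
    return max(0.0, (when - now).total_seconds())
```

A caller that receives a 429 could pass the response's Retry-After value to this function and sleep for the returned number of seconds before trying again.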

Also, try running webchanges with -v or --verbose (same thing) to see if you get even more information (I doubt it, but worth a shot).

powidlt commented 3 years ago

Is that the only job you're running?

On this system yes, tried it with several IP addresses but after the first shots I get 429 already...

I tried it with urlwatch with the same parameters and I get a different response with a lot more details. At first glance it seems that it's working because I can see many green lines, but at the very bottom I find:

+  <h1>Access denied</h1>
+  <p>This website is using a security service to protect itself from online attacks.</p>

mborsetti commented 3 years ago

On this system yes, tried it with several IP addresses but after the first shots I get 429 already...

So it works for a while and then "after the first shots" you're getting rate limited? I presume you're hitting the same website with loads of jobs, correct? Can you tell me more about the series of jobs you have?

This project was built with speed in mind, so it parallelizes requests as much as possible. If multiple jobs target the same site, that site may be hit with more requests per second than its rate limit rules allow, hence the 429.

I remember someone asking a while back in urlwatch for some form of job delay (to slow things down), which was never developed, and this may be the perfect use case for it. If you're willing to beta test it for me, I may have some time later this week or this weekend to write some code.
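To make the job delay idea concrete, here is a rough Python sketch of a per-host throttle. It is an illustration only and not code from webchanges or urlwatch; the class `HostThrottle` and its parameters are hypothetical. It enforces a minimum interval between requests to the same host and leaves requests to other hosts undelayed.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum interval between requests to the same host.

    Hypothetical sketch; `clock` and `sleep` are injectable so the
    behavior can be checked without actually waiting.
    """

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time the most recent request was allowed

    def wait(self, url: str) -> float:
        """Block until a request to the host of `url` is allowed; return seconds slept."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is allowed to go out.
        self._last_request[host] = now + delay
        return delay
```

Each job would call `wait(url)` right before fetching. This version is not thread-safe, so with parallel jobs the body of `wait` would need to be guarded by a lock.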

powidlt commented 3 years ago

Actually I have never seen content from the page, access was denied after I hit enter or maybe after the second enter. I didn't know -v so I cannot confirm when I was banned.

I presume you're hitting the same website with loads of jobs, correct? Can you tell me more about the series of jobs you have?

No, just one query...

I remember someone asking a while back in urlwatch for some form of job delay (to slow things down), which was never developed, and this may be the perfect use case for it. If you're willing to beta test it for me, I may have some time later this week or this weekend to write some code.

Sure, why not! To be honest, my request is not very important so please take your time.

mborsetti commented 3 years ago

Actually I have never seen content from the page, access was denied after I hit enter or maybe after the second enter. I didn't know -v so I cannot confirm when I was banned.

I hypothesize that your "first shots" flagged your "individual IPv4 address or IPv6 /64 IP range" as one that "exceeds a rule threshold", and so at this point you're being blocked immediately. No amount of delay to slow things down will help. Sorry.