Closed rochos-foniem closed 5 years ago
hey @rochos-foniem, could you share with me your watcher configuration, and how often it's scheduled? Trying to trick captchas is outside the scope of watchme, but I can take a look and see if I can be of any help.
FYI,
On some popular websites (like Amazon), your task may sometimes return None. I suspect those sites have anti-scraping protection in place (which may return HTTP 403).
Adding a User-Agent header to the configuration bypasses it:
```ini
[task-bitcoin]
url = https://www.cointelegraph.com/bitcoin-price-index
func = get_task
active = true
selection = .price-value
get_text = true
type = urls
header_User-Agent = Mozilla/5.0
```
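To illustrate what the `header_User-Agent` setting changes (this is a standalone sketch using Python's `urllib`, not watchme's internals): without the setting, the HTTP client announces itself as `Python-urllib/x.y`, which many anti-scraping filters reject with 403; with it, the request looks like a browser.

```python
import urllib.request

url = "https://www.cointelegraph.com/bitcoin-price-index"

# Without an explicit header, urllib would send "User-Agent: Python-urllib/x.y".
default_req = urllib.request.Request(url)

# With header_User-Agent = Mozilla/5.0, the request carries a browser-like UA.
patched_req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# Note: urllib normalizes stored header names to "User-agent".
print(patched_req.get_header("User-agent"))  # Mozilla/5.0
```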
Wow, the detection bots are simple enough that adding the header resolves the issue?
Should we have this done by default? And then if so, should the user agent be randomly selected (from what choices) or consistent for the lifecycle of a task?
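The two policies above could be sketched like this (a hypothetical illustration, not watchme's actual API; the function name, policy names, and User-Agent strings are all made up for the example):

```python
import random
import zlib

# A small pool of browser-like User-Agent strings (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def pick_user_agent(task_name, policy="consistent"):
    """Pick a User-Agent for a task.

    'random'     -> a fresh random choice on every run.
    'consistent' -> the same choice for the lifecycle of a task,
                    derived deterministically from the task name.
    """
    if policy == "random":
        return random.choice(USER_AGENTS)
    # crc32 is deterministic across runs (unlike Python's salted hash()),
    # so a task keeps the same User-Agent between scheduled runs.
    return USER_AGENTS[zlib.crc32(task_name.encode()) % len(USER_AGENTS)]
```

The "consistent" policy arguably looks less bot-like, since a real browser doesn't change its User-Agent between visits.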
And I'll mention that I hope you are using this within the reasonable limits of whatever rules a website might have. I certainly can't control your use, but I can encourage you to respect them :)
@SCHKN we could also update the docs with your example. Let me know.
Thanks @SCHKN and @rochos-foniem, the latest release adds a User-Agent header by default, so many of these previously null values should now be populated. I tested it with my (highly useful) pusheen-watcher, and you can see the Amazon task now returns an actual value.
Closing issue, please comment further if there are any issues.
Hello, thanks for the project. I've been trying to monitor different pages, and some of them block me with a captcha (for example, amazon.com). Would it be possible to implement something like proxycrawl.com, or perhaps add an option to use proxies? In any case, thanks for this project!