Closed rochos-foniem closed 5 years ago
hey @rochos-foniem, could you share with me your watcher configuration, and how often it's scheduled? Trying to trick captchas is outside the scope of watchme, but I can take a look and see if I can be of any help.
FYI,
On some popular websites (like Amazon), your task may sometimes return None. I suspect those sites have anti-scraping protection in place (which may return HTTP 403).
Adding a User-Agent header to the configuration bypasses it:
```ini
[task-bitcoin]
url = https://www.cointelegraph.com/bitcoin-price-index
func = get_task
active = true
selection = .price-value
get_text = true
type = urls
header_User-Agent = Mozilla/5.0
```
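To illustrate what the `header_User-Agent` setting changes (this is a standalone sketch using Python's `urllib`, not watchme's internals): without the setting, the HTTP client announces itself as `Python-urllib/x.y`, which many anti-scraping filters reject with 403; with it, the request looks like a browser.

```python
import urllib.request

url = "https://www.cointelegraph.com/bitcoin-price-index"

# Without an explicit header, urllib would send "User-Agent: Python-urllib/x.y".
default_req = urllib.request.Request(url)

# With header_User-Agent = Mozilla/5.0, the request carries a browser-like UA.
patched_req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# Note: urllib normalizes stored header names to "User-agent".
print(patched_req.get_header("User-agent"))  # Mozilla/5.0
```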
Wow, the detection bots are simple enough that adding the header resolves the issue?
Should we have this done by default? And then if so, should the user agent be randomly selected (from what choices) or consistent for the lifecycle of a task?
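The two policies above could be sketched like this (a hypothetical illustration, not watchme's actual API; the function name, policy names, and User-Agent strings are all made up for the example):

```python
import random
import zlib

# A small pool of browser-like User-Agent strings (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def pick_user_agent(task_name, policy="consistent"):
    """Pick a User-Agent for a task.

    'random'     -> a fresh random choice on every run.
    'consistent' -> the same choice for the lifecycle of a task,
                    derived deterministically from the task name.
    """
    if policy == "random":
        return random.choice(USER_AGENTS)
    # crc32 is deterministic across runs (unlike Python's salted hash()),
    # so a task keeps the same User-Agent between scheduled runs.
    return USER_AGENTS[zlib.crc32(task_name.encode()) % len(USER_AGENTS)]
```

The "consistent" policy arguably looks less bot-like, since a real browser doesn't change its User-Agent between visits.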
And I'll mention that I hope you are using this within the reasonable limits of whatever rules a website might have. I certainly can't control your use, but I can encourage you to respect them :)
@SCHKN we could also update the docs with your example. Let me know.
Thanks @SCHKN and @rochos-foniem, the latest release adds a User-Agent header by default, so many of these previously null values should now be populated. I tested it with my (highly useful) pusheen-watcher, and you can see the Amazon task now returns an actual value.
Closing issue, please comment further if there are any issues.
Hello, thanks for the project. I've been trying to monitor different pages, and some of them block me with a captcha (for example, amazon.com). Would it be possible to implement something like proxycrawl.com, or perhaps add an option to use proxies? In any case, thanks for this project!