Open sentinel3 opened 2 years ago
Yes, if the page uses a Captcha to keep automated tools from grabbing its contents, there's not much we can do. Have you checked whether Walmart or Home Depot provides an API for pricing information?
Maybe it's possible to use an API for that purpose?
Thank you for the quick response!
I believe the human detection applies only to the walmart.com case. HomeDepot.ca has connection error 443, and I am not sure whether that has the same cause.
I previously built a similar small project using requests-html and BS4 plus a headless browser, and I did not encounter human/robot detection on most commercial sites, so maybe I will go back and give walmart.ca a try.
Anyway, thank you for the suggestion about the Walmart API.
Have you tried using https://urlwatch.readthedocs.io/en/latest/jobs.html#browser (just change url to navigate), which uses a headless variant of the Chrome browser to load the page? Maybe this is "good enough" to avoid triggering the Captcha.
Hi, great tool! I just set urlwatch up and it works on one test page. But when I explored further and added one product each from HomeDepot and Walmart, both failed. Then I tried to test the filters for the Homedepot list:
urlwatch --test-filter
and both gave me an empty result. I then commented out the xpath in filter in the above settings and ran with verbose:
For this Homedepot list, it hangs here! I then followed #575 to add:
##headers: User-Agent: <redacted>
After that it runs, but there is still no result. Then I tried to test the filters for the Walmart list:
I tried the suggestion in #465 and tried to download the Walmart page directly:
curl directly replies blocked, while wget downloads a page with text:
It seems Walmart blocks this kind of automated watching?
Any help is appreciated. Thanks!
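For anyone who wants to repeat the curl/wget check from a script, here is a minimal Python sketch using only the standard library. It is not part of urlwatch. The names `looks_blocked`, `fetch`, and the `BLOCK_MARKERS` strings are hypothetical and chosen for illustration; real block pages differ per site, so treat the heuristic as a rough first check rather than a reliable detector.

```python
import urllib.error
import urllib.request
from typing import Tuple

# Hypothetical marker strings; adjust them to what the site actually returns.
BLOCK_MARKERS = ("captcha", "robot or human", "blocked")


def looks_blocked(status: int, body: str) -> bool:
    """Heuristic: HTTP 403/429, or a body containing a marker, counts as blocked."""
    if status in (403, 429):
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def fetch(url: str, user_agent: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Download url with an explicit User-Agent and return (status, text)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            text = response.read().decode("utf-8", errors="replace")
            return response.status, text
    except urllib.error.HTTPError as err:
        # HTTP errors such as 403 still carry a body worth inspecting.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Calling `looks_blocked(*fetch(url, user_agent))` with the product URL and the same User-Agent string used in the job gives a quick yes/no. If it reports a block even with a browser-like User-Agent, a plain HTTP request is unlikely to work and a headless browser job may be the only remaining option.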