thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
https://thp.io/2008/urlwatch/

not able to correctly load page for homedepot and walmart #705

Open sentinel3 opened 2 years ago

sentinel3 commented 2 years ago

Hi, great tool! I just set urlwatch up and it works on one test page. But when I explored further and added one product each from Home Depot and Walmart, both failed.

$cat urls.yaml
name: "HomeDepot"
url: "https://www.homedepot.ca/product/hampton-bay-1-person-braided-woven-egg-patio-swing/1001582001"
filter:
  - xpath: //span[@class="hdca-product__description-pricing-price-value"]
  - html2text
---
name: "Walmart"
url: "https://www.walmart.ca/en/ip/hometrends-egg-swing-with-stand-black/6000203713927"
filter:
  - xpath: //span/span[@class="css-2vqe5n esdkp3p0" and @data-automation="buybox-price"] 
  - html2text
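As a sanity check independent of urlwatch, the xpath expression itself can be exercised against a small HTML sample. The sketch below uses Python's standard-library ElementTree (which supports a limited XPath subset covering this attribute predicate) on sample markup whose structure is assumed from the xpath in the Home Depot job above; the price value is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Well-formed sample mimicking the assumed Home Depot price markup.
SAMPLE = (
    '<html><body>'
    '<span class="hdca-product__description-pricing-price-value">$299.00</span>'
    '</body></html>'
)

root = ET.fromstring(SAMPLE)
# Same predicate as the job's xpath filter, in ElementTree's XPath subset.
prices = [
    span.text.strip()
    for span in root.findall(
        ".//span[@class='hdca-product__description-pricing-price-value']"
    )
]
print(prices)  # -> ['$299.00']
```

If the same expression returns nothing against the real downloaded page, the span is most likely inserted by JavaScript or the request was blocked before reaching the product markup.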

Then I tried to test the filters with urlwatch --test-filter, and both jobs give me an empty result. I then commented out the xpath filter in the settings above and ran with --verbose:

$urlwatch --test-filter 1 --verbose #xpath filter commented
...
...connectionpool DEBUG: Starting new HTTPS connection (1): www.homedepot.ca:443

For this Home Depot job, it hangs here! Following #575, I added headers: User-Agent: <redacted> to the job; it then runs, but still gives no result.
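For reference, the headers option in urlwatch is a mapping on the job itself, not a comment. A minimal sketch of the Home Depot job with a User-Agent attached (the User-Agent string here is a placeholder, not the redacted value from above):

```yaml
name: "HomeDepot"
url: "https://www.homedepot.ca/product/hampton-bay-1-person-braided-woven-egg-patio-swing/1001582001"
headers:
  # Placeholder; substitute a User-Agent string matching a real browser.
  User-Agent: "Mozilla/5.0 (X11; Linux x86_64)"
filter:
  - xpath: //span[@class="hdca-product__description-pricing-price-value"]
  - html2text
```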

Then I tried to test the filter for the Walmart job:

$urlwatch --test-filter 2 --verbose #xpath filter commented
Skip to main ...
JavaScript is Disabled
      Sorry, this webpage requires JavaScript to function correctly.
      Please enable JavaScript in your browser and reload the page.

I tried the suggestion from #465 and downloaded the Walmart page directly: curl immediately replies that the request is blocked, while wget downloads a page containing the text:

Are you human?
Seems like a silly question, we know. But, we want to keep robots off of Walmart.ca! 

It seems Walmart blocks this kind of automated watching?

Any help is appreciated, thanks!

thp commented 2 years ago

Yes, if the page applies a Captcha to keep automated tools from grabbing its contents, there's not much we can do. Have you checked whether Walmart or Home Depot provides an API for retrieving pricing information?

Maybe it's possible to use such an API for that purpose:

https://developer.walmart.com

sentinel3 commented 2 years ago

Thank you for the quick response! I believe the human detection applies only to the Walmart case; the homedepot.ca job hangs on the HTTPS (port 443) connection, and I am not sure whether that is caused by the same reason. I previously built some similar small projects using requests-html and BS4 plus a headless browser, and I did not encounter human/robot detection on most commercial sites, so maybe I will go back and give walmart.ca a try. Anyway, thank you for the suggestion about the Walmart API.

thp commented 2 years ago

Have you tried using a browser job (https://urlwatch.readthedocs.io/en/latest/jobs.html#browser — just change url to navigate), which uses a headless variant of the Chrome browser to load the page? Maybe that is "good enough" to avoid triggering the Captcha.
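For the Walmart job above, that would mean replacing url with navigate, along the lines of the following sketch (this assumes urlwatch was installed with the optional Pyppeteer dependency that browser jobs require; the filters stay the same):

```yaml
name: "Walmart"
# navigate instead of url makes this a browser job (headless Chrome).
navigate: "https://www.walmart.ca/en/ip/hometrends-egg-swing-with-stand-black/6000203713927"
filter:
  - xpath: //span/span[@class="css-2vqe5n esdkp3p0" and @data-automation="buybox-price"]
  - html2text
```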