Closed ahetawal-p closed 2 years ago
It depends on how the web server is set up and how it handles the User-Agent header. I would recommend trying the request with a User-Agent header set and, if that fails, retrying without the header. (Or find a user agent that works for all of your use cases.)
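The retry order suggested above could be sketched roughly like this. This is only an illustration, not part of OGS: `HeaderSet`, `Fetcher`, `buildHeaderAttempts` and `fetchWithFallback` are hypothetical names, and the `fetcher` argument stands in for whatever actually performs the request.

```typescript
// Hypothetical helper types; `Fetcher` is a stand-in for the real request call.
type HeaderSet = Record<string, string>;
type Fetcher<T> = (url: string, headers: HeaderSet) => Promise<T>;

// Build the header sets to try, in order: one per User-Agent string,
// followed by a final attempt that sends no User-Agent header at all.
function buildHeaderAttempts(userAgents: string[]): HeaderSet[] {
  const withAgent = userAgents.map((ua): HeaderSet => ({ "User-Agent": ua }));
  return [...withAgent, {}];
}

// Try each header set until one succeeds; rethrow the last error if all fail.
async function fetchWithFallback<T>(
  url: string,
  userAgents: string[],
  fetcher: Fetcher<T>,
): Promise<T> {
  let lastError: unknown = new Error("no attempts were made");
  for (const headers of buildHeaderAttempts(userAgents)) {
    try {
      return await fetcher(url, headers);
    } catch (err) {
      lastError = err; // remember the failure and move on to the next header set
    }
  }
  throw lastError;
}
```

The fetcher is passed in so the same fallback logic can wrap any request function; sites that block every attempt will still fail, just with the last error surfaced.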
Any ideas on how I can get a list of user agents to test with?
Found another use case: for tripadvisor URLs, I am just getting a timeout exception, even after removing the User-Agent.
I usually use the user agents listed here -> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent OR https://deviceatlas.com/blog/list-of-user-agent-strings
airbnb is returning a Response code 403 (Forbidden) error most of the time, but sometimes the same request works without any changes.
It looks like tripadvisor is actively blocking scraping attempts. I can't get OGS to resolve the homepage. You might be better off using puppeteer to scrape these sorts of sites. Example: https://dev.to/andreasa/how-to-scrape-tripadvisor-reviews-with-nodejs-and-puppeteer-5gn
If you can get puppeteer to work, then you could just pass the HTML into OGS as the html option.
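A minimal sketch of that two-step flow, assuming the rendering and scraping steps are supplied as functions. `RenderHtml`, `ScrapeHtml` and `scrapeRenderedPage` are hypothetical names used only for illustration; the commented wiring assumes the puppeteer and open-graph-scraper packages are installed.

```typescript
// Hypothetical function types: `RenderHtml` would wrap a headless browser,
// and `ScrapeHtml` would wrap the scraper call that accepts raw HTML.
type RenderHtml = (url: string) => Promise<string>;
type ScrapeHtml<T> = (html: string) => Promise<T>;

// Render the page first, then hand the resulting HTML to the scraper.
async function scrapeRenderedPage<T>(
  url: string,
  render: RenderHtml,
  scrape: ScrapeHtml<T>,
): Promise<T> {
  const html = await render(url);
  if (html.trim() === "") {
    throw new Error(`empty HTML rendered for ${url}`);
  }
  return scrape(html);
}

// Possible wiring (an assumption, not tested against any specific site):
//   const render: RenderHtml = async (url) => {
//     const browser = await puppeteer.launch();
//     try {
//       const page = await browser.newPage();
//       await page.goto(url, { waitUntil: "networkidle2" });
//       return await page.content();
//     } finally {
//       await browser.close();
//     }
//   };
//   const scrape = (html: string) => ogs({ html });
```

Keeping the browser step separate from the scraping step means the scraper never makes the network request itself, so any User-Agent or blocking issues are handled entirely by the browser.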
Thanks, I will try these options out. I just wanted to know if there was some other way to abstract all of this away.
I am using the options below for making ogs requests:
But it seems that for Airbnb links I am getting a Page Not Found with the above options. Example: https://www.airbnb.com/rooms/50977610?adults=2&federated_search_id=0e19e98d-0316-4af5-b012-5ce8ac79d193&source_impression_id=p3_1639862342_Jm6ObA%2F4k2zCnOw5&guests=1
While removing the User-Agent header fixes the airbnb problem, it causes other links to break, for example tiktok. Any ideas why some websites work with a User-Agent while others don't? Is there a better way to handle such discrepancies in the library itself?