Page not found when using User-Agent header

jshemas / openGraphScraper

Node.js scraper service for Open Graph Info and More!

MIT License


Page not found when using User-Agent header #142

Closed ahetawal-p closed 2 years ago

ahetawal-p commented 2 years ago

I am using the options below to make ogs requests:

const options = {
  url: inputUrl,
  headers: {
    'user-agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
  }
};

However, for Airbnb links I get a "Page not found" response with the options above. Example: https://www.airbnb.com/rooms/50977610?adults=2&federated_search_id=0e19e98d-0316-4af5-b012-5ce8ac79d193&source_impression_id=p3_1639862342_Jm6ObA%2F4k2zCnOw5&guests=1

Removing the User-Agent header fixes the Airbnb problem, but it breaks other links, for example TikTok.

Any ideas why some websites work with a User-Agent header while others don't? Is there a better way to handle such discrepancies in the library itself?

jshemas commented 2 years ago

It depends on how the web server is set up and how it handles the User-Agent header. I would recommend trying the request with a User-Agent header set and, if that fails, retrying without the header. (Or find a new user-agent that works for all of your use cases.)
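The retry approach described above could be sketched roughly as follows. This is only an illustration, not part of the library: `scrapeWithFallback` is a hypothetical helper, `scrape` stands in for the ogs function, and it assumes the installed version accepts the `headers` option used in the question and signals failure either by rejecting or by resolving with a truthy `error` field.

```javascript
// Hypothetical helper, not part of open-graph-scraper.
// `scrape` is assumed to be the ogs function (or anything with the same shape):
// it takes an options object such as { url, headers } and returns a promise.
async function scrapeWithFallback(scrape, url, userAgents) {
  // Try each candidate User-Agent in order, then make one last attempt with no header.
  const attempts = [...userAgents.map((ua) => ({ 'user-agent': ua })), undefined];
  let lastError;

  for (const headers of attempts) {
    const options = headers ? { url, headers } : { url };
    try {
      const data = await scrape(options);
      // Treat a resolved value with a truthy `error` field as a failed attempt.
      if (data && data.error) {
        lastError = data;
        continue;
      }
      return data;
    } catch (err) {
      // Remember the failure and fall through to the next attempt.
      lastError = err;
    }
  }

  // Every attempt failed: surface the most recent failure to the caller.
  throw lastError;
}
```

Usage would look something like `scrapeWithFallback(ogs, inputUrl, [chromeUserAgent])`, where `ogs` is the imported library function and `chromeUserAgent` is whatever string you already send. Sites that block scraping outright will still fail on every attempt.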

ahetawal-p commented 2 years ago

Any ideas on how I can get a list of user agents to test with? I found another failing case with Tripadvisor URLs: I just get a timeout exception, even after removing the User-Agent header.

https://www.tripadvisor.com/Attraction_Review-g147416-d147523-Reviews-Queen_s_Staircase-Nassau_New_Providence_Island_Bahamas.html

jshemas commented 2 years ago

I usually use the user agents listed at https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent or https://deviceatlas.com/blog/list-of-user-agent-strings

airbnb is returning a Response code 403 (Forbidden) error most of the time... But sometimes the same request works without any changes.

It looks like tripadvisor is actively blocking scraping attempts. I can't even get OGS to resolve the homepage. You might be better off using puppeteer to scrape these sorts of sites. Example: https://dev.to/andreasa/how-to-scrape-tripadvisor-reviews-with-nodejs-and-puppeteer-5gn

If you can get puppeteer to work, then you could just pass the html into OGS as the html option.
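A rough sketch of that puppeteer handoff, under some assumptions: `scrapeRenderedPage` is a hypothetical helper, `puppeteer` is the imported puppeteer module, and `scrape` is the ogs function called with only the `html` option. The `networkidle2` wait condition is just one reasonable choice, and none of this guarantees the site will not block the headless browser as well.

```javascript
// Hypothetical helper, not part of open-graph-scraper or puppeteer.
// `puppeteer` is assumed to be the imported puppeteer module and
// `scrape` the ogs function, which is given the page markup via `html`.
async function scrapeRenderedPage(puppeteer, scrape, url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Load the page in a real browser and wait until network activity settles.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Serialize the rendered DOM to an HTML string.
    const html = await page.content();
    // Hand the markup to the scraper instead of letting it fetch the URL itself.
    return await scrape({ html });
  } finally {
    // Always close the browser, even if navigation or scraping fails.
    await browser.close();
  }
}
```

Only `html` is passed to the scraper here, without `url`, since the page has already been fetched by the browser.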

ahetawal-p commented 2 years ago

Thanks, I will try these options out. I just wanted to know whether there was some other way to abstract all of this away.