Results appear different than the browser

tasos-py / Search-Engines-Scraper

Search google, bing, yahoo, and other search engines with python

MIT License

513 stars, 137 forks

Results appear different than the browser #41

Closed by stuartathompson 2 years ago

stuartathompson commented 2 years ago

If I search for something on DuckDuckGo in an Incognito browser window, I get one list of results. When I run the same search through this scraper, I get a different set of results. Is there an explanation for this?

Thanks!

stuartathompson commented 2 years ago

This has to do with headless browsing and DuckDuckGo.

For some odd reason, DuckDuckGo serves different results to the browser when you're in headless mode than when showing a real browser.

I had to do my own scraper with Puppeteer to get the "correct" results. Maybe consider a flag for using headless: false?

tasos-py commented 2 years ago

If it works with Puppeteer, it's probably related to JS - Puppeteer is an emulator, while this repo uses a plain HTTP client. Browser emulation is heavy on resources, so I decided to use Python's requests lib instead. You could try setting a user-agent with .set_headers(), but I doubt it will help.
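For anyone who wants to see what "a plain HTTP client with a custom user-agent" amounts to, here is a minimal sketch. It uses the standard library's urllib rather than requests or this repo's .set_headers(), purely to stay self-contained. The endpoint is the no-JS page mentioned later in this thread, the user-agent string is a hypothetical example, and `build_request` is an illustrative name, not part of the library. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint: the no-JS DuckDuckGo page referenced later in this thread.
HTML_ENDPOINT = "https://html.duckduckgo.com/html/"

# Hypothetical browser-like User-Agent; any current browser string would do.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(query: str, user_agent: str = BROWSER_UA) -> Request:
    """Build (but do not send) a GET request carrying a custom User-Agent."""
    url = f"{HTML_ENDPOINT}?{urlencode({'q': query})}"
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("test")
print(req.full_url)
# urllib normalizes header names, so the key is stored as "User-agent".
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would actually fetch the page. No JavaScript runs either way, which is why a header change alone may not reproduce browser results.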

stuartathompson commented 2 years ago

The issue is that DuckDuckGo serves phony results to headless browsers. You have to use a visible browser to get correct results or find another way to do it. I ended up writing my own scraper for DDG.

I think a headless/non-headless mode would be really valuable for this library.

tasos-py commented 2 years ago

So, the difference you noticed arises because we're using the no-JS version of DuckDuckGo (https://html.duckduckgo.com/html/), while the regular results are fetched from a JS file (https://links.duckduckgo.com/d.js?q=test&t=D&l=us-en&s=0&a=h_&dl=en&ct=GR&ss_mkt=us&vqd=3-271302360671697817199458226164694755283-142931334088488469610097276646263969243&p_ent=&ex=-1&sp=0). Of course, we can get this file without an emulator and parse the results out of the JS code; the only problem is the vqd= parameter, but I'll see if I can reverse engineer it.
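A rough sketch of the two pieces described above: obtaining a vqd value and building the d.js URL. It rests on an unconfirmed assumption that the token is embedded in the regular search page's markup as `vqd="..."` or similar; the sample markup, the short token, and the helper names are all hypothetical. The query parameters are a subset of those in the URL above, and nothing here touches the network.

```python
import re
from typing import Optional
from urllib.parse import urlencode

JS_ENDPOINT = "https://links.duckduckgo.com/d.js"

# Assumption: the token appears in page markup as vqd="3-...", vqd='3-...' or vqd=3-...
VQD_PATTERN = re.compile(r"""vqd=["']?([\d-]+)""")


def extract_vqd(page_html: str) -> Optional[str]:
    """Return the first vqd-looking token in the markup, or None if absent."""
    match = VQD_PATTERN.search(page_html)
    return match.group(1) if match else None


def build_js_url(query: str, vqd: str, offset: int = 0) -> str:
    """Build a d.js URL from a subset of the parameters seen in the thread."""
    params = {"q": query, "t": "D", "l": "us-en", "s": offset, "vqd": vqd}
    return f"{JS_ENDPOINT}?{urlencode(params)}"


# Hypothetical markup standing in for a real search page.
sample_page = '<script>var vqd="3-123-456";</script>'
token = extract_vqd(sample_page)
js_url = build_js_url("test", token)
print(js_url)
```

If the token turns out to be computed client-side rather than embedded in the page, the extraction step would need a different approach.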

tasos-py commented 2 years ago

I changed DuckDuckGo to use the JS version, and now the results should be similar to those we get from a browser. I'll keep this issue open for a while, in case there are any bugs.
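To illustrate what parsing results out of a JS response can look like, here is a minimal sketch and not the code used in this repo. It assumes the response wraps a JSON array in a call such as `DDG.pageLayout.load('d', [...])` and that each result carries `t` (title), `u` (URL) and `a` (snippet) keys; that wrapper, those keys and the sample payload are assumptions for illustration.

```python
import json

# Assumed wrapper around the JSON array of results.
MARKER = "DDG.pageLayout.load('d',"


def parse_results(js_text: str) -> list:
    """Extract result dicts from the JSON array embedded in a JS snippet."""
    marker_pos = js_text.find(MARKER)
    if marker_pos == -1:
        return []
    start = js_text.find("[", marker_pos + len(MARKER))
    if start == -1:
        return []
    # raw_decode parses one JSON value and ignores the trailing JS code.
    items, _ = json.JSONDecoder().raw_decode(js_text[start:])
    return [
        {"title": item["t"], "link": item["u"], "text": item.get("a", "")}
        for item in items
        # Skip non-result entries such as a hypothetical next-page record.
        if isinstance(item, dict) and "t" in item and "u" in item
    ]


# Hypothetical payload; a real response may differ.
sample_js = (
    "if (DDG.pageLayout) DDG.pageLayout.load('d',"
    '[{"t":"Example Domain","u":"https://example.com/","a":"Illustrative snippet."},'
    '{"n":"/d.js?q=test&s=30"}]);'
)
results = parse_results(sample_js)
print(results)
```

Using `raw_decode` avoids writing a regex that has to balance brackets, since the decoder stops at the end of the first complete JSON value.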