Closed stuartathompson closed 2 years ago
This has to do with headless browsing and DuckDuckGo.
For some odd reason, DuckDuckGo serves different results to the browser when you're in headless mode than when showing a real browser.
I had to write my own scraper with Puppeteer to get the "correct" results. Maybe consider a flag for using headless: false?
If it works with Puppeteer, it's probably related to JS - Puppeteer drives a real browser that executes JavaScript, while this repo uses a plain HTTP client. Running a browser is heavy on resources, so I decided to use Python's requests lib instead. You could try setting a user-agent with .set_headers(), but I doubt it will help.
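For reference, here is a minimal sketch of what sending a browser-like user-agent from a plain HTTP client looks like. It uses only the standard library (urllib) instead of requests so it runs without extra dependencies, and it only builds the request without sending it. The endpoint URL, the q parameter name, and the user-agent string are assumptions for illustration, not this repo's actual code.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the no-JS DuckDuckGo endpoint; the "q" parameter name is assumed too.
DDG_HTML_URL = "https://html.duckduckgo.com/html/"

# Hypothetical desktop-browser user-agent string, for illustration only.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_search_request(query, user_agent=BROWSER_UA):
    """Build (but do not send) a GET request carrying a browser-like User-Agent."""
    url = DDG_HTML_URL + "?" + urlencode({"q": query})
    return Request(url, headers={"User-Agent": user_agent})


req = build_search_request("headless browser")
print(req.full_url)
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

A header like this only changes how the client identifies itself. It does not make the client execute JavaScript, so it may well not change which results come back.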
The issue is that DuckDuckGo serves phony results to headless browsers. You have to use a visible browser to get correct results or find another way to do it. I ended up writing my own scraper for DDG.
I think a headless/non-headless mode would be really valuable for this library.
So, the difference you noticed is because we're using the no-JS version of DuckDuckGo (https://html.duckduckgo.com/html/), while the regular results are fetched from a JS file (https://links.duckduckgo.com/d.js?q=test&t=D&l=us-en&s=0&a=h_&dl=en&ct=GR&ss_mkt=us&vqd=3-271302360671697817199458226164694755283-142931334088488469610097276646263969243&p_ent=&ex=-1&sp=0). Of course, we can get this file without a browser and parse the results out of the JS code; the only problem is the vqd= parameter, but I'll see if I can reverse engineer it.
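A rough sketch of the idea, under stated assumptions: if the vqd token can be found somewhere in a page that DuckDuckGo returns (an assumption, not something confirmed here), it could be pulled out with a regular expression and placed into the d.js URL. The helper names extract_vqd and build_d_js_url are hypothetical, the sample text is made up, and only a few of the query parameters from the URL above are reproduced.

```python
import re
from urllib.parse import parse_qs, urlencode, urlsplit

# Taken from the d.js URL quoted in this thread.
D_JS_URL = "https://links.duckduckgo.com/d.js"


def extract_vqd(page_text):
    """Return the first vqd token found in page_text, or None.

    Assumption: the token looks like vqd=<digits and dashes>, optionally quoted.
    """
    match = re.search(r"vqd=['\"]?([\d-]+)", page_text)
    return match.group(1) if match else None


def build_d_js_url(query, vqd, offset=0):
    """Build a d.js URL with a subset of the parameters seen in the thread."""
    params = {"q": query, "l": "us-en", "s": offset, "vqd": vqd}
    return D_JS_URL + "?" + urlencode(params)


# Made-up sample text, not real DuckDuckGo output.
sample = "...;vqd='3-12345-67890';..."
token = extract_vqd(sample)
url = build_d_js_url("test", token)
print(url)

# Round-trip check: the token survives URL encoding unchanged.
print(parse_qs(urlsplit(url).query)["vqd"])
```

Whether the remaining parameters from the original URL are required, and where a valid token actually comes from, would still need to be checked against real responses.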
I changed DuckDuckGo to use the JS version, and now the results should be similar to those we get from a browser. I'll keep this issue open for a while, in case there are any bugs.
If I search for something on DuckDuckGo in an Incognito browser window, I get one list of results. When I run the same search with this scraper, I get a different set of results. Is there an explanation for this?
Thanks!