tasos-py / Search-Engines-Scraper

Search google, bing, yahoo, and other search engines with python
MIT License

Different results between the search engine scraper and google #57

Closed jxpeng98 closed 10 months ago

jxpeng98 commented 1 year ago

Hello,

Thanks for the outstanding library!

I recently ran into an issue where the scraper returns different results from a Google web search.

The search query is PROCTER & GAMBLE CO sustainability report. A Google web search for it returns the following results:

[image: screenshot of the Google web search results]

However, when I use the scraper:

from search_engines import Google

engine = Google()  # the original snippet used `engine` without creating it
query = 'PROCTER & GAMBLE CO sustainability report'
results = engine.search(query, 1)  # fetch one page of results
links = results.links()

The output links are:

https://us.pg.com/ 
https://twitter.com/ProcterGamble?ref_src=twsrc^google|twcamp^serp|twgr^author 
https://en.wikipedia.org/wiki/Procter_&_Gamble 
https://www.pgcareers.com/ 
https://www.linkedin.com/company/procter-and-gamble 
https://www.facebook.com/proctergamble/ 
https://pginvestor.com/ 

May I know why this happens? How can I get consistent results?

Many thanks!

jxpeng98 commented 1 year ago

I found the issue.

It is caused by the & in the query. If I change the query to PROCTER and GAMBLE CO sustainability report, the output is:

https://us.pg.com/sustainability-reports/ 
https://www.pg.co.uk/environmental-sustainability/ 
https://www.sustainability-reports.com/company/procter-gamble-nederland-bv/ 
https://www.responsibilityreports.com/Company/procter-gamble-co 
https://www.pginvestor.com/esg/esg-overview/ 
https://www.knowesg.com/esg-ratings/the-procter-and-gamble-company 
https://assets.ctfassets.net/oggad6svuzkv/6BTnYGZ9raiy4is806wCkI/dfb3ae4d8c1304f24ece241f643aed7f/2010_Full_Sustainability_Report.pdf 

Is there any way to solve this problem other than changing the character?

tasos-py commented 1 year ago

First of all, thanks for all the details. You're right, the & character changes the query from "PROCTER & GAMBLE CO sustainability report" to "PROCTER ", and so we get wrong results. I've added URL-encoding to the query, which should fix this issue.
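
For anyone landing here later, the truncation described above can be reproduced with the standard library alone. This is only a minimal sketch of the general mechanism, not the library's actual code: the `q` parameter name and the way the query string is assembled are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode

query = 'PROCTER & GAMBLE CO sustainability report'

# Assumption for illustration: the query is pasted into a "q=..." string as-is.
# The raw "&" then acts as a parameter separator.
raw_qs = 'q=' + query

# urlencode() percent-encodes reserved characters ("&" becomes "%26")
# and turns spaces into "+".
encoded_qs = urlencode({'q': query})

# parse_qs() shows what a server would read back from each string.
raw_seen = parse_qs(raw_qs)['q'][0]
encoded_seen = parse_qs(encoded_qs)['q'][0]

print(repr(raw_seen))      # the query is cut off at the "&"
print(repr(encoded_seen))  # the full query survives
print(encoded_qs)
```

With the unencoded string, everything after the & is treated as a separate (and here nameless) parameter, so only "PROCTER " is left as the search term. Encoding the query before building the URL keeps the & as part of the term.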