paladin3895 / google-search

[Feature] Use desktop version of Google search #3

Open olivierobert opened 1 year ago

olivierobert commented 1 year ago

Issue

While using specific user agents is expected, the current implementation of the scraper seems to use the mobile version of the Google search page (note the header with the 4 tabs):

image

As a result, the number of links and ads is quite different from the desktop version:

| Mobile | Desktop |
| --- | --- |
| image | image |

Expected

The scraper should use the desktop version of the Google search page:

image

This part of the code should be amended:

https://github.com/paladin3895/google-search/blob/9e652a7fab9f1073318c4bbb09de4fafba9de5cc/app/Utils/Search/GoogleSearch.php#L37-L42

Note that I have also not seen any implementation for the rotation of user agents:

image

paladin3895 commented 1 year ago

Hi Olivier,

The search page HTML was the desktop version that Google serves to browsers with JavaScript turned off (the user agent was set to Internet Explorer 9 in this instance). When searching with a Chrome user agent, Google returned an HTML page containing no search results; the results are rendered later by JavaScript, which makes the data very hard to scrape. So one workaround I made was to send the query as an old browser, which disables JavaScript and returns the results as raw HTML.
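For readers following along, here is a minimal sketch of that workaround. It is written in Python for illustration rather than in the project's PHP/Guzzle code, and it only builds the request without sending it. The Internet Explorer 9 string is a typical one and an assumption; the exact value used by the scraper is not shown in this thread. `build_search_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a typical Internet Explorer 9 user-agent string. The exact
# value used by the scraper is not shown in this thread.
IE9_USER_AGENT = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"


def build_search_request(query: str, user_agent: str = IE9_USER_AGENT) -> Request:
    """Build (but do not send) a Google search request with the given user agent."""
    url = "https://www.google.com/search?" + urlencode({"q": query})
    return Request(url, headers={"User-Agent": user_agent})


request = build_search_request("guzzle http client")
print(request.full_url)
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Swapping the `user_agent` argument is the only change needed to compare what Google returns for an old browser and for a current one.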

I have checked the results thoroughly and found that the search results in these 2 versions were identical. Please note that the query document.querySelectorAll('a').length counts every a tag on the page, including ones that are not related to the search results (some are UI components like buttons, previous search links, feedback links, etc.).
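To make the counting point concrete, the sketch below does the equivalent of document.querySelectorAll('a').length on a raw HTML string, using only the Python standard library. The sample markup is invented for illustration and is not Google's actual page structure; `AnchorCounter` and `count_anchors` are hypothetical names.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts every <a> start tag, whether or not it is a search result."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


def count_anchors(html: str) -> int:
    parser = AnchorCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Invented sample: one result link plus two UI links.
SAMPLE_HTML = """
<nav><a href="/images">Images</a></nav>
<div><a href="https://example.com/">A search result</a></div>
<footer><a href="/feedback">Send feedback</a></footer>
"""

print(count_anchors(SAMPLE_HTML))  # prints 3
```

Only one of the three links in the sample is a search result, yet all three are counted, which is why the total differs between page versions with different UI chrome.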

Hope this explains it.

paladin3895 commented 1 year ago

Your other question about the IP addresses and user-agent rotation was noted. I'll update the source code and the demo server along with other improvements tomorrow.
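As a rough idea of what user-agent rotation could look like, here is a small round-robin sketch in Python (for illustration only; the project itself is PHP). The first string is the one quoted elsewhere in this thread; the other two are illustrative assumptions, and `UserAgentRotator` is a hypothetical name, not something from the repository.

```python
from itertools import cycle

USER_AGENTS = [
    # Quoted in this thread.
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    # Illustrative assumptions, not taken from the repository.
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
]


class UserAgentRotator:
    """Hands out request headers, cycling through the pool in order."""

    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError("at least one user agent is required")
        self._pool = cycle(list(user_agents))

    def next_headers(self) -> dict:
        return {"User-Agent": next(self._pool)}


rotator = UserAgentRotator(USER_AGENTS)
for _ in range(4):
    print(rotator.next_headers()["User-Agent"])
```

Round-robin keeps the distribution even and is easy to test; picking at random with `random.choice` would be an equally simple alternative.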

olivierobert commented 1 year ago

Well noted on the issue you faced with JS, but there might then be an issue with the Guzzle HTTP client. You could try this user agent, which has been tried and validated to work: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"

As for querying all the links, it is indeed expected by the requirements (so not only search result links):

image

paladin3895 commented 1 year ago

Indeed, there were discrepancies between the JavaScript and non-JavaScript versions of the search page. I just found a solution to this problem: we can request the JavaScript version of the search page and run a headless Chrome to render the HTML for us to scrape. I'll push a feature branch tomorrow morning to address this.
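A minimal sketch of the headless Chrome idea, in Python for illustration (the actual feature branch may do this differently). It relies on Chrome's `--headless` and `--dump-dom` command-line flags, which print the rendered DOM to standard output. The binary name `google-chrome` is an assumption that varies by platform, and both function names are hypothetical.

```python
import subprocess
from typing import List, Optional
from urllib.parse import urlencode

# Assumption: the Chrome binary name; it differs between platforms.
CHROME_BINARY = "google-chrome"


def build_chrome_command(query: str, user_agent: Optional[str] = None,
                         binary: str = CHROME_BINARY) -> List[str]:
    """Build the command that renders a Google search page and dumps its DOM."""
    url = "https://www.google.com/search?" + urlencode({"q": query})
    command = [binary, "--headless", "--disable-gpu", "--dump-dom"]
    if user_agent:
        command.append(f"--user-agent={user_agent}")
    command.append(url)
    return command


def render_search_page(query: str, user_agent: Optional[str] = None,
                       timeout: float = 30.0) -> str:
    """Run headless Chrome and return the rendered HTML (needs Chrome installed)."""
    result = subprocess.run(
        build_chrome_command(query, user_agent),
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout


print(build_chrome_command("guzzle http client"))
```

Because the output is the DOM after JavaScript has run, the same scraping logic can then be applied to it as to a raw HTML response.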

paladin3895 commented 1 year ago

Please refer to this commit #4014