Open olivierobert opened 1 year ago
Hi Olivier,
The search page HTML was the desktop version served to browsers with JavaScript turned off (the user agent was set to Internet Explorer 9 in this instance). When searching with a Chrome user agent, Google returns an HTML page with no search results; the results are rendered later by JavaScript, which makes it very hard to scrape the data from the page. So one workaround I made was to send the query as an old browser, so that Google serves the results as raw HTML without relying on JavaScript.
I have checked the results thoroughly and found that the search results in these 2 versions were identical. Please note that the query `document.querySelectorAll('a').length` returns all the `a` tags on the page, including those not related to the search results (some are UI components like buttons, previous search links, feedback links, etc.).
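To illustrate the difference between counting every `a` tag and counting only result links, here is a minimal sketch in Python using only the standard library. It is not the project's PHP code; the sample markup, the `AnchorCounter` class, and the `count_links` helper are hypothetical, and treating absolute `http` links as "results" is an assumption made purely for the example.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts every <a> start tag, similar to document.querySelectorAll('a').length."""

    def __init__(self):
        super().__init__()
        self.total = 0   # every <a> tag, with or without an href
        self.hrefs = []  # href values of the anchors that have one

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.total += 1
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def count_links(html):
    """Return (number of <a> tags, list of href values) for an HTML string."""
    parser = AnchorCounter()
    parser.feed(html)
    parser.close()
    return parser.total, parser.hrefs


# Hypothetical markup: two UI links and two result links.
SAMPLE_HTML = """
<div id="nav"><a href="/preferences">Settings</a><a href="/feedback">Feedback</a></div>
<div id="results">
  <a href="https://example.com/one">Result one</a>
  <a href="https://example.org/two">Result two</a>
</div>
"""

total, hrefs = count_links(SAMPLE_HTML)
# Assumption for this example only: result links are the absolute ones.
external = [h for h in hrefs if h.startswith("http")]
```

Here `total` covers all four anchors while `external` holds only the two result links, which is why the two counts should not be expected to match.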
Hope this explains it.
Your other question about the IP addresses and user-agent rotation was noted. I'll update the source code and the demo server along with other improvements tomorrow.
Well noted on the issue you faced with JS, but there might then be an issue with the Guzzle HTTP client. You could try this user agent, as it has been tried and validated to be working: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
As for querying all the links, that is indeed expected by the requirements (so not only search result links).
Indeed, there were discrepancies between the JavaScript and non-JavaScript versions of the search page. I just found a solution to this problem: basically, we can request the JavaScript version of the search page and then run headless Chrome to render the HTML for us to scrape. I'll push a feature branch tomorrow morning to address this.
Please refer to this commit #4014
Issue
While using specific user agents is expected, the current implementation of the scraper seems to use the mobile version of the Google search page (not the header with the 4 tabs)
As a result, the number of links and ads is quite different from the desktop version:
Expected
The scraper should use the desktop version of the Google search page:
This part of the code should be amended:
https://github.com/paladin3895/google-search/blob/9e652a7fab9f1073318c4bbb09de4fafba9de5cc/app/Utils/Search/GoogleSearch.php#L37-L42
Note that I have also not seen any implementation for the rotation of user agents:
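As a rough illustration of what user-agent rotation could look like, here is a minimal sketch in Python using only the standard library. It is not the project's PHP/Guzzle code: `make_rotation` and `build_search_request` are hypothetical names, the second user-agent string is an assumed example, and the request is only built, never sent.

```python
import itertools
import urllib.parse
import urllib.request

USER_AGENTS = [
    # Desktop Chrome user agent quoted in this thread.
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    # Assumed second desktop user agent, added only to show the rotation.
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
]


def make_rotation(agents):
    """Return an endless iterator that cycles through the given user agents."""
    return itertools.cycle(agents)


def build_search_request(keyword, rotation):
    """Build (but do not send) a search request with the next user agent."""
    query = urllib.parse.urlencode({"q": keyword})
    url = "https://www.google.com/search?" + query
    return urllib.request.Request(url, headers={"User-Agent": next(rotation)})
```

Each call to `build_search_request` with the same `rotation` object takes the next user agent in round-robin order; picking one at random per request would be an equally reasonable design.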