Known issue [1]: The more keywords the CSV file contains, the longer the search takes. Using threads (or the sidekiq library) might help solve this issue.
Known issue [2]: Mass-searching keywords gets blocked. (Possible solutions: user-agent rotation, or the techniques in this interesting article, "Web Scraping Google Without Getting Blocked".)
When a request gets blocked, we see this page:
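The user-agent rotation mentioned in known issue [2] could look roughly like the sketch below. This is only an illustration, not what the project currently does: `USER_AGENTS`, `build_request`, and `fetch_with_rotation` are hypothetical names, the user-agent strings are placeholders, and rotation alone may not be enough to avoid being blocked.

```ruby
require "net/http"
require "uri"

# Placeholder user-agent strings (assumption: replace with real browser values).
USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/2.0",
  "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0"
].freeze

# Build a GET request for the given URI that carries the given User-Agent header.
def build_request(uri, user_agent)
  request = Net::HTTP::Get.new(uri)
  request["User-Agent"] = user_agent
  request
end

# Send one request, taking the next user agent from the rotation.
# `rotation` is an Enumerator such as USER_AGENTS.cycle, which loops forever.
def fetch_with_rotation(uri, rotation)
  request = build_request(uri, rotation.next)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end
```

A caller would create `rotation = USER_AGENTS.cycle` once and pass it to every `fetch_with_rotation` call, so consecutive requests go out with different headers.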
How searching for each keyword works
For each keyword, I sequentially send a GET request to http://www.google.com/search?q=#{keyword}&hl=en
(hl=en is used to get the results in English)
NOTE: this can be improved by sending multiple requests simultaneously.
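As a rough sketch of the request step and of the "multiple requests simultaneously" idea from the note above, the code below builds the search URL and fetches pages with a small pool of threads. The method names (`search_url`, `fetch_page`, `fetch_all`) and the `workers` parameter are hypothetical and not taken from the project. Unlike the plain interpolation shown above, this sketch URL-encodes the keyword, which is an assumption about what is wanted for keywords containing spaces.

```ruby
require "net/http"
require "uri"

# Build the search URL for one keyword. encode_www_form escapes spaces and
# other special characters so the keyword is safe inside the query string.
def search_url(keyword)
  query = URI.encode_www_form(q: keyword, hl: "en")
  URI("http://www.google.com/search?#{query}")
end

# Fetch one result page and return the raw HTML body as a string.
def fetch_page(keyword)
  Net::HTTP.get(search_url(keyword))
end

# Hypothetical helper: fetch every keyword using a few worker threads that
# pull keywords from a shared queue. A block can be passed to replace the
# default fetcher (useful for trying the code without hitting the network).
def fetch_all(keywords, workers: 4, &fetcher)
  fetcher ||= method(:fetch_page)
  queue = Queue.new
  keywords.each { |keyword| queue << keyword }
  results = {}
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      loop do
        keyword = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        html = fetcher.call(keyword)
        lock.synchronize { results[keyword] = html }
      end
    end
  end
  threads.each(&:join)
  results
end
```

Sending requests in parallel shortens the total run time but also increases the request rate, so it may make the blocking described in known issue [2] more likely.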
How extracting data from HTML works
I used Nokogiri CSS selectors to extract data from the HTML document.
total_search_results
selector: div#result-stats
total_links
I count the <a> tags in the page
total_ads
selector: div#tads[aria-label=Ads] (if not found, fall back to div#tads)
Then return element.children.count
What happened
google_search to search each keyword and store the result in the database
Insight
Proof Of Work