Closed dccmmtop closed 5 years ago
The problem is not in the in_parallel method. You can get the same error by processing urls one by one:
require 'kimurai'

class AmazonSpider < Kimurai::Base
  @name = "amazon_spider"
  @engine = :mechanize
  @start_urls = ["https://www.amazon.com/"]

  def parse(response, url:, data: {})
    browser.fill_in "field-keywords", with: "Web Scraping Books"
    browser.click_on "Go"

    urls = []
    response = browser.current_response
    response.xpath("//li//a[contains(@class, 's-access-detail-page')]").each do |a|
      url = a[:href].sub(/ref=.+/, "")
      puts url
      urls << url
    end

    # Process all collected urls concurrently within 3 threads:
    # in_parallel(:parse_book_page, urls, threads: 3)

    # Process urls one by one:
    urls.each do |url|
      request_to :parse_book_page, url: url
    end
  end

  def parse_book_page(response, url:, data: {})
    # ...
  end
end

AmazonSpider.crawl!
Output:
$ ruby amazon_spider.rb
I, [2019-01-18 14:06:25 +0400#5875] [M: 47354647381520] INFO -- amazon_spider: Spider: started: amazon_spider
D, [2019-01-18 14:06:26 +0400#5875] [M: 47354647381520] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2019-01-18 14:06:26 +0400#5875] [M: 47354647381520] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/
I, [2019-01-18 14:06:27 +0400#5875] [M: 47354647381520] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/
I, [2019-01-18 14:06:27 +0400#5875] [M: 47354647381520] INFO -- amazon_spider: Info: visits: requests: 1, responses: 1
/gp/slredirect/picassoRedirect.html/
/gp/slredirect/picassoRedirect.html/
https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/
https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/
https://www.amazon.com/Python-Web-Scraping-Cookbook-microservices-ebook/dp/B077NC4TQP/
https://www.amazon.com/Java-Scraping-Handbook-Kevin-Sahin-ebook/dp/B07MKX7SVM/
https://www.amazon.com/Scraping-Excel-David-M-W-Phillips/dp/1522940626/
https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
https://www.amazon.com/Python-Web-Scraping-Hands-scraping/dp/1786462583/
https://www.amazon.com/Web-Scraping-Quick-Start-Guide/dp/1789138736/
https://www.amazon.com/Python-Everybody-Exploring-Data-ebook/dp/B01IA5VIFM/
https://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994/
https://www.amazon.com/Instant-Scraping-Java-Ryan-Mitchell/dp/1849696888/
https://www.amazon.com/Automated-Data-Collection-Practical-Scraping/dp/111883481X/
https://www.amazon.com/Website-Scraping-Python-BeautifulSoup-Scrapy/dp/1484239245/
https://www.amazon.com/Learning-Scrapy-Dimitrios-Kouzis-Loukas/dp/1784399787/
https://www.amazon.com/Web-Robots-Automation-Web-Scraping-Web-marketing-ebook/dp/B07HKT13KC/
I, [2019-01-18 14:06:29 +0400#5875] [M: 47354647381520] INFO -- amazon_spider: Browser: started get request to: /gp/slredirect/picassoRedirect.html/
I, [2019-01-18 14:06:29 +0400#5875] [M: 47354647381520] INFO -- amazon_spider: Info: visits: requests: 2, responses: 1
I, [2019-01-18 14:06:29 +0400#5875] [M: 47354647381520] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
F, [2019-01-18 14:06:29 +0400#5875] [M: 47354647381520] FATAL -- amazon_spider: Spider: stopped: {:spider_name=>"amazon_spider", :status=>:failed, :error=>"#<NoMethodError: undefined method `call' for \"app\":String>", :environment=>"development", :start_time=>2019-01-18 14:06:25 +0400, :stop_time=>2019-01-18 14:06:29 +0400, :running_time=>"3s", :visits=>{:requests=>2, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :events=>{:requests_errors=>{}, :drop_items_errors=>{}, :custom=>{}}}
Traceback (most recent call last):
20: from ruby_kimu.rb:48:in `<main>'
19: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/base.rb:122:in `crawl!'
18: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/base.rb:122:in `each'
17: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/base.rb:126:in `block in crawl!'
16: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/base.rb:200:in `request_to'
15: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/base.rb:200:in `public_send'
14: from ruby_kimu.rb:31:in `parse'
13: from ruby_kimu.rb:31:in `each'
12: from ruby_kimu.rb:32:in `block in parse'
11: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/base.rb:197:in `request_to'
10: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/capybara_ext/session.rb:21:in `visit'
9: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/session.rb:265:in `visit'
8: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/driver.rb:45:in `visit'
7: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:23:in `visit'
6: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:43:in `process_and_follow_redirects'
5: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:65:in `process'
4: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/capybara-mechanize-1.11.0/lib/capybara/mechanize/browser.rb:50:in `block (2 levels) in <class:Browser>'
3: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:58:in `get'
2: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:129:in `custom_request'
1: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:266:in `process_request'
/home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/mock_session.rb:29:in `request': undefined method `call' for "app":String (NoMethodError)
As you can see, the error is caused by the first url in the urls array, which is not actually a valid url (/gp/slredirect/picassoRedirect.html/). When I first wrote this example scraper, all listing urls from the response were correct, but it looks like Amazon has changed some things since then.
So Mechanize doesn't handle this situation properly and instead just throws an uninformative error like undefined method `call' for "app":String (NoMethodError). Other engines, such as Selenium, raise a much better error in this case: unknown error: unhandled inspector error: {"code":-32000,"message":"Cannot navigate to invalid URL"}.
I recommend using the absolute_url helper, which takes care of relative urls and makes them absolute (it also doesn't break urls that are already absolute). Example:
def parse(response, url:, data: {})
  browser.fill_in "field-keywords", with: "Web Scraping Books"
  browser.click_on "Go"

  urls = []
  sleep 3
  response = browser.current_response
  response.xpath("//li//a[contains(@class, 's-access-detail-page')]").each do |a|
    listing_url = absolute_url(a[:href].sub(/ref=.+/, ""), base: url) # use absolute url
    puts listing_url
    urls << listing_url
  end

  urls.each do |listing_url|
    request_to :parse_book_page, url: listing_url
  end
end
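To illustrate what making a relative url absolute amounts to, here is a minimal sketch that uses only URI.join from Ruby's standard library. The to_absolute helper is a hypothetical name for this illustration, and this is not necessarily how absolute_url is implemented in Kimurai:

```ruby
require "uri"

# Hypothetical helper for illustration only; Kimurai's absolute_url may work differently.
def to_absolute(href, base:)
  # URI.join resolves href against base; an href that is already absolute is returned as is.
  URI.join(base, href).to_s
end

base = "https://www.amazon.com/"

relative = to_absolute("/gp/slredirect/picassoRedirect.html/", base: base)
absolute = to_absolute("https://www.amazon.com/Learning-Scrapy-Dimitrios-Kouzis-Loukas/dp/1784399787/", base: base)

puts relative
puts absolute
```

The relative path gets the scheme and host of the base url, while the url that was already absolute comes back unchanged.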
Another possible way is to skip such incorrect urls manually by regex or something like that.
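A rough sketch of the skipping approach, assuming that only hrefs starting with an http or https scheme are wanted. The sample hrefs are taken from the output above, and the pattern is just one possible choice:

```ruby
hrefs = [
  "/gp/slredirect/picassoRedirect.html/",
  "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/",
  "https://www.amazon.com/Learning-Scrapy-Dimitrios-Kouzis-Loukas/dp/1784399787/"
]

# Assumption: a url counts as usable only if it starts with http:// or https://.
absolute_url_pattern = %r{\Ahttps?://}i

# Enumerable#grep keeps only the elements that match the pattern,
# so relative paths are dropped before any request is made.
urls = hrefs.grep(absolute_url_pattern)

puts urls
```

Note that this drops the redirect links entirely instead of resolving them, so it only fits when those listings are not needed.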
It is also probably a good idea to add url checking to the request_to method and raise an error if the url is not absolute or not a url at all. I will do that.
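One way such a check could look, as a rough sketch built on Ruby's standard URI library. The ensure_absolute_url! helper is a hypothetical name, it raises a plain ArgumentError, and it is not the actual Kimurai code:

```ruby
require "uri"

# Hypothetical helper for illustration only; not Kimurai's implementation.
# Returns the url if it is an absolute http(s) url, otherwise raises ArgumentError.
def ensure_absolute_url!(url)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  unless uri.is_a?(URI::HTTP) && !uri.host.nil?
    raise ArgumentError, "Requested url is invalid: #{url}"
  end
  url
rescue URI::InvalidURIError
  # Strings that cannot be parsed as a URI at all end up here.
  raise ArgumentError, "Requested url is invalid: #{url}"
end

ensure_absolute_url!("https://www.amazon.com/")

begin
  ensure_absolute_url!("/gp/slredirect/picassoRedirect.html/")
rescue ArgumentError => e
  puts e.message
end
```

Failing early with the offending url in the message makes the cause obvious, unlike the NoMethodError raised deep inside rack-test.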
Added url validation to the Base#request_to method (master: https://github.com/vifreefly/kimuraframework/commit/1f682b5d9235f0fa598d53c16e5ebc58283eea5e).
Now it raises an error like:
/home/victor/code/kimurai/lib/kimurai/base.rb:194:in `request_to': Requested url is invalid: /gp/slredirect/picassoRedirect.html/ (Kimurai::Base::InvalidUrlError)
@vifreefly You're very efficient. Thank you. I didn't check the validity of links.
An error occurred when I used the in_parallel method.
This is the example.
This is the error info.