vifreefly / kimuraframework

Kimurai is a modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites
MIT License

in_parallel: undefined method `call' for "app":String (NoMethodError) #19

Closed dccmmtop closed 5 years ago

dccmmtop commented 5 years ago

I get an error when I use the in_parallel method.

Here is an example:

# amazon_spider.rb
require 'kimurai'

class AmazonSpider < Kimurai::Base
  @name = "amazon_spider"
  @engine = :mechanize
  @start_urls = ["https://www.amazon.com/"]

  def parse(response, url:, data: {})
    browser.fill_in "field-keywords", with: "Web Scraping Books"
    browser.click_on "Go"

    # Walk through pagination and collect products urls:
    urls = []
    loop do
      response = browser.current_response
      response.xpath("//li//a[contains(@class, 's-access-detail-page')]").each do |a|
        urls << a[:href].sub(/ref=.+/, "")
      end

      browser.find(:xpath, "//a[@id='pagnNextLink']", wait: 1).click rescue break
    end

    # Process all collected urls concurrently within 3 threads:
    in_parallel(:parse_book_page, urls, threads: 3)
  end

  def parse_book_page(response, url:, data: {})
    item = {}

    item[:title] = response.xpath("//h1/span[@id]").text.squish
    item[:url] = url
    item[:price] = response.xpath("(//span[contains(@class, 'a-color-price')])[1]").text.squish.presence
    item[:publisher] = response.xpath("//h2[text()='Product details']/following::b[text()='Publisher:']/following-sibling::text()[1]").text.squish.presence

    save_to "books.json", item, format: :pretty_json
  end
end

AmazonSpider.crawl!

This is the error output:

I, [2019-01-17 10:25:33 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Spider: started: amazon_spider
D, [2019-01-17 10:25:34 +0800#12757] [M: 47339757413960] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2019-01-17 10:25:34 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/
I, [2019-01-17 10:25:38 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/
I, [2019-01-17 10:25:38 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Info: visits: requests: 1, responses: 1
I, [2019-01-17 10:25:48 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Spider: in_parallel: starting processing 63 urls within 3 threads
D, [2019-01-17 10:25:48 +0800#12757] [C: 47339781353100] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2019-01-17 10:25:48 +0800#12757] [C: 47339781353100]  INFO -- amazon_spider: Browser: started get request to: /gp/slredirect/picassoRedirect.html/
I, [2019-01-17 10:25:48 +0800#12757] [C: 47339781353100]  INFO -- amazon_spider: Info: visits: requests: 2, responses: 1
I, [2019-01-17 10:25:48 +0800#12757] [C: 47339781353100]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed
#<Thread:0x0000561c4db3dd18@/home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:295 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
        14: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:301:in `block (2 levels) in in_parallel'
        13: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:301:in `each'
        12: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:309:in `block (3 levels) in in_parallel'
        11: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:197:in `request_to'
        10: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/capybara_ext/session.rb:21:in `visit'
         9: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/session.rb:265:in `visit'
         8: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/driver.rb:45:in `visit'
         7: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:23:in `visit'
         6: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:43:in `process_and_follow_redirects'
         5: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:65:in `process'
         4: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-mechanize-1.11.0/lib/capybara/mechanize/browser.rb:50:in `block (2 levels) in <class:Browser>'
         3: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:58:in `get'
         2: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:129:in `custom_request'
         1: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:266:in `process_request'
/home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/mock_session.rb:29:in `request': undefined method `call' for "app":String (NoMethodError)
I, [2019-01-17 10:25:48 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed
F, [2019-01-17 10:25:48 +0800#12757] [M: 47339757413960] FATAL -- amazon_spider: Spider: stopped: {:spider_name=>"amazon_spider", :status=>:failed, :error=>"#<NoMethodError: undefined method `call' for \"app\":String>", :environment=>"development", :start_time=>2019-01-17 10:25:33 +0800, :stop_time=>2019-01-17 10:25:48 +0800, :running_time=>"15s", :visits=>{:requests=>2, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :events=>{:requests_errors=>{}, :drop_items_errors=>{}, :custom=>{}}}
Traceback (most recent call last):
        14: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:301:in `block (2 levels) in in_parallel'
        13: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:301:in `each'
        12: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:309:in `block (3 levels) in in_parallel'
        11: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:197:in `request_to'
        10: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/capybara_ext/session.rb:21:in `visit'
         9: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/session.rb:265:in `visit'
         8: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/driver.rb:45:in `visit'
         7: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:23:in `visit'
         6: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:43:in `process_and_follow_redirects'
         5: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:65:in `process'
         4: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-mechanize-1.11.0/lib/capybara/mechanize/browser.rb:50:in `block (2 levels) in <class:Browser>'
         3: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:58:in `get'
         2: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:129:in `custom_request'
         1: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:266:in `process_request'
/home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/mock_session.rb:29:in `request': undefined method `call' for "app":String (NoMethodError)
vifreefly commented 5 years ago

The problem is not in the in_parallel method. You get the same error when processing the urls one by one:

require 'kimurai'

class AmazonSpider < Kimurai::Base
  @name = "amazon_spider"
  @engine = :mechanize
  @start_urls = ["https://www.amazon.com/"]

  def parse(response, url:, data: {})
    browser.fill_in "field-keywords", with: "Web Scraping Books"
    browser.click_on "Go"

    urls = []
    response = browser.current_response

    response.xpath("//li//a[contains(@class, 's-access-detail-page')]").each do |a|
      url = a[:href].sub(/ref=.+/, "")
      puts url
      urls << url
    end

    # Process all collected urls concurrently within 3 threads:
    # in_parallel(:parse_book_page, urls, threads: 3)

    # Process urls one by one:
    urls.each do |url|
      request_to :parse_book_page, url: url
    end
  end

  def parse_book_page(response, url:, data: {})
    # ...
  end
end

AmazonSpider.crawl!

Output:

$ ruby amazon_spider.rb
I, [2019-01-18 14:06:25 +0400#5875] [M: 47354647381520]  INFO -- amazon_spider: Spider: started: amazon_spider
D, [2019-01-18 14:06:26 +0400#5875] [M: 47354647381520] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2019-01-18 14:06:26 +0400#5875] [M: 47354647381520]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/
I, [2019-01-18 14:06:27 +0400#5875] [M: 47354647381520]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/
I, [2019-01-18 14:06:27 +0400#5875] [M: 47354647381520]  INFO -- amazon_spider: Info: visits: requests: 1, responses: 1
/gp/slredirect/picassoRedirect.html/
/gp/slredirect/picassoRedirect.html/
https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/
https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/
https://www.amazon.com/Python-Web-Scraping-Cookbook-microservices-ebook/dp/B077NC4TQP/
https://www.amazon.com/Java-Scraping-Handbook-Kevin-Sahin-ebook/dp/B07MKX7SVM/
https://www.amazon.com/Scraping-Excel-David-M-W-Phillips/dp/1522940626/
https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
https://www.amazon.com/Python-Web-Scraping-Hands-scraping/dp/1786462583/
https://www.amazon.com/Web-Scraping-Quick-Start-Guide/dp/1789138736/
https://www.amazon.com/Python-Everybody-Exploring-Data-ebook/dp/B01IA5VIFM/
https://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994/
https://www.amazon.com/Instant-Scraping-Java-Ryan-Mitchell/dp/1849696888/
https://www.amazon.com/Automated-Data-Collection-Practical-Scraping/dp/111883481X/
https://www.amazon.com/Website-Scraping-Python-BeautifulSoup-Scrapy/dp/1484239245/
https://www.amazon.com/Learning-Scrapy-Dimitrios-Kouzis-Loukas/dp/1784399787/
https://www.amazon.com/Web-Robots-Automation-Web-Scraping-Web-marketing-ebook/dp/B07HKT13KC/
I, [2019-01-18 14:06:29 +0400#5875] [M: 47354647381520]  INFO -- amazon_spider: Browser: started get request to: /gp/slredirect/picassoRedirect.html/
I, [2019-01-18 14:06:29 +0400#5875] [M: 47354647381520]  INFO -- amazon_spider: Info: visits: requests: 2, responses: 1
I, [2019-01-18 14:06:29 +0400#5875] [M: 47354647381520]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed
F, [2019-01-18 14:06:29 +0400#5875] [M: 47354647381520] FATAL -- amazon_spider: Spider: stopped: {:spider_name=>"amazon_spider", :status=>:failed, :error=>"#<NoMethodError: undefined method `call' for \"app\":String>", :environment=>"development", :start_time=>2019-01-18 14:06:25 +0400, :stop_time=>2019-01-18 14:06:29 +0400, :running_time=>"3s", :visits=>{:requests=>2, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :events=>{:requests_errors=>{}, :drop_items_errors=>{}, :custom=>{}}}
Traceback (most recent call last):
        20: from ruby_kimu.rb:48:in `<main>'
        19: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/base.rb:122:in `crawl!'
        18: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/base.rb:122:in `each'
        17: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/base.rb:126:in `block in crawl!'
        16: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/base.rb:200:in `request_to'
        15: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/base.rb:200:in `public_send'
        14: from ruby_kimu.rb:31:in `parse'
        13: from ruby_kimu.rb:31:in `each'
        12: from ruby_kimu.rb:32:in `block in parse'
        11: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/base.rb:197:in `request_to'
        10: from /home/bob/code/kimu_app/new/kimurai/lib/kimurai/capybara_ext/session.rb:21:in `visit'
         9: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/session.rb:265:in `visit'
         8: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/driver.rb:45:in `visit'
         7: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:23:in `visit'
         6: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:43:in `process_and_follow_redirects'
         5: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:65:in `process'
         4: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/capybara-mechanize-1.11.0/lib/capybara/mechanize/browser.rb:50:in `block (2 levels) in <class:Browser>'
         3: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:58:in `get'
         2: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:129:in `custom_request'
         1: from /home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:266:in `process_request'
/home/bob/.rbenv/versions/2.5.3/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/mock_session.rb:29:in `request': undefined method `call' for "app":String (NoMethodError)

As you can see, the error is caused by the first url in the urls array, which is a relative path rather than a valid absolute url (/gp/slredirect/picassoRedirect.html/). When I first wrote this example scraper, all listing urls in the response were absolute, but it looks like Amazon has changed something since then.

Mechanize doesn't handle this situation properly and just throws an uninformative error: undefined method `call' for "app":String (NoMethodError). Other engines, such as Selenium, raise an error like this in the same case: unknown error: unhandled inspector error: {"code":-32000,"message":"Cannot navigate to invalid URL"}, which is much better.

I recommend using the absolute_url helper, which converts relative urls to absolute ones (it also leaves urls that are already absolute unchanged). Example:

  def parse(response, url:, data: {})
    browser.fill_in "field-keywords", with: "Web Scraping Books"
    browser.click_on "Go"

    urls = []
    sleep 3
    response = browser.current_response

    response.xpath("//li//a[contains(@class, 's-access-detail-page')]").each do |a|
      listing_url = absolute_url(a[:href].sub(/ref=.+/, ""), base: url) # use absolute url
      puts listing_url
      urls << listing_url
    end

    urls.each do |listing_url|
      request_to :parse_book_page, url: listing_url
    end
  end
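
To illustrate what resolving a relative url against a base means, here is a minimal standalone sketch using only Ruby's standard library. The to_absolute helper is a hypothetical name for illustration and is not Kimurai's actual absolute_url implementation; it only shows the general idea of joining an href with the url of the page it was found on.

```ruby
require "uri"

# Hypothetical helper (not Kimurai's implementation): resolve an href
# against the url of the page where it was found.
def to_absolute(href, base:)
  URI.join(base, href).to_s
end

base = "https://www.amazon.com/"

# A relative path gets the scheme and host of the base url.
relative = to_absolute("/gp/slredirect/picassoRedirect.html/", base: base)

# A url that is already absolute is returned unchanged.
absolute = to_absolute("https://www.amazon.com/Web-Scraping-Quick-Start-Guide/dp/1789138736/", base: base)

puts relative
puts absolute
```

The first line printed is the relative path prefixed with https://www.amazon.com, and the second is the same url that was passed in.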

Another option is to skip such invalid urls manually, with a regex or something similar. It is also probably a good idea to add url checking to the request_to method and raise an error if the url is not absolute or is not a url at all. I will do that.
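
As a rough sketch of the "skip invalid urls manually" option, the collected urls could be split into absolute http(s) urls and everything else before any requests are made. The absolute_http_url? predicate below is a hypothetical name and uses only Ruby's standard uri library; it is one possible check, not something provided by Kimurai.

```ruby
require "uri"

# Hypothetical predicate: true only for absolute http(s) urls with a host.
def absolute_http_url?(url)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

urls = [
  "/gp/slredirect/picassoRedirect.html/",
  "https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/",
  "https://www.amazon.com/Learning-Scrapy-Dimitrios-Kouzis-Loukas/dp/1784399787/"
]

# Keep the urls that are safe to request and set the rest aside.
valid, skipped = urls.partition { |url| absolute_http_url?(url) }

puts "valid:   #{valid.size}"
puts "skipped: #{skipped.inspect}"
```

Only the urls in valid would then be passed to request_to or in_parallel; the skipped ones can be logged for inspection.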

vifreefly commented 5 years ago

Added url validation for Base#request_to method (master https://github.com/vifreefly/kimuraframework/commit/1f682b5d9235f0fa598d53c16e5ebc58283eea5e)

Now it raises an error like: /home/victor/code/kimurai/lib/kimurai/base.rb:194:in `request_to': Requested url is invalid: /gp/slredirect/picassoRedirect.html/ (Kimurai::Base::InvalidUrlError)
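
For readers who want the same fail-fast behaviour in their own code, here is a standalone sketch of validating a url before requesting it and handling the failure in the caller. InvalidUrlError and validate_url! below are hypothetical stand-ins defined in the snippet itself; they are not the code from the commit above, and the actual check in Kimurai may differ.

```ruby
require "uri"

# Hypothetical error class for this sketch only.
class InvalidUrlError < StandardError; end

# Hypothetical validator: raise unless url is an absolute http(s) url.
def validate_url!(url)
  uri = begin
    URI.parse(url)
  rescue URI::InvalidURIError
    nil
  end

  unless uri.is_a?(URI::HTTP) && uri.host
    raise InvalidUrlError, "Requested url is invalid: #{url}"
  end

  url
end

urls = [
  "/gp/slredirect/picassoRedirect.html/",
  "https://www.amazon.com/Web-Scraping-Quick-Start-Guide/dp/1789138736/"
]

visited = []
errors = []

urls.each do |url|
  begin
    validate_url!(url)
    visited << url # a real spider would make the request here
  rescue InvalidUrlError => e
    errors << e.message # record the problem and continue with the next url
  end
end

puts errors
```

With this pattern a single bad url is reported with a clear message instead of stopping the whole run.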

dccmmtop commented 5 years ago

@vifreefly You're very efficient. Thank you. I didn't check the validity of links.