sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License

at_xpath and xpath return Nokogiri::XML::Element without proper methods like [] or key? #243

apis closed 14 years ago

apis commented 14 years ago

Intermittently, roughly once every 3-4 runs, I see the following exception:


undefined method `key?' for nil:NilClass
/var/www/ruby/delcampe/app/scripts/crawler.rb:92:in `block in process_page'
/usr/local/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.1/lib/nokogiri/xml/node_set.rb:213:in `block in each'
/usr/local/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.1/lib/nokogiri/xml/node_set.rb:212:in `upto'
/usr/local/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.1/lib/nokogiri/xml/node_set.rb:212:in `each'
/var/www/ruby/delcampe/app/scripts/crawler.rb:77:in `process_page'
/var/www/ruby/delcampe/app/scripts/crawler.rb:27:in `get

OS: CentOS 5, Ruby 1.9.1, Nokogiri 1.4.1

Here is the piece of code that causes it. The 'description' variable is not nil; inspect shows that it is a Nokogiri::XML::Element.


  def process_page(url)
    begin
      doc = get_page(url)
      entries = doc.xpath("//td[@class='itemsListingList']/table[2]/tr[4]/td/table/tr/td/table")
      items = []
      entries.each do |entry|
        hash = {}
        image = entry.at_xpath("tr[2]/td/a/img")
        if (image == nil)
          hash['image'] = nil
        else
          hash['image'] = image['src'].gsub(/\/img_thumb\//, '/img_large/')
        end

        description = entry.at_xpath("tr[3]/td/a")

        if (!description.key?('onclick'))
          href = description['href']
          hash['url'] = @server_name + '/' + href
          href =~ /id,(\d+),var/
          hash['item_number'] = $1.to_i
        else
          description['onclick'] =~ /window.open\("(.+?)".+\);/
          hash['url'] = $1
          $1 =~ /item,(\d+),view_type/
          hash['item_number'] = $1.to_i
        end
        hash['description'] = description.content

        seller = entry.at_xpath("tr[4]/td/span")
        hash['seller'] = seller.content

        items << hash
      end
      return items
    rescue => exception
      Rails.logger.error exception.to_s + "\n" + exception.backtrace.join("\n")
      puts exception.to_s + "\n" + exception.backtrace.join("\n")
      return nil
    end
  end

tenderlove commented 14 years ago

What version of libxml2 are you using? Can you send us the output of nokogiri -v please?

Also, are you sure that the data you get back from your external resource is the same every time?
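One way to answer that question is to log every entry where a lookup comes back empty instead of letting the next method call raise. The following is a minimal sketch, not part of the reporter's code: `find_or_report` is a hypothetical helper, and the stand-in entry object exists only so the example runs without Nokogiri installed. It relies on `at_xpath` returning nil when nothing matches.

```ruby
# Hypothetical helper (not from the original script): looks up +path+ under
# +entry+ and logs a line when nothing matches, so the caller can skip the
# entry instead of calling a method on nil.
def find_or_report(entry, path, log = $stderr)
  node = entry.at_xpath(path)
  log.puts "at_xpath(#{path.inspect}) matched nothing" if node.nil?
  node
end

# Stand-in for a Nokogiri node, used only to make this sketch self-contained.
fake_entry_class = Struct.new(:nodes) do
  def at_xpath(path)
    nodes[path]
  end
end

entry = fake_entry_class.new({ "tr[3]/td/a" => { "href" => "item/id,42,var" } })

description = find_or_report(entry, "tr[3]/td/a")   # found
seller      = find_or_report(entry, "tr[4]/td/span") # logged, returns nil

puts "description found: #{!description.nil?}"
puts "seller found: #{!seller.nil?}"
```

In the real crawler, the same helper could wrap each `entry.at_xpath(...)` call, with `next` used to skip an entry whose lookup returned nil.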

apis commented 14 years ago

Here is the output of uname -a and nokogiri -v:

Linux odessa 2.6.18-164.11.1.el5.028stab068.3 #1 SMP Wed Feb 17 15:22:30 MSK 2010 i686 i686 i386 GNU/Linux

---
warnings: []

nokogiri: 1.4.1
libxml: 
  binding: extension
  compiled: 2.6.26
  loaded: 2.6.26

I'll try to dump external web source output into file and play with it. I'll let you know about my results.
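The `nokogiri -v` report shown above is YAML, so the two libxml2 versions can be pulled out programmatically when comparing machines. This is a rough sketch that assumes the report has the same layout as the one in this comment; `libxml_versions` is a hypothetical helper name.

```ruby
require "yaml"

# Report text copied from the `nokogiri -v` output in this thread.
report = <<~REPORT
  ---
  warnings: []

  nokogiri: 1.4.1
  libxml:
    binding: extension
    compiled: 2.6.26
    loaded: 2.6.26
REPORT

# Hypothetical helper: extracts the compiled and loaded libxml2 versions
# from a report laid out like the one above.
def libxml_versions(report_text)
  libxml = YAML.safe_load(report_text).fetch("libxml")
  {
    compiled: libxml.fetch("compiled").to_s,
    loaded:   libxml.fetch("loaded").to_s
  }
end

versions = libxml_versions(report)
puts "compiled=#{versions[:compiled]} loaded=#{versions[:loaded]}"
puts "compiled and loaded versions differ" if versions[:compiled] != versions[:loaded]
```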

apis commented 14 years ago

I created a test application that always works with the same data file; it runs a modified version of the routine above 50 times in a loop. On my CentOS box it fails consistently, while on my Ubuntu 9.10 box it works fine. You can download it from here: http://odessica.net/lepra/bug.zip

tenderlove commented 14 years ago

Before I try it out, can you make sure that you're using the same version of libxml2 on both systems? Version 2.6.26 is quite old, and most likely if you're seeing strange behavior on one system and not on the other, it's because you're using an older version on one system.

flavorjones commented 14 years ago

Yes, this is a libxml2 bug in pre-2.6.30ish. I remember when it was fixed, though I don't remember the exact ticket in the libxml bugzilla. Please confirm that you can't reproduce this with libxml later than 2.6.32.
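A version check along these lines can flag a machine that may still be affected. It is only a sketch: the 2.6.32 threshold comes from the comment above and the exact release containing the fix is not confirmed in this thread, and `libxml_possibly_affected?` is a hypothetical helper. `Gem::Version` ships with RubyGems.

```ruby
require "rubygems" # provides Gem::Version; already loaded on modern Ruby

# Hypothetical helper: treats any libxml2 at or below the threshold as
# possibly affected. The default threshold is an assumption taken from the
# comment above, not a confirmed fix version.
def libxml_possibly_affected?(loaded_version, threshold = "2.6.32")
  Gem::Version.new(loaded_version) <= Gem::Version.new(threshold)
end

# Versions mentioned in this thread.
%w[2.6.26 2.7.6 2.7.7].each do |version|
  status = libxml_possibly_affected?(version) ? "possibly affected" : "newer than threshold"
  puts "libxml2 #{version}: #{status}"
end
```

Comparing with `Gem::Version` rather than plain strings matters here, because a string comparison would rank "2.6.4" above "2.6.32".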

apis commented 14 years ago

Guys, there is no official libxml2 rpm for CentOS 5 with a version later than 2.6.26. Another complication is that this is my production system, so I have to be extra careful. I spent 2 days trying to find the best way to upgrade libxml2. The best approach I have found so far is to take the Fedora rpm libxml2-2.7.6-1.fc10.i386.rpm and install it with the --replacefiles option. I confirm it FIXES my issue. But it brings another one: every time I run my code, I see the following warning: WARNING: Nokogiri was built against LibXML version 2.6.26, but has dynamically loaded 2.7.6. Even more frustrating, crontab sends me angry emails every time it executes code with nokogiri calls.

tenderlove commented 14 years ago

If you reinstall the nokogiri gem, do you still have the same warning?

apis commented 14 years ago

Nope, it didn't help. Installing the matching Fedora devel rpm libxml2-devel-2.7.6-1.fc10.i386.rpm and then reinstalling nokogiri didn't help either; it breaks something, so the application fails to execute.
But I finally fixed it. I downloaded the source packages ftp://xmlsoft.org/libxml2/libxml2-2.7.7.tar.gz and ftp://xmlsoft.org/libxml2/libxslt-1.1.26.tar.gz, built them with all defaults (./configure; make install), removed the CentOS devel rpms for libxml2 and libxslt, reinstalled the nokogiri gem, and voila, it works like a charm! It was not an easy one, I must admit. I suggest listing it as a known issue with CentOS and RHEL 5 in the release notes or FAQ. Thanks for your help!

flavorjones commented 14 years ago

I've updated the installation tutorial at http://nokogiri.org/ to mention the known issues and your instructions for building from source. Thanks!