sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License

Null pointer when xml parsing in jruby 1.6.4 and nokogiri (1.5.0 java) #548

Closed tc closed 12 years ago

tc commented 12 years ago

The same code works in Ruby 1.9.2, but in JRuby I get a null pointer error when parsing the string below and calling doc.text:

xml_string = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\\n<articles loadtime=\"0 sec\" rendertime=\"0.022 sec\" totaltime=\"0.022 sec\"><article><paragraph><template name=\"Infobox Radio station\">\\n<param name=\"name\">WXYG</param>\\n<param name=\"image\"/>\\n<param name=\"city\"><link><target>Sauk Rapids, Minnesota</target></link></param>\\n<param name=\"area\"/>\\n<param name=\"branding\"><italics>Album Rock 540, The Goat</italics></param>\\n<param name=\"slogan\"/>\\n<param name=\"airdate\"/>\\n<param name=\"frequency\">540<space/><link><target>kilohertz</target><part>kHz</part></link></param>\\n<param name=\"format\"><link><target>Classic rock</target></link></param>\\n<param name=\"power\">250<space/><link><target>watt</target><trail>s</trail></link><space/>(<link><target>Daytime (astronomy)</target><part>day</part></link>)<xhtml:br xmlns:xhtml=\"\"/>250 watts (<link><target>night</target></link>)</param>\\n<param name=\"class\">B</param>\\n<param name=\"facility_id\">161448</param>\\n<param name=\"coordinates\"><template name=\"coord\">\\n<param name=\"1\">45</param>\\n<param name=\"2\">36</param>\\n<param name=\"3\">18</param>\\n<param name=\"4\">N</param>\\n<param name=\"5\">94</param>\\n<param name=\"6\">08</param>\\n<param name=\"7\">21</param>\\n<param name=\"8\">W</param>\\n</template></param>\\n<param name=\"callsign_meaning\"/>\\n<param name=\"former_callsigns\">WXYG (2007-2008)<xhtml:br xmlns:xhtml=\"\"/>WMIN (8/08-12/08)<xhtml:br xmlns:xhtml=\"\"/>WPPI (2008-2009)<extension extension_name=\"ref\" name=\"fcc1\"/></param>\\n<param name=\"affiliations\"/>\\n<param name=\"owner\"><link><target>Tri-County Broadcasting</target></link></param>\\n<param name=\"licensee\">Herbert M. 
Hoppe</param>\\n<param name=\"sister_stations\"><link><target>WBHR</target></link>,<space/><link><target>WHMH-FM</target></link>,<space/><link><target>WMIN</target></link>,<space/><link><target>WVAL</target></link></param>\\n<param name=\"webcast\"><template name=\"listen live\">\\n<param name=\"1\">http://www.540wxyg.com</param>\\n</template></param>\\n<param name=\"website\"><link type=\"external\" href=\"http://www.540wxyg.com\">540wxyg.com</link></param>\\n</template></paragraph><paragraph><sentence id=\"4614374/0\"><bold><link synthetic=\"true\"><target>WXYG</target></link></bold><space/>(540<space/><link><target>AM broadcasting</target><part>AM</part></link>) is an American<space/><link><target>radio station</target></link><space/>intended to serve<space/><link><target>Sauk Rapids, Minnesota</target></link>, USA.</sentence> <sentence id=\"4614374/1\">The station is part of the<space/><link><target>Tri-County Broadcasting</target></link><space/>group and the<space/><link><target>construction permit</target></link><space/>is held by Herbert M.</sentence> <sentence id=\"4614374/2\">Hoppe.</sentence> <sentence id=\"4614374/3\"><link synthetic=\"true\"><target>WXYG</target></link> recently signed on the air with a<space/><link><target>classic rock</target></link><space/>format.</sentence></paragraph><heading level=\"2\">History</heading><paragraph><sentence id=\"4614374/4\">This station received its original<space/><link><target>construction permit</target></link><space/>from the<space/><link><target>Federal Communications Commission</target></link><space/>on July 26,<space/><link><target>2007 in radio</target><part>2007</part></link>.</sentence><extension extension_name=\"ref\" name=\"fccp\"><template name=\"cite web\">\\n<param name=\"publisher\">FCC Media Bureau</param>\\n<param name=\"title\">Application Search Details (BNP-20040130BCE)</param>\\n<param 
name=\"url\">http://licensing.fcc.gov/cgi-bin/ws.exe/prod/cdbs/pubacc/prod/app_det.pl?Application_id=975517</param>\\n<param name=\"date\">July 26, 2007</param>\\n</template><template name=\"cite web\">\\n<param name=\"publisher\">FCC Media Bureau</param>\\n<param name=\"title\">Application Search Details (BNP-20040130BCE)</param>\\n<param name=\"url\">http://licensing.fcc.gov/cgi-bin/ws.exe/prod/cdbs/pubacc/prod/app_det.pl?Application_id=975517</param>\\n<param name=\"date\">July 26, 2007</param>\\n</template></extension><space/><sentence id=\"4614374/5\">The new station was assigned the<space/><link><target>call sign</target></link><space/><link synthetic=\"true\"><target>WXYG</target></link> by the FCC on September 10, 2007.</sentence><extension extension_name=\"ref\" name=\"fcc1\"><template name=\"cite web\">\\n<param name=\"title\">Call Sign History</param>\\n<param name=\"url\">http://licensing.fcc.gov/cgi-bin/ws.exe/prod/cdbs/pubacc/prod/call_hist.pl?Facility_id=161448&amp;Callsign=WXYG</param>\\n<param name=\"publisher\">FCC Media Bureau CDBS Public Access Database</param>\\n<param name=\"accessdate\">June 10, 2009</param>\\n</template><template name=\"cite web\">\\n<param name=\"title\">Call Sign History</param>\\n<param name=\"url\">http://licensing.fcc.gov/cgi-bin/ws.exe/prod/cdbs/pubacc/prod/call_hist.pl?Facility_id=161448&amp;Callsign=WXYG</param>\\n<param name=\"publisher\">FCC Media Bureau CDBS Public Access Database</param>\\n<param name=\"accessdate\">June 10, 2009</param>\\n</template></extension><space/><sentence id=\"4614374/6\">The call sign was changed to <link synthetic=\"true\"><target>WMIN</target></link> on August 12, 2008, to WPPI on December 2, 2008, and back to <link synthetic=\"true\"><target>WXYG</target></link> on December 14, 2009.</sentence><extension extension_name=\"ref\" name=\"fcc1\"/><space/><sentence id=\"4614374/7\">This construction permit was scheduled to expire on July 25, 2010.</sentence><extension extension_name=\"ref\" 
name=\"fccp\"/></paragraph><paragraph><sentence id=\"4614374/8\">As of November 8, 2010, <link synthetic=\"true\"><target>WXYG</target></link>, which has been occasionally testing with a mix of rock and<space/><link><target>country music</target></link><space/>since June, began playing<space/><link><target>Christmas music</target></link>.</sentence> <sentence id=\"4614374/9\">The station resumed testing after the holiday season.</sentence> <sentence id=\"4614374/10\">On May 23, 2011, the FCC granted the station<space/><link><target>program text authority</target></link><space/>to begin broadcasting before receiving its broadcast license.</sentence><space/></paragraph><paragraph><sentence id=\"4614374/11\">On June 24, 2011 <link synthetic=\"true\"><target>WXYG</target></link> ended testing and signed on the air with album rock, branded as \"Album <link synthetic=\"true\"><target>Classic rock</target><part>Rock</part></link> 540, The Goat\".</sentence></paragraph><heading level=\"2\">References</heading><paragraph><template name=\"reflist\">\\n</template></paragraph><heading level=\"2\">External links</heading><list type=\"bullet\"><listitem><sentence id=\"4614374/12\"><link type=\"external\" href=\"http://www.tricountybroadcasting.com/\">Tri-County Broadcasting</link></sentence></listitem><listitem><template name=\"AM station data\">\\n<param name=\"1\">WXYG</param>\\n</template></listitem></list><paragraph><template name=\"St. Cloud Radio\">\\n</template><template name=\"Classic Rock Radio Stations in Minnesota\">\\n</template></paragraph><paragraph><sentence id=\"4614374/13\"><link><target>Category:Benton County, Minnesota</target></link><link><target>Category:Proposed radio stations</target></link></sentence></paragraph><paragraph><template name=\"Minnesota-radio-station-stub\">\\n</template></paragraph></article></articles>" 

doc = Nokogiri::XML(xml_string)
doc.text
Java::JavaLang::NullPointerException: 
    from nokogiri.XmlNode.content(XmlNode.java:730)
    from nokogiri.XmlNode$i$0$0$content.call(XmlNode$i$0$0$content.gen:65535)
    from org.jruby.internal.runtime.methods.AliasMethod.call(AliasMethod.java:56)
    from org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:292)
    from org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:135)
    from org.jruby.ast.CallNoArgNode.interpret(CallNoArgNode.java:63)
    from org.jruby.ast.NewlineNode.interpret(NewlineNode.java:104)
    from org.jruby.ast.RootNode.interpret(RootNode.java:129)
    from org.jruby.evaluator.ASTInterpreter.INTERPRET_EVAL(ASTInterpreter.java:96)
    from org.jruby.evaluator.ASTInterpreter.evalWithBinding(ASTInterpreter.java:161)
    from org.jruby.RubyKernel.evalCommon(RubyKernel.java:1135)
    from org.jruby.RubyKernel.eval(RubyKernel.java:1088)
    from org.jruby.RubyKernel$s$0$3$eval.call(RubyKernel$s$0$3$eval.gen:65535)
    from org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:181)
    from org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:69)
    from org.jruby.ast.FCallManyArgsNode.interpret(FCallManyArgsNode.java:60)
... 114 levels...
    from org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:75)
    from org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:190)
    from org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:179)
    from org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:312)
    from org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:169)
    from Users.t.$_dot_rvm.rubies.jruby_minus_1_dot_6_dot_4.bin.irb.__file__(/Users/t/.rvm/rubies/jruby-1.6.4/bin/irb:17)
    from Users.t.$_dot_rvm.rubies.jruby_minus_1_dot_6_dot_4.bin.irb.load(/Users/t/.rvm/rubies/jruby-1.6.4/bin/irb)
    from org.jruby.Ruby.runScript(Ruby.java:679)
    from org.jruby.Ruby.runScript(Ruby.java:672)
    from org.jruby.Ruby.runNormally(Ruby.java:579)
    from org.jruby.Ruby.runFromMain(Ruby.java:428)
    from org.jruby.Main.doRunFromMain(Main.java:278)
    from org.jruby.Main.internalRun(Main.java:198)
    from org.jruby.Main.run(Main.java:164)
    from org.jruby.Main.run(Main.java:148)

yokolet commented 12 years ago

Hello!

You have "\\n" (a literal backslash followed by "n") instead of "\n" (a newline) in the document. Because of this, the document could not be parsed even with the libxml version of Nokogiri. So I first replaced every "\\n" with "\n" and then parsed the document again. After that, I did not get a NullPointerException at all. However, the XML document has several other errors besides the "\\n", and it appears to be invalid. Because of this, the Java version could not parse the whole document and returned very little text. The libxml version is less strict about validity, so I got a lot of text from the line "doc.text".
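
The substitution described above can be sketched in plain Ruby. This is only an illustration: the helper name unescape_newlines and the short sample string are hypothetical stand-ins, not part of the report, and the sample is far smaller than the real document.

```ruby
# Hypothetical helper: replace the two-character sequence backslash + "n"
# with a real newline character. Single quotes keep '\n' literal.
def unescape_newlines(xml)
  xml.gsub('\n', "\n")
end

# Short stand-in for the reported string (assumption, not the full document).
# Inside double quotes, "\\n" is a backslash followed by the letter n.
sample = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\\n<articles>\\n<param name=\"name\">WXYG</param>\\n</articles>"

cleaned = unescape_newlines(sample)
puts cleaned
```

The cleaned string can then be passed to Nokogiri::XML in place of the original one.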

This difference in parser behavior is very hard to overcome. Unless the document is parsed successfully, Nokogiri's methods can't do anything useful. So, would you review the document?

tc commented 12 years ago

The document came from the Freebase WEX format. It's an attempt at transforming MediaWiki into XML, but I guess it doesn't produce semantically correct XML.