Open npac opened 7 years ago
Thanks for opening this issue! I'll take a look.
On Mar 1, 2017 4:12 PM, "pcapcanari" notifications@github.com wrote:
Memory leak is observed while searching in a document with undefined namespace prefix.
Nokogiri (1.7.0.1)
--- warnings: [] nokogiri: 1.7.0.1 ruby: version: 2.3.1 platform: x86_64-linux description: ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-linux] engine: ruby libxml: binding: extension source: packaged libxml2_path: "/opt/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/nokogiri-1.7.0.1/ports/x86_64-pc-linux-gnu/libxml2/2.9.4" libxslt_path: "/opt/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/nokogiri-1.7.0.1/ports/x86_64-pc-linux-gnu/libxslt/1.1.29" libxml2_patches: [] libxslt_patches: [] compiled: 2.9.4 loaded: 2.9.4
Reproduced with:
require 'nokogiri'while true do begin
parse simple xml. no namespace definition
doc = Nokogiri::XML('<ns1:Root></ns1:Root>') # below code raises exception and leaks memory body = doc.search('//ns1:Root').first puts body.to_xml
rescue => e puts e end end
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/sparklemotion/nokogiri/issues/1610, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAgD2jPxXM6kfIeaRJbE38Ji1lrOTwJks5rhd9SgaJpZM4MQNWE .
I've reproduced this behavior. Digging in.
OK, this leak appears to come from using Nokogiri_error_raise
to raise an exception during evaluation of the xpath query, which under the hood does a C longjmp
and prevents libxml2 from cleaning up memory that had been allocated before the error was encountered.
Nokogiri_error_raise
is used in a handful of places, and so now I'm worried that we may be leaking memory under other circumstances as well. I think I'd like to take a look at overhauling our current error-handling pattern to avoid raising exceptions while libxml2 is in the middle of something.
I've audited the C code for non-trivial raises, use of handlers, and general patterns of error handling:
Nokogiri_error_raise
is used38- if (xmlParseChunk(ctx, chunk, size, Qtrue == _last_chunk ? 1 : 0)) {
39- if (!(ctx->options & XML_PARSE_RECOVER)) {
40- xmlErrorPtr e = xmlCtxtGetLastError(ctx);
41: Nokogiri_error_raise(NULL, e);
42- }
43- }
44-
characterization:
81- if(NULL == schema) {
82- xmlErrorPtr error = xmlGetLastError();
83- if(error)
84: Nokogiri_error_raise(NULL, error);
85- else
86- rb_raise(rb_eRuntimeError, "Could not parse document");
87-
characterization:
133- if(NULL == schema) {
134- xmlErrorPtr error = xmlGetLastError();
135- if(error)
136: Nokogiri_error_raise(NULL, error);
137- else
138- rb_raise(rb_eRuntimeError, "Could not parse document");
139-
characterization:
23- if(htmlParseChunk(ctx, chunk, size, Qtrue == _last_chunk ? 1 : 0)) {
24- if (!(ctx->options & XML_PARSE_RECOVER)) {
25- xmlErrorPtr e = xmlCtxtGetLastError(ctx);
26: Nokogiri_error_raise(NULL, e);
27- }
28- }
29-
characterization:
209- }
210-
211- xmlResetLastError();
212: xmlSetStructuredErrorFunc(NULL, Nokogiri_error_raise);
213-
214- /* For some reason, xmlXPathEvalExpression will blow up with a generic error */
215- /* when there is a non existent function. */
characterization:
120- if(NULL == schema) {
121- xmlErrorPtr error = xmlGetLastError();
122- if(error)
123: Nokogiri_error_raise(NULL, error);
124- else
125- rb_raise(rb_eRuntimeError, "Could not parse document");
126-
characterization:
173- if(NULL == schema) {
174- xmlErrorPtr error = xmlGetLastError();
175- if(error)
176: Nokogiri_error_raise(NULL, error);
177- else
178- rb_raise(rb_eRuntimeError, "Could not parse document");
179-
characterization:
rb_exc_raise
is used458-
459- error = xmlGetLastError();
460- if(error)
461: rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
462- else
463- rb_raise(rb_eRuntimeError, "Error pulling: %d", ret);
464-
characterization:
1401-
1402- error = xmlGetLastError();
1403- if(error)
1404: rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
1405- else
1406- rb_raise(rb_eRuntimeError, "Could not perform xinclude substitution");
1407- }
characterization:
80- if (!ss) {
81- xmlFreeDoc(xml_cpy);
82- exception = rb_exc_new3(rb_eRuntimeError, errstr);
83: rb_exc_raise(exception);
84- }
85-
86- return Nokogiri_wrap_xslt_stylesheet(ss);
characterization:
177-
178- if (parse_error_occurred) {
179- exception = rb_exc_new3(rb_eRuntimeError, errstr);
180: rb_exc_raise(exception);
181- }
182-
183- return Nokogiri_wrap_xml_document((VALUE)0, result) ;
characterization:
221-
222- if(xpath == NULL) {
223- xmlErrorPtr error = xmlGetLastError();
224: rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
225- }
226-
227- assert(ctx->doc);
characterization:
254-
255- error = xmlGetLastError();
256- if(error)
257: rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
258- else
259- rb_raise(rb_eRuntimeError, "Could not parse document");
260-
characterization:
298-
299- error = xmlGetLastError();
300- if(error)
301: rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
302- else
303- rb_raise(rb_eRuntimeError, "Could not parse document");
304-
characterization:
446- if(NULL == ptr) {
447- xmlErrorPtr error = xmlGetLastError();
448- if(error)
449: rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
450- else
451- rb_raise(rb_eRuntimeError, "Could not create entity");
452-
characterization:
66- VALUE encoding_found = rb_funcall(io, id_encoding_found, 0);
67- if (!NIL_P(encoding_found)) {
68- xmlFreeDoc(doc);
69: rb_exc_raise(encoding_found);
70- }
71- }
72-
characterization:
77-
78- error = xmlGetLastError();
79- if(error)
80: rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
81- else
82- rb_raise(rb_eRuntimeError, "Could not parse document");
83-
characterization:
123-
124- error = xmlGetLastError();
125- if(error)
126: rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
127- else
128- rb_raise(rb_eRuntimeError, "Could not parse document");
129-
characterization:
rb_raise
is used547-
548- if(reader == NULL) {
549- xmlFreeTextReader(reader);
550: rb_raise(rb_eRuntimeError, "couldn't create a parser");
551- }
552-
553- rb_reader = Data_Wrap_Struct(klass, NULL, dealloc, reader);
characterization:
592-
593- if(reader == NULL) {
594- xmlFreeTextReader(reader);
595: rb_raise(rb_eRuntimeError, "couldn't create a parser");
596- }
597-
598- rb_reader = Data_Wrap_Struct(klass, NULL, dealloc, reader);
characterization:
1491- switch (error) {
1492- case XML_ERR_INTERNAL_ERROR:
1493- case XML_ERR_NO_MEMORY:
1494: rb_raise(rb_eRuntimeError, "error parsing fragment (%d)", error);
1495- break;
1496- default:
1497- break;
characterization:
69- filename
70- );
71- if (ctx == NULL) {
72: rb_raise(rb_eRuntimeError, "Could not create a parser context");
73- }
74-
75- ctx->userData = NOKOGIRI_SAX_TUPLE_NEW(ctx, self);
characterization:
93- Data_Get_Struct(self, xmlParserCtxt, ctx);
94-
95- if (xmlCtxtUseOptions(ctx, (int)NUM2INT(options)) != 0) {
96: rb_raise(rb_eRuntimeError, "Cannot set XML parser context options");
97- }
98-
99- return Qnil;
characterization:
I'll further note that this is the script I used to generate the samplings of code I filtered through above (in case I need to re-do the analysis after so much time):
#!/usr/bin/env ruby
target = "ext/nokogiri"
command = "ack -C3 --group --nocolor --ignore-file=match:xml_syntax_error*"
expressions = %w[
Nokogiri_error_raise
rb_exc_raise
rb_raise
]
expressions.each do |expression|
puts "## places where `#{expression}` is used"
puts
cmd = "#{command} #{expression} #{target}"
output = `#{cmd}`
output.split("\n").each do |line|
if line =~ /\.[ch]$/
puts "### #{line}"
puts
puts "```C"
elsif line.empty?
puts "```"
puts
elsif line =~ /^--$/
puts "```"
puts
puts "```C"
else
puts line
end
end
puts "```"
puts
end
Note that TruffleRuby would also prefer if we didn't raise from a native code callback, see https://github.com/sparklemotion/nokogiri/issues/1882
Recent relevant post from @peterzhu2118 with explanatory notes: https://blog.peterzhu.ca/ruby-c-ext-part-8/
Note also some exploratory work I've started at #2096
Memory leak is observed while searching in a document with undefined namespace prefix.
Reproduced with: