sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License
6.12k stars 901 forks source link

adopt save-and-restore pattern for libxml2 error handlers everywhere #2172

Open flavorjones opened 3 years ago

flavorjones commented 3 years ago

See #2168 and #2169 for details, but the short version is that we should be more rigorous about saving-and-restoring error handlers and error handler metadata around libxml2 calls, in case any are being made recursively within Nokogiri.

For example, these lines in Nokogiri::HTML::Document::EncodingReader are calling HTML::SAX::PushParser to parse a chunk from the IO read callback of a regular document parse:

https://github.com/sparklemotion/nokogiri/blob/7be6f04aa2700e818f8a3bfe82801b5bd6e8c4f4/lib/nokogiri/html/document.rb#L275-L277

To allow users to do similarly complex things, we should always save-and-restore the error callbacks (which are the only global state I can think of that we regularly manipulate).

We're doing this in the HTML::SAX::PushParser class to cover ourselves in the aforementioned case:

https://github.com/sparklemotion/nokogiri/blob/7be6f04aa2700e818f8a3bfe82801b5bd6e8c4f4/ext/nokogiri/html_sax_push_parser.c#L24-L28

https://github.com/sparklemotion/nokogiri/blob/7be6f04aa2700e818f8a3bfe82801b5bd6e8c4f4/ext/nokogiri/xml_syntax_error.h#L6-L17

https://github.com/sparklemotion/nokogiri/blob/7be6f04aa2700e818f8a3bfe82801b5bd6e8c4f4/ext/nokogiri/xml_syntax_error.c#L3-L24

This issue is opened to make sure we remember to do this everywhere.

It's somewhat related to wrapping we need to do around any libxml2 callbacks which re-enter the Ruby interpreter and how we handle those exceptions, all of which are detailed at #1610.

nwellnhof commented 2 weeks ago

Also see xmlCtxtSetErrorHandler available since libxml2 2.13.0.

flavorjones commented 2 weeks ago

Thanks for the pointer, @nwellnhof, that's great, and I'll make sure the error functions take the additional parser context argument so that I can polyfill.

Also note, the function names were changed in 04001e7c to be noko__structured_error_func_save_and_set, etc.

In that commit I also did a bit of the described wrapping.

nwellnhof commented 2 weeks ago

I don't think the polyfill approach will work here. Save-and-restore is the only option for older libxml2.

flavorjones commented 2 weeks ago

@nwellnhof Sorry, you're right, I used "polyfill" when I meant "shim", something like

void
noko__structured_error_func_save_and_set(
  xmlParserCtxtPtr ctxt,
  libxmlStructuredErrorHandlerState *handler_state,
  void *user_data,
  xmlStructuredErrorFunc handler
)
{
#ifdef HAVE_XMLCTXTSETERRORHANDLER
  xmlCtxtSetErrorHandler(ctxt, handler, user_data);
#else
  noko__structured_error_func_save(handler_state);
  xmlSetStructuredErrorFunc(user_data, handler);
#endif
}

though now I see that there are flavors of this for XPath, XInclude, TestReader, and schemas, too ... in any case, we'll migrate towards these APIs over time for sure.

nwellnhof commented 2 weeks ago

That would work but the main benefit is that xmlCtxtSetErrorHandler only has to be called once on a parser context, preferably right after creating it.