sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License
6.12k stars, 901 forks

[feature request] Make C-extension threadsafe/Ractor-safe #3281

Closed mohamedhafez closed 3 days ago

mohamedhafez commented 3 days ago

Planning ahead for the near future, when TruffleRuby can run C-extensions marked with rb_ext_ractor_safe(true) in parallel and Ractors are no longer just experimental, it would be great if the C-extension could be made threadsafe, or marked as such if it already is!

stevecheckoway commented 3 days ago

Libxml2 doesn't support concurrent modifications to the same document. See https://gitlab.gnome.org/GNOME/libxml2/-/wikis/Thread-safety

mohamedhafez commented 3 days ago

So the way Ractors work is that only one of them can access a given document object at a time, so libxml2's limitation of not supporting concurrent modifications on the same document actually shouldn't be an issue: https://ruby-doc.org/core-3.0.0/Ractor.html
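To make the ownership point concrete, here is a minimal sketch using only core Ruby (3.0 or later). The Array is a hypothetical stand-in for a document object, not a real Nokogiri document, and whether a Nokogiri document can be moved between Ractors like this is exactly what is unverified here.

```ruby
# Hypothetical stand-in for a parsed document: a plain mutable Array of node names.
doc = ["root", "child"]

worker = Ractor.new do
  received = Ractor.receive       # blocks until the main Ractor sends an object
  received << "added-in-worker"   # mutate the object inside the worker Ractor
  received.size
end

# Move (rather than copy) the object: ownership transfers to the worker.
worker.send(doc, move: true)

# After the move, the sending Ractor can no longer touch the object.
moved_error_raised =
  begin
    doc.size
    false
  rescue Ractor::MovedError
    true
  end

# Ractor#take was replaced by Ractor#value in newer Rubies, so support both.
result = worker.respond_to?(:value) ? worker.value : worker.take
```

If this behaves as the Ractor documentation describes, the main Ractor loses access to `doc` the moment it is moved, so two Ractors never modify the same object at the same time.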

What I'm hoping to avoid is the current situation, in which different document objects can't be accessed concurrently. According to the link you posted, libxml2 explicitly allows this as long as you:

  • configure the library accordingly using the --with-threads option

  • call xmlInitParser() in the "main" thread before using any of the libxml2 API (except possibly selecting a different memory allocator)

So I'm hoping this actually should be trivial!

(I'm addressing the use case of Ractors only here, since that's the only way it would happen in canonical, regular C-Ruby. Thread.new and Fibers are still subject to the GVL in C-Ruby, and Ractors are the only way to do true concurrency. TruffleRuby and JRuby users already know to protect access to the same object with a Mutex if they are doing multithreaded programming, and if they don't they are going to be screwed in a million other places ;)
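For readers less familiar with the Mutex pattern mentioned above, a minimal sketch in plain Ruby follows. The `shared_nodes` Array is a hypothetical stand-in for a document shared between threads, and the thread and iteration counts are arbitrary.

```ruby
# Hypothetical stand-in for a shared document: a plain Array of node names.
shared_nodes = []
lock = Mutex.new

threads = 4.times.map do |i|
  Thread.new do
    100.times do |j|
      # Only one thread at a time may mutate the shared object.
      lock.synchronize { shared_nodes << "node-#{i}-#{j}" }
    end
  end
end

# Wait for every thread to finish before reading the result.
threads.each(&:join)

puts shared_nodes.size
```

On implementations without a GVL, the `synchronize` block is what keeps the concurrent appends from corrupting the shared Array; the same idea would apply to any single document object touched from several threads.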

mohamedhafez commented 3 days ago

@eregon perhaps you or someone on the TruffleRuby team could lend a little more gravitas to my argument above? 😅

flavorjones commented 3 days ago

@mohamedhafez Thanks for opening this issue. Earlier this year I spent some time exploring how ractors and the sqlite3 gem interact, so I have questions.

Have you tried parsing and manipulating documents in different ractors? What was your experience like? What worked and what didn't work?

Our mental model is that although libxml2 doesn't support concurrent operations within a single document, each ractor should be able to parse and manipulate a separate document, and I'd like to update our mental model if your experience has been something different.

When you say "support for ractors" I'm trying to understand your specific use case, and what specific error message motivated you to open this issue. Passing objects between ractors can be hard for complex object graphs, and so any additional information you can provide would help me form better mental models.