glossarist / iev-data


Multithreading #139

Open skalee opened 3 years ago

skalee commented 3 years ago

This thing could benefit a lot from multithreading, both on desktop and in GHA (the latter features 2 or 3 CPU cores). Most of the processing is parsing various strings, and this can be done in parallel. Nokogiri is thread-safe (at least they have thread-safety fixes in their changelog). I'm not sure if Relaton is thread-safe, but we can wrap it in a monitor.
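For illustration, wrapping a possibly non-thread-safe call in a monitor could look roughly like the sketch below. `UnsafeLookup` and `SynchronizedLookup` are made-up names standing in for whatever needs protecting; this is not Relaton's actual API, only an assumption about the shape of the wrapper.

```ruby
require "monitor"

# Hypothetical stand-in for a library call that may not be thread-safe.
# The read-modify-write on @calls is deliberately racy.
class UnsafeLookup
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def fetch(ref)
    current = @calls
    Thread.pass # widen the race window between read and write
    @calls = current + 1
    "resolved:#{ref}"
  end
end

# Wrapper that lets only one thread at a time into the wrapped object.
class SynchronizedLookup
  def initialize(inner)
    @inner = inner
    @monitor = Monitor.new
  end

  def fetch(ref)
    @monitor.synchronize { @inner.fetch(ref) }
  end
end

lookup = UnsafeLookup.new
safe = SynchronizedLookup.new(lookup)

threads = 4.times.map do |i|
  Thread.new { 50.times { |j| safe.fetch("ref-#{i}-#{j}") } }
end
threads.each(&:join)

puts lookup.calls
```

The trade-off is that every call through the wrapper is serialized, so this only pays off if the wrapped calls are a small share of the total work.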

ronaldtse commented 3 years ago

Good idea, combined with new concurrency features in Ruby 3 it would work well.

skalee commented 3 years ago

I'm going to use the good old Queue class for thread synchronization. It's going to be a really simple concurrency model with all processing threads running the same code, so I guess actors won't give us anything. Anyway, we can't use Ruby 3 because Metanorma is not ready for that (unless something changed).

ronaldtse commented 3 years ago

I don't think iev-data relies on Metanorma? Maybe you mean Relaton?

You plan to use a single Queue for multiple thread consumers, right? Because the writing of individual files is separate, we don't need to sync after processing.

skalee commented 3 years ago

> I don't think iev-data relies on Metanorma? Maybe you mean Relaton?

Yes, I meant Relaton, sorry.

> You plan to use a single Queue for multiple thread consumers, right?

Exactly.

> Because the writing of individual files is separate, we don't need to sync after processing.

Files are currently written at the very end and I'm not going to change that. It will be performed by a single thread. (Though actually writing YAML as soon as a given concept is complete sounds appealing.)
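A minimal sketch of the model described here: one shared Queue, several identical worker threads, and a single thread serializing everything at the very end. The `process` method, the sample rows, and the worker count are all assumptions made up for the example, not code from iev-data.

```ruby
require "yaml"

# Hypothetical stand-in for the per-concept string parsing.
def process(row)
  { "id" => row[:id], "term" => row[:term].strip.downcase }
end

worker_count = 3 # assumed; roughly the number of CPU cores
jobs = Queue.new
results = Queue.new

# Made-up input rows.
rows = [
  { id: "c1", term: "  Alpha " },
  { id: "c2", term: "BETA" },
  { id: "c3", term: " Gamma" },
  { id: "c4", term: "delta  " }
]
rows.each { |row| jobs << row }
jobs.close # once closed and drained, pop returns nil instead of blocking

# All workers run the same loop and pull from the same queue.
workers = worker_count.times.map do
  Thread.new do
    while (row = jobs.pop)
      results << process(row)
    end
  end
end
workers.each(&:join)

# Single-threaded final step: collect, order, and serialize.
concepts = []
concepts << results.pop until results.empty?
concepts.sort_by! { |concept| concept["id"] }
yaml_docs = concepts.map(&:to_yaml)

puts yaml_docs.first
```

Closing the queue avoids pushing one sentinel value per worker. Sorting at the end is needed because workers finish in no particular order.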

ronaldtse commented 3 years ago

This is fine; I suppose we can also do verification at the end, for example of links.

skalee commented 3 years ago

So far I'm not getting good results here. I don't know why yet: this type of processing should be really straightforward to parallelize, but the performance gains are minimal. Maybe it's because of the GIL. I did some profiling and learned that the program spends most of its time on math conversions, which rely on native extensions, and those typically run while holding the GIL.
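A rough way to check the GIL hypothesis is to time the same CPU-bound work serially and in threads. The sketch below uses a made-up `busy_work` method as a stand-in for the math conversions, and the job sizes are arbitrary. On CRuby the two timings usually come out close, because only one thread executes Ruby code at a time; other implementations may behave differently.

```ruby
require "benchmark"

# Hypothetical CPU-bound stand-in for the real conversions.
def busy_work(n)
  sum = 0
  n.times { |i| sum += i * i }
  sum
end

iterations = 200_000 # arbitrary size, large enough to be measurable
job_count = 4

serial_result = nil
serial_time = Benchmark.realtime do
  serial_result = job_count.times.map { busy_work(iterations) }
end

threaded_result = nil
threaded_time = Benchmark.realtime do
  threads = job_count.times.map { Thread.new { busy_work(iterations) } }
  threaded_result = threads.map(&:value) # value joins and returns the result
end

puts format("serial: %.3fs, threaded: %.3fs", serial_time, threaded_time)
```

If the threaded run is not noticeably faster, the bottleneck is work done under the lock, and adding threads will not help until that work is reduced or moved elsewhere.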

Furthermore, it seems that CLI UI operations are not atomic and not thread-safe, see https://github.com/glossarist/iev-data/pull/143#issue-592373342. This needs to be addressed too.
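One possible way to address this is to funnel all output through a single mutex so that each message is written as a unit. The sketch below is only an assumption about how that could look; `SynchronizedOutput` is a made-up class and does not reflect the UI code actually used in iev-data.

```ruby
require "stringio"

# Serializes multi-step writes so output from different threads cannot interleave.
class SynchronizedOutput
  def initialize(io)
    @io = io
    @mutex = Mutex.new
  end

  def report(label, message)
    @mutex.synchronize do
      # Three separate writes that must stay together on one line.
      @io.print("[#{label}] ")
      @io.print(message)
      @io.print("\n")
    end
  end
end

buffer = StringIO.new # stands in for $stdout so the result can be inspected
out = SynchronizedOutput.new(buffer)

threads = 4.times.map do |i|
  Thread.new do
    25.times { |j| out.report("worker-#{i}", "concept #{j} done") }
  end
end
threads.each(&:join)

lines = buffer.string.lines
puts lines.size
```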

I'll keep this issue open, but adding multithreading doesn't seem as easy and as beneficial as I initially thought. I need to rework math conversions anyway, as there are plenty of reported issues. Perhaps I'll find a way to eliminate the bottleneck and multithreading will make sense again.