snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License
757 stars 173 forks source link

Clarification on if `snowball` (specifically python implemenation) is not thread-safe #162

Closed DBCerigo closed 1 year ago

DBCerigo commented 2 years ago

Hi, we've been experiencing intermittent inconsistent outputs (i.e. bug) when using snowball with dask multiprocessing. We can stop these bugs occurring by any of a) using a single process/thread, b) removing the stemming, or c) moving the instantiation of the stemmer inside the function which is being applied within threads.

Could someone with expertise input on whether snowball is thread-safe or not?

Might be related to #146 which seems to imply that the C# implementation is not thread safe.

DBCerigo commented 2 years ago

Found it in the docs https://github.com/zvelo/libstemmer/blob/78c149a3a6f262a35c7f7351d3f77b725fc646cf/README#L45

Stemmers are re-entrant, but not threadsafe.  In other words, if
you wish to access the same stemmer object from multiple threads,
you must ensure that all access is protected by a mutex or similar
device.

Apologies for the extra noise.

ojwb commented 2 years ago

I should point out that's not our repo you've linked to - it's a git repo created from an ancient tarball from the old website (snowball.tartarus.org) - it's essentially a decade-old (and, judging from their git history, barely maintained) fork, and definitely not a canonical source of information about current Snowball releases from our repo. You're also looking at the README for the C libstemmer, not Python.

Your conclusion is right though, and I think it's currently true for all the target languages. The issue is that the Snowball cursor position, limits, slice, and any Snowball variables are stored in the object (or C struct or whatever). It would probably be possible to avoid this, but it would need to be done without adding overhead per word stemmed as we expect a lot of words to need stemming in the intended domain of use.

DBCerigo commented 2 years ago

I should point out that's not our repo you've linked to - it's a git repo created from an ancient tarball from the old website (snowball.tartarus.org) - it's essentially a decade-old (and, judging from their git history, barely maintained) fork, and definitely not a canonical source of information about current Snowball releases from our repo. You're also looking at the README for the C libstemmer, not Python.

Thanks for the clarification, appreciated.

Your conclusion is right though, and I think it's currently true for all the target languages. The issue is that the Snowball cursor position, limits, slice, and any Snowball variables are stored in the object (or C struct or whatever). It would probably be possible to avoid this, but it would need to be done without adding overhead per word stemmed as we expect a lot of words to need stemming in the intended domain of use.

I couldn't find a similar reference to Snowball not being thread-safe within this repo, but it is likely I have just not managed to find it. If there isn't anywhere in the docs that state that Snowball instances are not thread-safe and that to use in a multi-thread/process run one should instantiate a Snowball stemmer instance within each thread (as opposed to sharing one instance across multiple threads), then that could be a valuable addition to the docs to help future uses avoid such race-condition intermittent bugs that are particularly hard to track down.

Thanks for all your time and energy that goes into this package. It is appreciated!

ojwb commented 2 years ago

doc/libstemmer_c_README is the equivalent of the file you were looking at in the other repo.

As you say, we should document this in all the target language README files though.

DBCerigo commented 2 years ago

~@ojwb if you point me to where that is I can have a go at adding it if that would be helpful.~

I misread you message, it's clear what is requested.

ojwb commented 2 years ago

A patch would be lovely. The target language README files are either in doc/ or the language subdirectory - see output of: git ls-files '*/*README*'

One thought - I don't write a lot of heavily multithreaded code, but I think it would probably be better advice to suggest creating one stemmer object per thread wanting to stem words - currently we talk about sharing one stemmer object and protecting it with a mutex or similar, but that adds contention on a single resource without a good reason to.

DBCerigo commented 2 years ago

https://github.com/snowballstem/snowball/pull/163 :)

I concur(ed) with your thought on advising user to instantiate per thread is more user friendly than advising them to use a mutex/lock.

ojwb commented 1 year ago

Fixed by #163.