Closed DBCerigo closed 1 year ago
Found it in the docs https://github.com/zvelo/libstemmer/blob/78c149a3a6f262a35c7f7351d3f77b725fc646cf/README#L45
Stemmers are re-entrant, but not threadsafe. In other words, if
you wish to access the same stemmer object from multiple threads,
you must ensure that all access is protected by a mutex or similar
device.
Apologies for the extra noise.
I should point out that's not our repo you've linked to - it's a git repo created from an ancient tarball from the old website (snowball.tartarus.org) - it's essentially a decade-old (and, judging from their git history, barely maintained) fork, and definitely not a canonical source of information about current Snowball releases from our repo. You're also looking at the README for the C libstemmer, not Python.
Your conclusion is right though, and I think it's currently true for all the target languages. The issue is that the Snowball cursor position, limits, slice, and any Snowball variables are stored in the object (or C struct
or whatever). It would probably be possible to avoid this, but it would need to be done without adding overhead per word stemmed as we expect a lot of words to need stemming in the intended domain of use.
I should point out that's not our repo you've linked to - it's a git repo created from an ancient tarball from the old website (snowball.tartarus.org) - it's essentially a decade-old (and, judging from their git history, barely maintained) fork, and definitely not a canonical source of information about current Snowball releases from our repo. You're also looking at the README for the C libstemmer, not Python.
Thanks for the clarification, appreciated.
Your conclusion is right though, and I think it's currently true for all the target languages. The issue is that the Snowball cursor position, limits, slice, and any Snowball variables are stored in the object (or C
struct
or whatever). It would probably be possible to avoid this, but it would need to be done without adding overhead per word stemmed as we expect a lot of words to need stemming in the intended domain of use.
I couldn't find a similar reference to Snowball not being thread-safe within this repo, but it is likely I have just not managed to find it. If there isn't anywhere in the docs that state that Snowball instances are not thread-safe and that to use in a multi-thread/process run one should instantiate a Snowball stemmer instance within each thread (as opposed to sharing one instance across multiple threads), then that could be a valuable addition to the docs to help future uses avoid such race-condition intermittent bugs that are particularly hard to track down.
Thanks for all your time and energy that goes into this package. It is appreciated!
doc/libstemmer_c_README
is the equivalent of the file you were looking at in the other repo.
As you say, we should document this in all the target language README files though.
~@ojwb if you point me to where that is I can have a go at adding it if that would be helpful.~
I misread you message, it's clear what is requested.
A patch would be lovely. The target language README files are either in doc/
or the language subdirectory - see output of: git ls-files '*/*README*'
One thought - I don't write a lot of heavily multithreaded code, but I think it would probably be better advice to suggest creating one stemmer object per thread wanting to stem words - currently we talk about sharing one stemmer object and protecting it with a mutex or similar, but that adds contention on a single resource without a good reason to.
https://github.com/snowballstem/snowball/pull/163 :)
I concur(ed) with your thought on advising user to instantiate per thread is more user friendly than advising them to use a mutex/lock.
Fixed by #163.
Hi, we've been experiencing intermittent inconsistent outputs (i.e. bug) when using
snowball
withdask
multiprocessing. We can stop these bugs occurring by any of a) using a single process/thread, b) removing the stemming, or c) moving the instantiation of the stemmer inside the function which is being applied within threads.Could someone with expertise input on whether
snowball
is thread-safe or not?Might be related to #146 which seems to imply that the C# implementation is not thread safe.