Open BramVanroy opened 1 year ago
I tried your code example, thanks for including that. However, my experience is different from yours on Linux. There, it hangs with no errors or warnings whatsoever. This is with spacy 3.5.2, spacy-stanza 1.0.3, and the current dev branch of Stanza.
I found this issue from a couple years ago: https://github.com/explosion/spacy-stanza/issues/34
If I add the lines suggested in that issue:
import torch
torch.set_num_threads(1)
It works now. My impression is that there isn't much we can do on our side to make it work with torch's num_threads > 1, but if there is, please let me know and we can keep looking. I do get the warning:
/usr/local/lib64/python3.9/site-packages/spacy/language.py:2273: UserWarning: [W109] Unable to save user hooks while serializing the doc. Re-add any required user hooks to the doc after processing.
byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
To be entirely honest, I have no idea what could be causing this. A brief Google search only turns up more people asking about the same issue. I guess it's something in the spaCy doc which can't be serialized to the child processes?
I do find it a bit hard to believe that using multiprocessing with torch in single-threaded CPU mode is going to be faster than either using the GPU or just a single process with torch being multithreaded, but it is true that processors such as the tokenizer and the constituency parser spend quite a bit of time on the CPU, so maybe it works out.
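For reference, the workaround described above could be put together roughly as follows. This is only a sketch: the helper name `parse_in_parallel`, the language code "en", and the default of two processes are assumptions for illustration, and it presumes the English Stanza models have already been downloaded.

```python
def parse_in_parallel(texts, n_process=2):
    """Run a spacy-stanza pipeline over texts using several worker processes.

    Hypothetical helper, not part of spacy-stanza or Stanza.
    """
    # Imports are kept inside the function so that merely defining it
    # does not require torch or spacy-stanza to be installed.
    import torch
    import spacy_stanza

    # Workaround from explosion/spacy-stanza#34: limit torch to one
    # intra-op thread before any worker processes are started.
    torch.set_num_threads(1)

    # Assumes the "en" models are already available locally.
    nlp = spacy_stanza.load_pipeline("en")

    # nlp.pipe distributes the texts over n_process worker processes.
    return list(nlp.pipe(texts, n_process=n_process))
```

Calling `parse_in_parallel(["First sentence.", "Second sentence."], n_process=4)` would then return a list of spaCy `Doc` objects. Whether this ends up faster than a single multithreaded process likely depends on the processors in use.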
Describe the bug
spacy_stanza allows users to get the output back in spaCy format, but also integrates a nifty multi-processing option through nlp.pipe(data, n_process=4). This used to work well (<1.4.0) but we recently found that this functionality is not compatible with stanza any more because of pickling issues (error trace below). So something must have changed since 1.4.0 that is not picklable anymore. I remember having a similar issue before (it was a lambda function). It might be useful to add a pickling test to the test suite. We could just stick with 1.3.0 but that means we have to miss out on constituency parsing.
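To illustrate the kind of pickling test suggested above, here is a minimal sketch using only the standard library. The helper `find_unpicklable` and the two example dictionaries are hypothetical stand-ins; they are not taken from Stanza's code or test suite.

```python
import pickle


def find_unpicklable(attrs):
    """Return the names of all values in attrs that cannot be pickled."""
    failed = []
    for name, value in attrs.items():
        try:
            pickle.dumps(value)
        except Exception:
            # Depending on the Python version, pickle raises PicklingError,
            # AttributeError, or TypeError for unpicklable objects.
            failed.append(name)
    return failed


# A builtin function is pickled by reference, so it survives.
picklable_attrs = {"lang": "en", "key_fn": len}

# A lambda has no importable name, so pickling it fails.
lambda_attrs = {"lang": "en", "key_fn": lambda x: len(x)}

print(find_unpicklable(picklable_attrs))
print(find_unpicklable(lambda_attrs))
```

In a real test suite, one could presumably pass something like `vars(component)` for each pipeline component and assert that the returned list is empty, which would point at the offending attribute by name.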
To Reproduce
Expected behavior
Environment (please complete the following information):
Additional context
Error message in unsupported stanza versions (1.4, 1.5):