stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Multi-processing stopped working since 1.4 #1242

Open BramVanroy opened 1 year ago

BramVanroy commented 1 year ago

Describe the bug spacy_stanza allows users to get the output back in spaCy format, and it also integrates a nifty multi-processing option through nlp.pipe(data, n_process=4). This used to work well (<1.4.0), but we recently found that this functionality is no longer compatible with stanza because of pickling issues (error trace below). So something introduced in 1.4.0 must no longer be picklable. I remember having a similar issue before (it was a lambda function). It might be useful to add a pickling test to the test suite.

We could just stick with 1.3.0, but that would mean missing out on constituency parsing.
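Such a test could be as simple as a pickle round trip of a freshly built pipeline. A minimal sketch (the processor choice here is arbitrary, and it assumes the Spanish models are already downloaded):

import pickle

import stanza

nlp = stanza.Pipeline("es", processors="tokenize,pos", use_gpu=False)

# Expected to succeed on 1.3.0; on 1.4/1.5 this presumably raises
# TypeError: cannot pickle '_thread.lock' object
restored = pickle.loads(pickle.dumps(nlp))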

To Reproduce

python -m venv .venv --prompt "stanza test"
source .venv/bin/activate
python -m pip install spacy spacy-stanza
python -c "import stanza; stanza.download('es')"

Then run the following script:

import spacy_stanza

if __name__ == '__main__':
    data = ["I like cokies", "Do you like them ?", "Lets' start a bakery !"]
    nlp_spacy_stanza = spacy_stanza.load_pipeline("es", processors="tokenize,pos,lemma,depparse",
                                                  use_gpu=False,
                                                  tokenize_pretokenized=True)

    docs = list(nlp_spacy_stanza.pipe(data, n_process=3))

# Does not work on 1.5.0
# Does not work on 1.4.0
# Works on 1.3.0 but: UserWarning: [W109] Unable to save user hooks while serializing the doc. Re-add any required user hooks to the doc after processing.
#   byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]

Expected behavior nlp.pipe with n_process > 1 completes without hanging or pickling errors, as it does on stanza 1.3.0.

Additional context

Error message on the affected stanza versions (1.4, 1.5):

Traceback (most recent call last):
    docs = list(nlp_spacy_stanza.pipe(data, n_process=3))
  File ".venv\lib\site-packages\spacy\language.py", line 1574, in pipe
    for doc in docs:
  File ".venv\lib\site-packages\spacy\language.py", line 1640, in _multiprocessing_pipe
    proc.start()
  File "lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle '_thread.lock' object

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
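For context on the TypeError: lock objects are never picklable, so any object in the pipeline that holds a live lock fails exactly like this. A standalone demonstration, independent of stanza's internals:

import pickle
import threading

class Holder:
    """Stands in for any pipeline component that keeps a live lock as an attribute."""
    def __init__(self):
        self.lock = threading.Lock()

pickle.dumps(Holder())  # TypeError: cannot pickle '_thread.lock' object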
AngledLuffa commented 1 year ago

I tried your code example, thanks for including that. However, my experience is different from yours on Linux. There, it hangs with no errors or warnings whatsoever. This is with spacy 3.5.2, spacy-stanza 1.0.3, and the current dev branch of Stanza.

I found this issue from a couple years ago: https://github.com/explosion/spacy-stanza/issues/34

If I add the lines suggested in that issue:

import torch

torch.set_num_threads(1)

With that change it works. My impression is that there isn't much we can do on our side to make this work with torch's num_threads > 1, but if there is, please let me know and we can keep looking. I do still get the warning

/usr/local/lib64/python3.9/site-packages/spacy/language.py:2273: UserWarning: [W109] Unable to save user hooks while serializing the doc. Re-add any required user hooks to the doc after processing.
  byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]

To be entirely honest, I have no idea what could be causing that warning. A brief Google search only turns up more people asking about the same issue. I guess it's something in the spaCy doc which can't be serialized to the child processes?

I do find it a bit hard to believe that multiprocessing with torch in single-threaded CPU mode will be faster than either using the GPU or running a single process with torch multithreaded. But processors such as the tokenizer and the constituency parser do spend quite a bit of time on the CPU, so maybe it works out.
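An easy way to settle that tradeoff is to time the same texts under different process counts. A rough sketch (the corpus here is a placeholder, and a real benchmark should also vary torch's thread count for the single-process baseline):

import time

import spacy_stanza
import torch

torch.set_num_threads(1)

def bench(nlp, texts, n_process):
    """Return wall-clock seconds to run the pipeline over texts."""
    start = time.perf_counter()
    list(nlp.pipe(texts, n_process=n_process))
    return time.perf_counter() - start

if __name__ == "__main__":
    texts = ["Me gusta el pan .", "Vamos a abrir una panadería ."] * 500  # placeholder corpus
    nlp = spacy_stanza.load_pipeline("es", processors="tokenize,pos,lemma,depparse",
                                     use_gpu=False, tokenize_pretokenized=True)
    for n in (1, 2, 4):
        print(f"{n} processes: {bench(nlp, texts, n):.1f}s")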