stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

NER error after loading a CONLL-U document: doc.text is None #1428

Open zbeloki opened 2 days ago

zbeloki commented 2 days ago

I get the following error when running NER: TypeError: 'NoneType' object is not subscriptable

After debugging, I found that Stanza tries to access the document's text attribute, which is None. I'm loading the document from a CONLL-U file created with Stanza, using the function stanza.utils.conll.conll2doc. It seems that loaded documents don't get their text attribute set. Each sentence has its own text, but the main document does not, and that is what Stanza accesses in order to create the entity spans.

Is it possible to build the document's text from the sentences? That would fix the problem, I guess.
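To make the idea concrete, here is a minimal sketch using plain Python strings rather than Stanza objects. The helper name `rebuild_text` and the single-space separator are assumptions for illustration only, not part of Stanza. The original whitespace between sentences cannot be recovered from the sentence texts alone, so any separator is a guess.

```python
def rebuild_text(sentence_texts, sep=" "):
    """Join sentence texts and return (text, [(start_char, end_char), ...]).

    Hypothetical helper: `sep` is an assumed separator, since the original
    spacing between sentences is not stored in the sentence texts.
    """
    parts = []
    offsets = []
    pos = 0
    for i, sent in enumerate(sentence_texts):
        if i > 0:
            # Account for the separator placed before every sentence but the first.
            parts.append(sep)
            pos += len(sep)
        offsets.append((pos, pos + len(sent)))
        parts.append(sent)
        pos += len(sent)
    return "".join(parts), offsets


sentences = ["Dr. Pritchett gave me a new hip.", "It works fine."]
text, offsets = rebuild_text(sentences)

# Slicing the rebuilt text with the recorded offsets gives back each sentence.
start, end = offsets[1]
print(text[start:end])
```

With a document-level text and consistent character offsets like these, a slice of the form text[start_char:end_char] has something to index into.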

This is the entire stack trace:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
    cli.main()
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
    run()
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
    return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
    _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
    exec(code, run_globals)
  File "stanza/prepare_eval_data.py", line 59, in <module>
    main(args)
  File "stanza/prepare_eval_data.py", line 30, in main
    doc = nlp(doc_tokenized)
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/pipeline/core.py", line 480, in __call__
    return self.process(doc, processors)
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/pipeline/core.py", line 431, in process
    doc = process(doc)
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/pipeline/ner_processor.py", line 123, in process
    total = len(batch.doc.build_ents())
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/models/common/doc.py", line 433, in build_ents
    s_ents = s.build_ents()
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/models/common/doc.py", line 752, in build_ents
    self.ents.append(Span(tokens=ent_tokens, type=e['type'], doc=self.doc, sent=self))
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/models/common/doc.py", line 1601, in __init__
    self.init_from_tokens(tokens, type)
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/models/common/doc.py", line 1618, in init_from_tokens
    self.text = self.doc.text[self.start_char:self.end_char]
TypeError: 'NoneType' object is not subscriptable
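The last frame of the traceback boils down to slicing a value that is None. As a rough illustration, the sketch below reproduces the same TypeError without Stanza. `FakeDoc` and `span_text` are hypothetical stand-ins, not Stanza types or functions; only the slicing expression is taken from the traceback.

```python
class FakeDoc:
    """Hypothetical stand-in for a document whose text was never set."""

    def __init__(self, text=None):
        self.text = text


def span_text(doc, start_char, end_char):
    # Same shape as the failing expression: doc.text[start_char:end_char]
    return doc.text[start_char:end_char]


try:
    span_text(FakeDoc(), 0, 13)
    message = None
except TypeError as err:
    # Slicing None raises the error reported in the issue.
    message = str(err)

print(message)

# With the text set, the same slice succeeds.
print(span_text(FakeDoc("Dr. Pritchett gave me a new hip"), 0, 13))
```

This suggests the NER step itself is not at fault. Any code path that builds a span from character offsets needs the document-level text to be populated first.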

AngledLuffa commented 2 days ago

Do you have a code sample which shows this? I ran a small example and it works fine:

>>> import stanza
>>> pipe = stanza.Pipeline("en", processors="tokenize,ner")
>>> pipe("Dr. Pritchett gave me a new hip")
etc etc