Open rahonalab opened 4 months ago
Random request, this is really hard to read, please check the formatting next time on the stack traces
Try adding `allow_unknown_language=True` to the Pipeline construction:

pipeline = stanza.Pipeline("sq", dir="DIR_TO_THE_MODEL", download_method=None, allow_unknown_language=True)
many thanks AngledLuffa, now it works. And sorry about the awful formatting :(
unfortunately, the new option and/or the dev branch doesn't seem to work. If I load models using the config dictionary, I get the following:
2024-03-06 10:26:43 INFO: Using device: cuda
2024-03-06 10:26:43 INFO: Loading: tokenize
2024-03-06 10:26:43 DEBUG: With settings:
2024-03-06 10:26:43 DEBUG: {'model_path': '/corpus/saved_models/tokenize/sq_nel_tokenizer.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:00 DEBUG: Building Adam with lr=0.002000, betas=(0.9, 0.9), eps=0.000000, weight_decay=0.0
2024-03-06 10:27:01 INFO: Loading: mwt
2024-03-06 10:27:01 DEBUG: With settings:
2024-03-06 10:27:01 DEBUG: {'model_path': '/corpus/saved_models/mwt/sq_nel_mwt_expander.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:01 DEBUG: Building an attentional Seq2Seq model...
2024-03-06 10:27:01 DEBUG: Using a Bi-LSTM encoder
2024-03-06 10:27:01 DEBUG: Using soft attention for LSTM.
2024-03-06 10:27:01 DEBUG: Finetune all embeddings.
2024-03-06 10:27:01 DEBUG: Building Adam with lr=0.001000, betas=(0.9, 0.999), eps=0.000000
2024-03-06 10:27:01 INFO: Loading: pos
2024-03-06 10:27:01 DEBUG: With settings:
2024-03-06 10:27:01 DEBUG: {'model_path': '/corpus/saved_models/pos/sq_nel_nocharlm_tagger.pt', 'pretrain_path': '/corpus/saved_models/pretrain/fasttextwiki.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:01 DEBUG: Loading pretrain /corpus/saved_models/pretrain/fasttextwiki.pt
2024-03-06 10:27:02 DEBUG: Loaded pretrain from /corpus/saved_models/pretrain/fasttextwiki.pt
2024-03-06 10:27:03 DEBUG: Building Adam with lr=0.003000, betas=(0.9, 0.95), eps=0.000001
2024-03-06 10:27:03 INFO: Loading: lemma
2024-03-06 10:27:03 DEBUG: With settings:
2024-03-06 10:27:03 DEBUG: {'model_path': '/corpus/saved_models/lemma/sq_nel_nocharlm_lemmatizer.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:03 DEBUG: Building an attentional Seq2Seq model...
2024-03-06 10:27:03 DEBUG: Using a Bi-LSTM encoder
2024-03-06 10:27:03 DEBUG: Using soft attention for LSTM.
2024-03-06 10:27:03 DEBUG: Using POS in encoder
2024-03-06 10:27:03 DEBUG: Finetune all embeddings.
2024-03-06 10:27:03 DEBUG: Running seq2seq lemmatizer with edit classifier...
2024-03-06 10:27:03 DEBUG: Building Adam with lr=0.001000, betas=(0.9, 0.999), eps=0.000000
2024-03-06 10:27:03 INFO: Loading: depparse
2024-03-06 10:27:03 DEBUG: With settings:
2024-03-06 10:27:03 DEBUG: {'model_path': '/corpus/saved_models/depparse/sq_nel_nocharlm_parser_checkpoint.pt', 'pretrain_path': '/corpus/saved_models/pretrain/fasttextwiki.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:03 DEBUG: Reusing pretrain /corpus/saved_models/pretrain/fasttextwiki.pt
2024-03-06 10:27:04 DEBUG: Building Adam with lr=0.003000, betas=(0.9, 0.95), eps=0.000001
2024-03-06 10:27:05 INFO: Done loading processors!
Reading: /corpus/texts/100Years_Albanian.txt
Starting parser...
endminiciep+ string found
Parsing miniciep+
2024-03-06 10:27:19 DEBUG: 6 batches created.
2024-03-06 10:27:22 DEBUG: 450 batches created.
2024-03-06 10:27:22 DEBUG: 127 batches created.
Traceback (most recent call last):
  File "/tools/ud-stanza-ciep.py", line 119, in <module>
    main()
  File "/tools/ud-stanza-ciep.py", line 114, in main
    parseciep(nlp, file_content, filename, args.target, args.miniciep)
  File "/tools/parsing/stanza_parser.py", line 80, in parseciep
    miniciep = nlp(preparetext(splitciep[0]))
  File "/opt/conda/lib/python3.10/site-packages/stanza/pipeline/core.py", line 480, in __call__
    return self.process(doc, processors)
  File "/opt/conda/lib/python3.10/site-packages/stanza/pipeline/core.py", line 431, in process
    doc = process(doc)
  File "/opt/conda/lib/python3.10/site-packages/stanza/pipeline/depparse_processor.py", line 57, in process
    raise ValueError("POS not run before depparse!")
ValueError: POS not run before depparse!
but the pos processor is actually loaded!

bonus question: what is the difference between

depparse
├── sq_nel_nocharlm_parser_checkpoint.pt
└── sq_nel_nocharlm_parser.pt
The `_checkpoint` files include the most recent state of the optimizer, even if the dev scores of the latest model didn't go up and therefore the main save file wasn't updated. You'll notice that the non-checkpoint file is much smaller than the checkpoint file... that's the optimizer. You can restart a training run that got interrupted in the middle, although if it got interrupted while saving the checkpoint file, you're probably screwed (something we should address).
I can see that you're loading the POS model first before the depparse. Sanity check first - is the POS model labeling either upos or xpos? If somehow it was trained to only label the features, I could see it throwing this kind of error. Otherwise, the code makes it clear exactly when this particular error should happen - it only triggers if both upos and xpos are missing for a word.
if any(word.upos is None and word.xpos is None for sentence in document.sentences for word in sentence.words):
    raise ValueError("POS not run before depparse!")
If the POS model should be working, what happens if you run the pipeline without the depparse and print out the results? Are there any sentences for which the POS is actually missing?
I wonder if that can happen when the POS model has blank tags in the dataset it's learning from.
Many thanks for the detailed answer! This is really strange. I have tried to load the pipeline as I do in the script and it worked correctly on a few sentences. I have also tried to pass the script a small txt file with some sentences and it worked too. But then when I try to work on these txt files, as I did in the past, it throws the error. I assume there's something in these sentences, like an unknown word, that triggers the error - how can I circumvent it? The model I am using is highly experimental, so I expect that it misses a lot of things. But, again, this is strange: I have trained models on very small data in the past and they worked correctly on this dataset I am trying to parse.
> The model I am using is highly experimental, so I expect that it misses a lot of things
If it "misses" things by being incorrect, that's one thing. But I do very much wonder why it would label anything `None`.
Are you able to send the data + the data you are trying to test on, or maybe just send the model and the test data? I'd really like to see it in action myself to debug this issue.
Another possible debugging step would be to examine the output of just the tokenizer and the POS w/o any of the subsequent models and check for any words which are missing both xpos and upos.
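That debugging step can be scripted; here is a minimal sketch (my own, not from the thread) that scans the list-of-sentences structure in the general shape of stanza's `Document.to_dict()`, where each word is a dict and the `upos`/`xpos` keys may be absent or `None` for untagged words:

```python
def find_untagged(sentences):
    """Return (sentence_index, word_form) for every word that has neither
    UPOS nor XPOS -- the exact condition that makes the depparse processor
    raise "POS not run before depparse!".

    `sentences`: a list of sentences, each a list of word dicts, roughly
    the shape of stanza's Document.to_dict().
    """
    missing = []
    for idx, sentence in enumerate(sentences):
        for word in sentence:
            wid = word.get("id")
            # multiword token ranges (id is a pair) carry no tags by design
            if isinstance(wid, (tuple, list)):
                continue
            if word.get("upos") is None and word.get("xpos") is None:
                missing.append((idx, word.get("text")))
    return missing
```

Running the pipeline with `processors="tokenize,mwt,pos"` (so depparse never gets a chance to raise) and feeding the resulting `doc.to_dict()` to this helper should point straight at the offending words.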
Fascinating. I ran an experiment on English with DET/DT replaced with blanks. Apparently, giving the tagger empty tags for the POS tag results in it labeling words with None as tags. This must be what's happening to you - there are entries in your training data which don't have either UPOS or XPOS.
Is this something you want to fix on your end?
Maybe the tagger is supposed to ignore those items, or learn to tag them with `_`... not sure which would be more productive.

...to be more precise, it IS learning to tag words w/o tags with `_`, and then the pipeline itself treats that the same as a blank tag.
> Fascinating. I ran an experiment on English with DET/DT replaced with blanks. Apparently, giving the tagger empty tags for the POS tag results in it labeling words with None as tags. This must be what's happening to you - there are entries in your training data which don't have either UPOS or XPOS.
>
> Is this something you want to fix on your end?
Thing is, I have already used these data to train a model two or three times last November and it worked fine. I have just added a few sentences to teach the parser to recognize MWTs like Albanian ta = të + e. I'll try to run the parser without depparse and let you know...
It will successfully train a tagger even if there are empty tags. However, it's learned to recognize some words as having the empty tag, and that's the label the tagger gives those words. Did I express that clearly? I did the following experiment. Instead of sentences such as this in English, where `the` gets the tags `DET` and `DT`:
22 which which PRON WDT PronType=Rel 26 obj 20:ref _
23 they they PRON PRP Case=Nom|Number=Plur|Person=3|PronType=Prs 26 nsubj 26:nsubj _
24 should should AUX MD VerbForm=Fin 26 aux 26:aux _
25 have have AUX VB VerbForm=Inf 26 aux 26:aux _
26 left leave VERB VBN Tense=Past|VerbForm=Part 20 acl:relcl 20:acl:relcl _
27 in in ADP IN _ 29 case 29:case _
28 the the DET DT Definite=Def|PronType=Art 29 det 29:det _
29 car car NOUN NN Number=Sing 26 obl 26:obl:in SpaceAfter=No
I changed all instances of `the` to have `_` tags, so:
22 which which PRON WDT PronType=Rel 26 obj 20:ref _
23 they they PRON PRP Case=Nom|Number=Plur|Person=3|PronType=Prs 26 nsubj 26:nsubj _
24 should should AUX MD VerbForm=Fin 26 aux 26:aux _
25 have have AUX VB VerbForm=Inf 26 aux 26:aux _
26 left leave VERB VBN Tense=Past|VerbForm=Part 20 acl:relcl 20:acl:relcl _
27 in in ADP IN _ 29 case 29:case _
28 the the _ _ Definite=Def|PronType=Art 29 det 29:det _
29 car car NOUN NN Number=Sing 26 obl 26:obl:in SpaceAfter=No
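The substitution above can be reproduced with a few lines of Python (a sketch of my own, not the script actually used for the experiment); it blanks out columns 4 and 5 (UPOS, XPOS) of any CoNLL-U word line whose FORM matches:

```python
def blank_out_form(conllu_text, form):
    """Set UPOS and XPOS to '_' on every CoNLL-U word line whose FORM
    (column 2) equals `form`; comments, multiword-token ranges, and
    other lines pass through unchanged."""
    out = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # word lines have 10 tab-separated columns and a plain integer ID
        if len(cols) == 10 and cols[0].isdigit() and cols[1] == form:
            cols[3] = "_"  # UPOS
            cols[4] = "_"  # XPOS
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)
```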
Now the tagger I trained labels `the` with blank tags, which would trigger this error in the dependency parser, since it isn't expecting to receive blank tags.
I think it might make more sense to either throw an error when training a tagger on a partially complete file, or possibly treat single blank tags as masked out. Learning to recognize the blank tag doesn't seem very useful...
In the meantime, if you find and eliminate those blank tags from your dataset, I believe this error will go away.
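Hunting those blank tags down can be automated. A minimal sketch (mine, with a hypothetical function name) that reports every CoNLL-U word line where both tag columns are `_`:

```python
def blank_tag_lines(conllu_text):
    """Return (line_number, form) for each CoNLL-U word line whose UPOS and
    XPOS columns are both '_'. Comments, multiword-token ranges (IDs like
    "1-2"), and empty nodes (IDs like "1.1") are ignored, since those
    legitimately carry no tags."""
    hits = []
    for lineno, line in enumerate(conllu_text.splitlines(), start=1):
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[3] == "_" and cols[4] == "_":
            hits.append((lineno, cols[1]))
    return hits
```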
ok, I have successfully parsed a file with just the pos tagging. Indeed, there are some tokens without UPOS. Actually, just one, i.e., the stupid " punctuation 🔝 I have the same error in the training data; I'll correct it and the error will likely go away. Many thanks again. May I comment that it is probably overkill to stop an entire parse for a blank UPOS? 🙌
> Many thanks again. May I comment that it is probably overkill to stop an entire parse for a blank UPOS? 🙌
Indeed. I just need to figure out what the right approach is. The two leading candidates in my mind are to stop the tagger from training if there are blank UPOS, so as to give the user a chance to go back and fix the issue, or to treat the blanks as unlabeled tokens in the tagger which don't get a label of any kind.
The second one is more appealing to me ideologically, but the problem is that in a case similar to yours where maybe all the punctuation was unlabeled, then they would all get tagged with the most likely known tag at test time (perhaps NOUN, for example).
If you have an alternate suggestion, happy to hear it.
I have corrected the dataset, retrained the model, and now the parser works fine. You might insert something in the dataset preparation process telling the user that they are training a model on 'wrong' data...
This error message is now part of the 1.8.2 release. Is there anything else you need addressed?
great! thank you, everything looks good!
Sorry for the double bug report. Can you please tell me the right procedure to load a model for a language that is not currently supported, i.e., Albanian (sq)? I have tried the following two things:
pipeline = stanza.Pipeline("sq", dir="DIR_TO_THE_MODEL", download_method=None)
It doesn't work:

2024-03-02 15:25:18 WARNING: Unsupported language: sq.
Traceback (most recent call last):
  File "/tools/ud-stanza-other.py", line 149, in <module>
    main()
  File "/tools/ud-stanza-other.py", line 105, in main
    nlp = stanza.Pipeline(**config, logging_level="DEBUG")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/stanza/pipeline/core.py", line 268, in __init__
    logger.info(f'Loading these models for language: {lang} ({lang_name}):\n{load_table}')
                                                             ^^^^^^^^^
UnboundLocalError: cannot access local variable 'lang_name' where it is not associated with a value
Processor-specific arguments are set with keys "{processor_name}_{argument_name}". But, again, it doesn't work:
2024-03-02 16:00:25 WARNING: Unsupported language: sq.
Traceback (most recent call last):
  File "/tools/ud-stanza-other.py", line 149, in <module>
    main()
  File "/tools/ud-stanza-other.py", line 105, in main
    nlp = stanza.Pipeline(**config, logging_level="DEBUG")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/stanza/pipeline/core.py", line 268, in __init__
    logger.info(f'Loading these models for language: {lang} ({lang_name}):\n{load_table}')
                                                             ^^^^^^^^^
UnboundLocalError: cannot access local variable 'lang_name' where it is not associated with a value
As a workaround, I have used the code of a supported language, but it's not ideal, as it might load other models...
Thanks!