Open rahonalab opened 4 months ago
Random request, this is really hard to read, please check the formatting next time on the stack traces
Try adding `allow_unknown_language=True` to the Pipeline construction:

pipeline = stanza.Pipeline("sq", dir="DIR_TO_THE_MODEL", download_method=None, allow_unknown_language=True)
many thanks AngledLuffa, now it works. And sorry about the awful formatting :(
unfortunately, the new option and/or the dev branch doesn't seem to work. If I load models using the config dictionary, I get the following:
2024-03-06 10:26:43 INFO: Using device: cuda
2024-03-06 10:26:43 INFO: Loading: tokenize
2024-03-06 10:26:43 DEBUG: With settings:
2024-03-06 10:26:43 DEBUG: {'model_path': '/corpus/saved_models/tokenize/sq_nel_tokenizer.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:00 DEBUG: Building Adam with lr=0.002000, betas=(0.9, 0.9), eps=0.000000, weight_decay=0.0
2024-03-06 10:27:01 INFO: Loading: mwt
2024-03-06 10:27:01 DEBUG: With settings:
2024-03-06 10:27:01 DEBUG: {'model_path': '/corpus/saved_models/mwt/sq_nel_mwt_expander.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:01 DEBUG: Building an attentional Seq2Seq model...
2024-03-06 10:27:01 DEBUG: Using a Bi-LSTM encoder
2024-03-06 10:27:01 DEBUG: Using soft attention for LSTM.
2024-03-06 10:27:01 DEBUG: Finetune all embeddings.
2024-03-06 10:27:01 DEBUG: Building Adam with lr=0.001000, betas=(0.9, 0.999), eps=0.000000
2024-03-06 10:27:01 INFO: Loading: pos
2024-03-06 10:27:01 DEBUG: With settings:
2024-03-06 10:27:01 DEBUG: {'model_path': '/corpus/saved_models/pos/sq_nel_nocharlm_tagger.pt', 'pretrain_path': '/corpus/saved_models/pretrain/fasttextwiki.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:01 DEBUG: Loading pretrain /corpus/saved_models/pretrain/fasttextwiki.pt
2024-03-06 10:27:02 DEBUG: Loaded pretrain from /corpus/saved_models/pretrain/fasttextwiki.pt
2024-03-06 10:27:03 DEBUG: Building Adam with lr=0.003000, betas=(0.9, 0.95), eps=0.000001
2024-03-06 10:27:03 INFO: Loading: lemma
2024-03-06 10:27:03 DEBUG: With settings:
2024-03-06 10:27:03 DEBUG: {'model_path': '/corpus/saved_models/lemma/sq_nel_nocharlm_lemmatizer.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:03 DEBUG: Building an attentional Seq2Seq model...
2024-03-06 10:27:03 DEBUG: Using a Bi-LSTM encoder
2024-03-06 10:27:03 DEBUG: Using soft attention for LSTM.
2024-03-06 10:27:03 DEBUG: Using POS in encoder
2024-03-06 10:27:03 DEBUG: Finetune all embeddings.
2024-03-06 10:27:03 DEBUG: Running seq2seq lemmatizer with edit classifier...
2024-03-06 10:27:03 DEBUG: Building Adam with lr=0.001000, betas=(0.9, 0.999), eps=0.000000
2024-03-06 10:27:03 INFO: Loading: depparse
2024-03-06 10:27:03 DEBUG: With settings:
2024-03-06 10:27:03 DEBUG: {'model_path': '/corpus/saved_models/depparse/sq_nel_nocharlm_parser_checkpoint.pt', 'pretrain_path': '/corpus/saved_models/pretrain/fasttextwiki.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:03 DEBUG: Reusing pretrain /corpus/saved_models/pretrain/fasttextwiki.pt
2024-03-06 10:27:04 DEBUG: Building Adam with lr=0.003000, betas=(0.9, 0.95), eps=0.000001
2024-03-06 10:27:05 INFO: Done loading processors!
Reading: /corpus/texts/100Years_Albanian.txt
Starting parser...
endminiciep+ string found
Parsing miniciep+
2024-03-06 10:27:19 DEBUG: 6 batches created.
2024-03-06 10:27:22 DEBUG: 450 batches created.
2024-03-06 10:27:22 DEBUG: 127 batches created.
Traceback (most recent call last):
  File "/tools/ud-stanza-ciep.py", line 119, in <module>
    main()
  File "/tools/ud-stanza-ciep.py", line 114, in main
    parseciep(nlp, file_content, filename, args.target, args.miniciep)
  File "/tools/parsing/stanza_parser.py", line 80, in parseciep
    miniciep = nlp(preparetext(splitciep[0]))
  File "/opt/conda/lib/python3.10/site-packages/stanza/pipeline/core.py", line 480, in __call__
    return self.process(doc, processors)
  File "/opt/conda/lib/python3.10/site-packages/stanza/pipeline/core.py", line 431, in process
    doc = process(doc)
  File "/opt/conda/lib/python3.10/site-packages/stanza/pipeline/depparse_processor.py", line 57, in process
    raise ValueError("POS not run before depparse!")
ValueError: POS not run before depparse!
but the pos processor is actually loaded!

bonus question: what is the difference between

depparse
├── sq_nel_nocharlm_parser_checkpoint.pt
└── sq_nel_nocharlm_parser.pt
The `_checkpoint` files include the most recent state of the optimizer, even if the dev scores of the latest model didn't go up and therefore the main save file wasn't updated. You'll notice that the non-checkpoint file is much smaller than the checkpoint file... that's the optimizer. You can restart a training run that got interrupted in the middle, although if it got interrupted while saving the checkpoint file, you're probably screwed (something we should address).
I can see that you're loading the POS model first before the depparse. Sanity check first - is the POS model labeling either upos or xpos? If somehow it was trained to only label the features, I could see it throwing this kind of error. Otherwise, the code makes it clear exactly when this particular error should happen - it only triggers if both upos and xpos are missing for a word.
if any(word.upos is None and word.xpos is None for sentence in document.sentences for word in sentence.words):
    raise ValueError("POS not run before depparse!")
If the POS model should be working, what happens if you run the pipeline without the depparse and print out the results? Are there any sentences for which the POS is actually missing?
I wonder if that can happen when the POS model has blank tags in the dataset it's learning from.
Many thanks for the detailed answer! This is really strange. I have tried to load the pipeline as I do in the script and it worked correctly on a few sentences. I have also tried to pass the script a small txt file with some sentences and it worked too. But then when I try to work on these txt files, as I did in the past, it throws the error. I assume there's something in these sentences, like an unknown word, that triggers the error - how can I circumvent it? The model I am using is highly experimental, so I expect that it misses a lot of things. But, again, this is strange: I have trained models on very small data in the past and they worked correctly on this dataset I am trying to parse.
> The model I am using is highly experimental, so I expect that it misses a lot of things
If it "misses" things by being incorrect, that's one thing. But I do very much wonder why it would label anything `None`.
Are you able to send the data + the data you are trying to test on, or maybe just send the model and the test data? I'd really like to see it in action myself to debug this issue.
Another possible debugging step would be to examine the output of just the tokenizer and the POS w/o any of the subsequent models and check for any words which are missing both xpos and upos.
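That debugging step can be scripted; here is a minimal sketch (my own, not from the thread) that scans the list-of-sentences structure in the general shape of stanza's `Document.to_dict()`, where each word is a dict and the `upos`/`xpos` keys may be absent or `None` for untagged words:

```python
def find_untagged(sentences):
    """Return (sentence_index, word_form) for every word that has neither
    UPOS nor XPOS -- the exact condition that makes the depparse processor
    raise "POS not run before depparse!".

    `sentences`: a list of sentences, each a list of word dicts, roughly
    the shape of stanza's Document.to_dict().
    """
    missing = []
    for idx, sentence in enumerate(sentences):
        for word in sentence:
            wid = word.get("id")
            # multiword token ranges (id is a pair) carry no tags by design
            if isinstance(wid, (tuple, list)):
                continue
            if word.get("upos") is None and word.get("xpos") is None:
                missing.append((idx, word.get("text")))
    return missing
```

Running the pipeline with `processors="tokenize,mwt,pos"` (so depparse never gets a chance to raise) and feeding the resulting `doc.to_dict()` to this helper should point straight at the offending words.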
Fascinating. I ran an experiment on English with DET/DT replaced with blanks. Apparently, giving the tagger empty tags for the POS tag results in it labeling words with None as tags. This must be what's happening to you - there are entries in your training data which don't have either UPOS or XPOS.
Is this something you want to fix on your end?
Maybe the tagger is supposed to ignore those items, or learn to tag them with `_`... not sure which would be more productive.

...to be more precise, it IS learning to tag words w/o tags with `_`, and then the pipeline itself treats that the same as a blank tag.
> Fascinating. I ran an experiment on English with DET/DT replaced with blanks. Apparently, giving the tagger empty tags for the POS tag results in it labeling words with None as tags. This must be what's happening to you - there are entries in your training data which don't have either UPOS or XPOS.
>
> Is this something you want to fix on your end?
Thing is, I have already used these data to train a model two or three times last November and it worked fine. I have just added a few sentences to teach the parser to recognize MWTs like Albanian ta = të + e. I'll try to run the parser without depparse and let you know...
It will successfully train a tagger even if there are empty tags. However, it's learned to recognize some words as having the empty tag, and that's the label the tagger gives those words. Did I express that clearly? I did the following experiment. Instead of sentences such as this in English, where `the` gets the tags `DET` and `DT`:
22 which which PRON WDT PronType=Rel 26 obj 20:ref _
23 they they PRON PRP Case=Nom|Number=Plur|Person=3|PronType=Prs 26 nsubj 26:nsubj _
24 should should AUX MD VerbForm=Fin 26 aux 26:aux _
25 have have AUX VB VerbForm=Inf 26 aux 26:aux _
26 left leave VERB VBN Tense=Past|VerbForm=Part 20 acl:relcl 20:acl:relcl _
27 in in ADP IN _ 29 case 29:case _
28 the the DET DT Definite=Def|PronType=Art 29 det 29:det _
29 car car NOUN NN Number=Sing 26 obl 26:obl:in SpaceAfter=No
I changed all instances of `the` to have `_` tags, so:
22 which which PRON WDT PronType=Rel 26 obj 20:ref _
23 they they PRON PRP Case=Nom|Number=Plur|Person=3|PronType=Prs 26 nsubj 26:nsubj _
24 should should AUX MD VerbForm=Fin 26 aux 26:aux _
25 have have AUX VB VerbForm=Inf 26 aux 26:aux _
26 left leave VERB VBN Tense=Past|VerbForm=Part 20 acl:relcl 20:acl:relcl _
27 in in ADP IN _ 29 case 29:case _
28 the the _ _ Definite=Def|PronType=Art 29 det 29:det _
29 car car NOUN NN Number=Sing 26 obl 26:obl:in SpaceAfter=No
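The substitution above can be reproduced with a few lines of Python (a sketch of my own, not the script actually used for the experiment); it blanks out columns 4 and 5 (UPOS, XPOS) of any CoNLL-U word line whose FORM matches:

```python
def blank_out_form(conllu_text, form):
    """Set UPOS and XPOS to '_' on every CoNLL-U word line whose FORM
    (column 2) equals `form`; comments, multiword-token ranges, and
    other lines pass through unchanged."""
    out = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # word lines have 10 tab-separated columns and a plain integer ID
        if len(cols) == 10 and cols[0].isdigit() and cols[1] == form:
            cols[3] = "_"  # UPOS
            cols[4] = "_"  # XPOS
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)
```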
Now the tagger I trained labels `the` with blank tags, which would trigger this error in the dependency parser, since it isn't expecting to receive blank tags.
I think it might make more sense to either throw an error when training a tagger on a partially complete file, or possibly treat single blank tags as masked out. Learning to recognize the blank tag doesn't seem very useful...
In the meantime, if you find and eliminate those blank tags from your dataset, I believe this error will go away.
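Hunting those blank tags down can be automated. A minimal sketch (mine, with a hypothetical function name) that reports every CoNLL-U word line where both tag columns are `_`:

```python
def blank_tag_lines(conllu_text):
    """Return (line_number, form) for each CoNLL-U word line whose UPOS and
    XPOS columns are both '_'. Comments, multiword-token ranges (IDs like
    "1-2"), and empty nodes (IDs like "1.1") are ignored, since those
    legitimately carry no tags."""
    hits = []
    for lineno, line in enumerate(conllu_text.splitlines(), start=1):
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[3] == "_" and cols[4] == "_":
            hits.append((lineno, cols[1]))
    return hits
```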
ok, I have successfully parsed a file with just the pos tagging. Indeed, there are some tokens without UPOS. Actually, just one, i.e., the stupid " punctuation 🔝 I have the same error in the training data; I'll correct it and the error will likely go away. Many thanks again. May I comment that it is probably overkill to stop an entire parse for a blank UPOS? 🙌
> Many thanks again. May I comment that it is probably overkill to stop an entire parse for a blank UPOS? 🙌
Indeed. I just need to figure out what the right approach is. The two leading candidates in my mind are to stop the tagger from training if there are blank UPOS, so as to give the user a chance to go back and fix the issue, or to treat the blanks as unlabeled tokens in the tagger which don't get a label of any kind.
The second one is more appealing to me ideologically, but the problem is that in a case similar to yours where maybe all the punctuation was unlabeled, then they would all get tagged with the most likely known tag at test time (perhaps NOUN, for example).
If you have an alternate suggestion, happy to hear it.
I have corrected the dataset, retrained the model, and now the parser works fine. You might insert something in the dataset preparation process telling the user that they are training a model on 'wrong' data...
This error message is now part of the 1.8.2 release. Is there anything else you need addressed?
great! thank you, everything looks good!
Sorry for the double bug report. Can you please tell me the right procedure to load a model for a language that is not currently supported, i.e., Albanian (sq)? I have tried the following two things:
pipeline = stanza.Pipeline("sq", dir="DIR_TO_THE_MODEL", download_method=None)
It doesn't work:

2024-03-02 15:25:18 WARNING: Unsupported language: sq.
Traceback (most recent call last):
  File "/tools/ud-stanza-other.py", line 149, in <module>
    main()
  File "/tools/ud-stanza-other.py", line 105, in main
    nlp = stanza.Pipeline(**config, logging_level="DEBUG")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/stanza/pipeline/core.py", line 268, in __init__
    logger.info(f'Loading these models for language: {lang} ({lang_name}):\n{load_table}')
                                                             ^^^^^^^^^
UnboundLocalError: cannot access local variable 'lang_name' where it is not associated with a value
Processor-specific arguments are set with keys "{processor_name}_{argument_name}". But, again, it doesn't work:
2024-03-02 16:00:25 WARNING: Unsupported language: sq.
Traceback (most recent call last):
  File "/tools/ud-stanza-other.py", line 149, in <module>
    main()
  File "/tools/ud-stanza-other.py", line 105, in main
    nlp = stanza.Pipeline(**config, logging_level="DEBUG")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/stanza/pipeline/core.py", line 268, in __init__
    logger.info(f'Loading these models for language: {lang} ({lang_name}):\n{load_table}')
                                                             ^^^^^^^^^
UnboundLocalError: cannot access local variable 'lang_name' where it is not associated with a value
As a workaround, I have used the code of a supported language, but it's not ideal, as it might load other models...
Thanks!