stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

[QUESTION] How to use my own POS model when training a constituency model? #1356

Open ingunnjk opened 4 months ago

ingunnjk commented 4 months ago

I am working on adding a constituency model for Icelandic. I used the constituency treebank I have for training a POS tagger but how do I use it when training the constituency model? The instructions say this: "To change to a specific model (such as if you build one yourself) use the --retag_model_path command line flag." but when I try to run this: "python -m stanza.utils.training.run_constituency is_icepahc --retag_model_path saved_models/pos/is_icepahc_nocharlm_tagger.pt" it still just uses the default pos tagger for Icelandic (which I don't want to use). Do I need to use some more flags, other than --retag_model_path (for example --retag_package?), to make sure it uses my model?

Here is what I get when I only use the flag --retag_model_path saved_models/pos/is_icepahc_nocharlm_tagger.pt:

...
retag_method: xpos
retag_model_path: saved_models/pos/is_icepahc_nocharlm_tagger.pt
retag_package: default
retag_pretrain_path: None
retag_xpos: True
...

And:

2024-02-29 20:46:55 INFO: Reading trees from /stanza/constituency/data/icelandic/processed_data/is_icepahc_train.mrg
2024-02-29 20:47:16 INFO: Read 58394 trees for the training set
2024-02-29 20:47:18 INFO: Filtered 512 duplicates from train dataset
2024-02-29 20:47:18 INFO: Eliminated 3 trees with missing structure
2024-02-29 20:47:18 INFO: Reading trees from /stanza/constituency/data/icelandic/processed_data/is_icepahc_dev.mrg
2024-02-29 20:47:19 INFO: Read 7299 trees for the dev set
2024-02-29 20:47:20 INFO: Filtered 24 duplicates from dev dataset
2024-02-29 20:47:20 INFO: Retagging trees using the xpos tags from the default package...

(i.e. not using my model.. and then the training fails after retagging because this default pos tagger is not compatible with my data)

AngledLuffa commented 4 months ago

I actually think that's a case of the logging not correctly reflecting the reality. I added some log lines to the parser which should tell you which model it loaded for the retagging. You can try it using the dev branch, if you're comfortable with that. (I will need to make a new release within a day to fix some other issues, anyway.)
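For anyone checking this on their own run: in the dev-branch log posted further down this thread, the added line reads "DEBUG: Creating retag pipeline using <path>". Here is a minimal sketch for pulling that path out of a saved log. The function name and the sample text are illustrative only, and the regular expression simply assumes the wording seen in that log; it is not part of Stanza.

```python
import re

# Wording copied from the dev-branch log in this thread; other versions may differ.
RETAG_LINE = re.compile(r"DEBUG: Creating retag pipeline using (?P<path>\S+)")


def find_retag_model(log_text):
    """Return the tagger path named in the retag DEBUG line, or None if absent."""
    match = RETAG_LINE.search(log_text)
    return match.group("path") if match else None


# A few lines in the shape of the log posted in this thread.
sample_log = """\
2024-03-05 13:56:43 DEBUG: Set trainer logging level to DEBUG
2024-03-05 13:56:43 DEBUG: Creating retag pipeline using /Users/ingunnkristjansdottir/saved_models/pos/is_icepahc_nocharlm_tagger.pt
2024-03-05 13:57:24 INFO: Retagging trees using the xpos tags from the default package...
"""

print(find_retag_model(sample_log))
```

If the path printed is the custom tagger, the later "from the default package" INFO line is only a misleading message, which matches the diagnosis above.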

Beyond that, I think there's probably something else going wrong... would you post the entire log message?

Kudos for going ahead with the Icelandic parser. I knew that dataset existed, but had not tried to build a model yet.

AngledLuffa commented 4 months ago

It's also possible that the failure you saw was related to the same error found here: https://github.com/stanfordnlp/stanza/issues/1357

If you could verify if the dev branch fixes your problem, that would be great. If using the dev branch is difficult, posting the stack trace for the error you ran into would also help. I have to make a new release with a fix for issue 1357, and if the existing fix doesn't also address your problem, I can try to fix that as well.

AngledLuffa commented 4 months ago

Were you able to make progress on this with the updated version?

ingunnjk commented 4 months ago

Hi, sorry for the late answer! I tried using the dev branch and that didn't seem to change much, unfortunately. Here is the entire log message:

2024-03-05 13:56:43 INFO: Training program called with:
/Users/ingunnkristjansdottir/stanza/stanza/utils/training/run_constituency.py is_icepahc --retag_model_path /Users/ingunnkristjansdottir/saved_models/pos/is_icepahc_nocharlm_tagger.pt
2024-03-05 13:56:43 INFO: Using default pretrain for language, found in /Users/ingunnkristjansdottir/stanza_resources/is/pretrain/fasttext157.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-05 13:56:43 WARNING: Multistage training is set.  Best models are with MADGRAD, but it is not installed.  Will use AdamW for the second stage optimizer.  Consider installing MADGRAD
2024-03-05 13:56:43 INFO: Expanded save_name: is_icepahc_nocharlm_constituency.pt
2024-03-05 13:56:43 INFO: Expanded save_name: saved_models/constituency/is_icepahc_nocharlm_constituency.pt
2024-03-05 13:56:43 INFO: is_icepahc: saved_models/constituency/is_icepahc_nocharlm_constituency.pt does not exist, training new model
2024-03-05 13:56:43 INFO: Using default pretrain for language, found in /Users/ingunnkristjansdottir/stanza_resources/is/pretrain/fasttext157.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-05 13:56:43 INFO: Running train step with args: ['--train_file', '/Users/ingunnkristjansdottir/constituency/data/icelandic/processed_data/is_icepahc_train.mrg', '--eval_file', '/Users/ingunnkristjansdottir/constituency/data/icelandic/processed_data/is_icepahc_dev.mrg', '--shorthand', 'is_icepahc', '--mode', 'train', '--wordvec_pretrain_file', '/Users/ingunnkristjansdottir/stanza_resources/is/pretrain/fasttext157.pt', '--retag_model_path', '/Users/ingunnkristjansdottir/saved_models/pos/is_icepahc_nocharlm_tagger.pt']
2024-03-05 13:56:43 WARNING: Multistage training is set.  Best models are with MADGRAD, but it is not installed.  Will use AdamW for the second stage optimizer.  Consider installing MADGRAD
2024-03-05 13:56:43 INFO: Expanded save_name: is_icepahc_nocharlm_constituency.pt
2024-03-05 13:56:43 INFO: Running constituency parser in train mode
2024-03-05 13:56:43 DEBUG: Set trainer logging level to DEBUG
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 373kB [00:00, 51.3MB/s]
2024-03-05 13:56:43 INFO: Downloaded file to /Users/ingunnkristjansdottir/stanza_resources/resources.json
2024-03-05 13:56:43 DEBUG: Creating retag pipeline using /Users/ingunnkristjansdottir/saved_models/pos/is_icepahc_nocharlm_tagger.pt
2024-03-05 13:56:45 INFO: Loading these models for language: is (Icelandic):
=======================================
| Processor | Package                 |
---------------------------------------
| tokenize  | icepahc                 |
| pos       | /Users/ing..._tagger.pt |
=======================================

2024-03-05 13:56:45 INFO: Using device: cpu
2024-03-05 13:56:45 INFO: Loading: tokenize
2024-03-05 13:56:45 INFO: Loading: pos
2024-03-05 13:56:45 INFO: Done loading processors!
2024-03-05 13:56:45 INFO: ARGS USED AT TRAINING TIME:
additional_oracle_levels: None
bert_finetune: False
bert_finetune_begin_epoch: None
bert_finetune_end_epoch: None
bert_finetune_layers: None
bert_hidden_layers: 4
bert_learning_rate: 0.009
bert_model: None
bert_weight_decay: 0.0001
charlm_backward_file: None
charlm_forward_file: None
check_valid_states: True
checkpoint: True
checkpoint_save_name: saved_models/constituency/is_icepahc_nocharlm_constituency_checkpoint.pt
combined_dummy_embedding: True
constituency_composition: ConstituencyComposition.MAX
constituent_heads: 8
constituent_stack: StackHistory.LSTM
data_dir: data/constituency
delta_embedding_dim: 100
device: cpu
epoch_size: 5000
epochs: 400
eval_batch_size: 50
eval_file: /Users/ingunnkristjansdottir/constituency/data/icelandic/processed_data/is_icepahc_dev.mrg
finetune: False
grad_clipping: None
hidden_size: 512
lang: is
lattn_attention_dropout: 0.2
lattn_combine_as_self: False
lattn_combined_input: True
lattn_d_ff: 2048
lattn_d_input_proj: None
lattn_d_kv: 64
lattn_d_l: 32
lattn_d_proj: 64
lattn_partitioned: True
lattn_pwff: True
lattn_q_as_matrix: False
lattn_relu_dropout: 0.2
lattn_resdrop: True
lattn_residual_dropout: 0.2
learning_beta2: 0.999
learning_eps: 1e-08
learning_momentum: None
learning_rate: 0.0002
learning_rate_cooldown: 10
learning_rate_factor: 0.6
learning_rate_min_lr: 4.000000000000001e-06
learning_rate_patience: 5
learning_rate_warmup: 0
learning_rho: 0.9
learning_weight_decay: 0.05
load_name: None
load_package: None
log_norms: False
log_shapes: False
lora_alpha: 128
lora_dropout: 0.1
lora_modules_to_save: []
lora_rank: 64
lora_target_modules: ['query', 'value', 'output.dense', 'intermediate.dense']
loss: cross
loss_focal_gamma: 2
lstm_input_dropout: 0.2
lstm_layer_dropout: 0.0
maxout_k: None
mode: train
multistage: True
nonlinearity: relu
num_generate: 0
num_lstm_layers: 2
num_output_layers: 3
num_tree_lstm_layers: 1
optim: adamw
oracle_forced_errors: 0.001
oracle_frequency: 0.8
oracle_initial_epoch: 1
oracle_level: None
pattn_attention_dropout: 0.2
pattn_bias: False
pattn_d_ff: 2048
pattn_d_kv: 64
pattn_d_model: 1024
pattn_encoder_max_len: 512
pattn_morpho_emb_dropout: 0.2
pattn_num_heads: 8
pattn_num_layers: 0
pattn_relu_dropout: 0.1
pattn_residual_dropout: 0.2
pattn_timing: sin
predict_dir: .
predict_dropout: 0.2
predict_file: None
predict_format: {:_O}
pretrain_max_vocab: 250000
rare_word_threshold: 0.02
rare_word_unknown_frequency: 0.02
reduce_heads: 8
reduce_position: 128
relearn_structure: False
retag_charlm_backward_file: None
retag_charlm_forward_file: None
retag_method: xpos
retag_model_path: /Users/ingunnkristjansdottir/saved_models/pos/is_icepahc_nocharlm_tagger.pt
retag_package: default
retag_pretrain_path: None
retag_xpos: True
reversed: False
save_dir: saved_models/constituency
save_each_frequency: 1
save_each_name: saved_models/constituency/is_icepahc_nocharlm_constituency_%04d.pt
save_each_optimizer: True
save_each_start: None
save_name: saved_models/constituency/is_icepahc_nocharlm_constituency.pt
seed: 1234
sentence_boundary_vectors: SentenceBoundary.EVERYTHING
shorthand: is_icepahc
silver_epoch_size: None
silver_file: None
silver_remove_duplicates: False
stage1_bert_finetune: False
stage1_bert_learning_rate: 0.009
stage1_learning_rate: 1.0
stage1_learning_rate_min_lr: 0.02
tag_embedding_dim: 20
tag_unknown_frequency: 0.001
tokenized_dir: None
tokenized_file: None
train_batch_size: 30
train_file: /Users/ingunnkristjansdottir/constituency/data/icelandic/processed_data/is_icepahc_train.mrg
transition_embedding_dim: 20
transition_heads: 4
transition_hidden_size: 20
transition_scheme: TransitionScheme.IN_ORDER
transition_stack: StackHistory.LSTM
use_lattn: False
use_peft: False
use_silver_words: True
wandb: False
wandb_name: None
wandb_norm_regex: None
watch_regex: None
word_dropout: 0.2
wordvec_dir: extern_data/wordvec
wordvec_file: 
wordvec_pretrain_file: /Users/ingunnkristjansdottir/stanza_resources/is/pretrain/fasttext157.pt

2024-03-05 13:56:45 INFO: Reading trees from /Users/ingunnkristjansdottir/constituency/data/icelandic/processed_data/is_icepahc_train.mrg
100%|███████████████████████████████████| 58392/58392 [00:14<00:00, 4067.72it/s]
2024-03-05 13:57:17 INFO: Read 58392 trees for the training set
2024-03-05 13:57:21 INFO: Filtered 512 duplicates from train dataset
2024-03-05 13:57:21 INFO: Eliminated 3 trees with missing structure
2024-03-05 13:57:21 INFO: Reading trees from /Users/ingunnkristjansdottir/constituency/data/icelandic/processed_data/is_icepahc_dev.mrg
100%|█████████████████████████████████████| 7300/7300 [00:00<00:00, 7332.35it/s]
2024-03-05 13:57:23 INFO: Read 7300 trees for the dev set
2024-03-05 13:57:24 INFO: Filtered 24 duplicates from dev dataset
2024-03-05 13:57:24 INFO: Retagging trees using the xpos tags from the default package...
100%|████████████████████████████████████| 57877/57877 [08:03<00:00, 119.65it/s]
100%|██████████████████████████████████████| 7276/7276 [00:50<00:00, 143.16it/s]
2024-03-05 14:06:19 INFO: Retagging finished
2024-03-05 14:06:21 INFO: Unique constituents in training set: ['ADJ', 'ADJP', 'ADJP*OC', 'ADP', 'ADV', 'ADVP', 'ADVP*RMP', 'CONJ', 'CONJP', 'CONJP*PP', 'CP', 'CP*AUE*ADV', 'CP*THT*NaN', 'CP*THT1', 'DET', 'FOREIGN', 'FRAG', 'FS', 'IMP*IMP', 'INTJP', 'IP', 'IP*IMP*SBJ', 'IP*INF', 'IP*MAT*KOMINN', 'IP*MAT*SENT*BEFORE', 'IP*MAT*SMC', 'IP*MAT*SUB', 'IP*OB1', 'IP*OB2', 'IP*PRD', 'IP*SBJ', 'IP*SUB', 'IP*SUB*INF', 'IP*SUB3', 'META', 'NOUN', 'NP', 'NP*AB1', 'NP*LLL', 'NP*NPR', 'NP*NUM', 'NP*OB1', 'NP*OB2', 'NP*PRD', 'NP*SBJ', 'NP*SMC', 'NS', 'NUM', 'NUMP', 'PP', 'PRT', 'Q+NUM*A', 'Q+NUM*N', 'QP', 'QTP', 'REF', 'ROOT', 'ROOTTOP', 'RRC', 'TRANSLATION', 'VERB', 'VP', 'WADJP', 'WADVP', 'WADVP*NaN', 'WNP', 'WNP*COM', 'WNP*POS', 'WPP', 'WQP', 'X', 'XP']
Traceback (most recent call last):
  File "/Users/ingunnkristjansdottir/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/ingunnkristjansdottir/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/ingunnkristjansdottir/stanza/stanza/utils/training/run_constituency.py", line 124, in <module>
    main()
  File "/Users/ingunnkristjansdottir/stanza/stanza/utils/training/run_constituency.py", line 121, in main
    common.main(run_treebank, "constituency", "constituency", add_constituency_args, sub_argparse=constituency_parser.build_argparse(), build_model_filename=build_model_filename)
  File "/Users/ingunnkristjansdottir/stanza/stanza/utils/training/common.py", line 183, in main
    run_treebank(mode, paths, treebank, short_name,
  File "/Users/ingunnkristjansdottir/stanza/stanza/utils/training/run_constituency.py", line 95, in run_treebank
    constituency_parser.main(train_args)
  File "/Users/ingunnkristjansdottir/stanza/stanza/models/constituency_parser.py", line 851, in main
    trainer.train(args, model_load_file, retag_pipeline)
  File "/Users/ingunnkristjansdottir/stanza/stanza/models/constituency/trainer.py", line 703, in train
    trainer, train_sequences, silver_sequences, train_transitions = build_trainer(args, train_trees, dev_trees, silver_trees, foundation_cache, model_load_file)
  File "/Users/ingunnkristjansdottir/stanza/stanza/models/constituency/trainer.py", line 486, in build_trainer
    check_constituents(train_constituents, dev_trees, "dev")
  File "/Users/ingunnkristjansdottir/stanza/stanza/models/constituency/trainer.py", line 469, in check_constituents
    raise RuntimeError("Found label {} in the {} set which don't exist in the train set".format(con, treebank_name))
RuntimeError: Found label PRON in the dev set which don't exist in the train set

It's still using the xpos tags from the default package which are not compatible with the data I am trying to train on.

AngledLuffa commented 4 months ago

That's actually not complaining about a POS tag, but rather a constituent tag. I will update the error to make it more clear. You can either check in the treebank for a tree with such a typo, or you can give it the --no_check_valid_states flag to make it avoid checking for that entirely. We don't know how frequently that happens, or if it's a result of an error of some sort in the retagging pipeline, so it is preferred to check in the dataset for any possible errors.
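To look for the offending tree by hand, something along these lines may help. It is a rough sketch, not the check Stanza itself runs: it assumes one bracketed tree per string, treats any node that dominates another bracketed node as a constituent, and reports which dev trees use a constituent label never seen in train. The helper names and the toy trees are made up for illustration.

```python
def tokenize(text):
    """Split a bracketed tree string into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def constituent_labels(tree_text):
    """Collect labels of nodes that dominate at least one bracketed node.

    Preterminals such as (NOUN hestur) only dominate a word, so they are skipped.
    """
    labels = set()
    stack = []  # each entry is [label, has_bracketed_child]
    for tok in tokenize(tree_text):
        if tok == "(":
            if stack:
                stack[-1][1] = True
            stack.append([None, False])
        elif tok == ")":
            label, has_child = stack.pop()
            if has_child and label is not None:
                labels.add(label)
        elif stack and stack[-1][0] is None and not stack[-1][1]:
            stack[-1][0] = tok  # first symbol after '(' is the node label
        # any other symbol is a word under a preterminal and is ignored
    return labels


def unknown_constituents(train_trees, dev_trees):
    """Map each dev-only constituent label to the indices of the dev trees using it."""
    train_labels = set()
    for tree in train_trees:
        train_labels |= constituent_labels(tree)
    unknown = {}
    for index, tree in enumerate(dev_trees):
        for label in constituent_labels(tree) - train_labels:
            unknown.setdefault(label, []).append(index)
    return unknown


# Toy trees: the dev tree has PRON wrapped around another node, so it counts as a constituent.
example_train = ["(ROOT (IP (NP (NOUN hestur)) (VP (VERB hleypur))))"]
example_dev = ["(ROOT (IP (PRON (NOUN hann)) (VP (VERB hleypur))))"]

print(unknown_constituents(example_train, example_dev))
```

A POS-like label such as PRON showing up here usually means a tag node has a bracketed child in that tree, which is the kind of typo worth fixing in the treebank rather than skipping the check.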

I suppose it might be useful to have it report which tree is causing the error... I'm short on time now, but I can probably do that tonight. Are you comfortable using the dev branch? If not, I can put it on testpypi or something. Either way, we should make it so that it's easier to diagnose this problem.

One other thing I notice is that the treebank is using * as the separator, so the parser will try to predict ADJP*OC as a different constituent from ADJP. This will probably noticeably affect its accuracy and its scoring. In general we've been cutting off those functional tags. I can add * to the list of functional tags to cut off, unless you can think of some reason not to do that.
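To make the effect concrete, here is a small sketch of cutting off functional tags on labels from the training log above. It is not Stanza's implementation: the separator set ('-' and '=' as in Penn Treebank style labels, plus '*' for this treebank) and the function name are assumptions for illustration.

```python
import re

# Separators assumed to introduce a functional tag in this sketch.
FUNCTION_SEPARATORS = re.compile(r"[-=*]")


def strip_functional_tag(label):
    """Return the base category of a label, e.g. 'NP*SBJ' -> 'NP'."""
    # Labels such as -NONE- or -LRB- start with a separator; leave them untouched.
    if label and label[0] in "-=*":
        return label
    return FUNCTION_SEPARATORS.split(label, maxsplit=1)[0]


# A few constituent labels taken from the log in this thread.
labels = ["ADJP", "ADJP*OC", "IP", "IP*MAT*SMC", "NP*SBJ", "NP*OB1", "Q+NUM*A"]
base_labels = sorted({strip_functional_tag(label) for label in labels})
print(base_labels)
```

With the tags cut off, the seven labels above collapse to four base categories, so the parser has far fewer rare constituents to predict.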

AngledLuffa commented 4 months ago

Alright, I added what is hopefully a very thorough message for when the constituent checker fails. Please let us know what information it gives you. If using the dev branch of Stanza isn't in your wheelhouse, I'll put a snapshot on testpypi (I don't think this quite qualifies for a version 1.8.2)

ingunnjk commented 4 months ago

Great, thanks! The thorough message for when the constituents checker fails helped me figure out what the problem was and fix it! And yes, it would probably be a good idea to add * to the list of functional tags to cut off.

AngledLuffa commented 4 months ago

Glad to hear that helped! I added * to the list of functional tags in the dev branch. I don't believe any of the treebanks we currently use have * as a meaningful part of a tag.

One thing to try for improved accuracy: the overall model will have much higher accuracy with a transformer. I know IceBERT and ScandiBert are a couple of possible options. You can download them from HF using the --bert_model flag to the constituency parser.

You can also finetune those transformers specifically for the constituency task. That uses quite a lot of disk space and GPU memory, of course. The best settings I found so far are in the flag --stage1_bert_finetune

Last random comment for now - I improved the TOP_DOWN dynamic oracle quite a bit in the last month or so, and although I haven't made it the default yet, I find that it's actually more accurate than the default IN_ORDER transition scheme. You can try that with --transition_scheme TOP_DOWN or --transition_scheme IN_ORDER

ingunnjk commented 3 months ago

Hi again and thank you for the help! I have now successfully trained a constituency model for Icelandic (it gets 90.63 on the test set, which is currently the best score for an Icelandic constituency parser). Would it be possible to add it to Stanza?

AngledLuffa commented 3 months ago

Yes, that would be great! In general, if there were any code changes to convert the original Icelandic annotations to the format usable by the parser, the first step would be a PR which adds that script to stanza/utils/datasets/constituency

After that, I know there are a couple different transformers on HF which include Icelandic. If you've tried with those models, and have some notes on which ones give which scores, that would also be helpful. We haven't integrated those yet into the IS pipelines: stanza/resources/default_packages.py

Did you mention having your own POS tagging? That might be relevant, if you think the POS tagger is more useful than one built from the IS Universal Dependencies treebanks

ingunnjk commented 2 months ago

Great, I will start working on that in the next few days! I did not have my own POS tagging in the end and I used IceBERT from HF. I got the best results with the flags --bert_model mideind/IceBERT --stage1_bert_finetune --transition_scheme TOP_DOWN.
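For reference, that combination can be assembled into a full command like this. It is only a sketch: the module name, treebank shorthand and flags are the ones quoted in this thread, while the helper function and its parameters are hypothetical. It builds and prints the command without running it.

```python
import shlex
import sys


def build_train_command(treebank, bert_model=None, stage1_bert_finetune=False,
                        transition_scheme=None):
    """Build the argument list for run_constituency using flags named in this thread."""
    cmd = [sys.executable, "-m", "stanza.utils.training.run_constituency", treebank]
    if bert_model:
        cmd += ["--bert_model", bert_model]
    if stage1_bert_finetune:
        cmd.append("--stage1_bert_finetune")
    if transition_scheme:
        cmd += ["--transition_scheme", transition_scheme]
    return cmd


cmd = build_train_command(
    "is_icepahc",
    bert_model="mideind/IceBERT",
    stage1_bert_finetune=True,
    transition_scheme="TOP_DOWN",
)
# Print the command for inspection; pass cmd to subprocess.run to actually train.
print(shlex.join(cmd))
```

Keeping the flags in one place makes it easier to rerun the same setup with a different transformer and compare scores.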

Sentences in the IcePaHC treebank are divided into matrix clauses and the previous parsing pipelines for Icelandic text that have been trained on IcePaHC (https://office.clarin.eu/v/CE-2020-1738-CLARIN2020_ConferenceProceedings.pdf pages 48-51 and https://office.clarin.eu/v/CE-2019-1512_CLARIN2019_ConferenceProceedings.pdf pages 138-141) do matrix clause boundary detection before the parsing. They use this tool for the matrix clause boundary detection: https://github.com/antonkarl/iceParsingPipeline/tree/4d8e65958e7ebc9d28ab463ba27ffcbb895e6f1c/tools/splitter. I was wondering if it would be possible to add this to the Stanza pipeline for Icelandic text so that users don't have to run the splitter on their input text themselves before parsing it with Stanza?

AngledLuffa commented 2 months ago

It's probably easier to have used the UD POS tags! I have also found that the TOP_DOWN model is working better with the upgraded dynamic oracle. Perhaps I should revisit the IN_ORDER oracle to see if it can be improved.

> matrix clauses

Hmm, that's effectively a constraint on the parse structure built by the model, right? I haven't implemented that in the constituency parser at all. It wouldn't happen any time soon, either. Another possibility would be to use the parser itself to extract those clauses, or to try doing that and see how accurate it is compared to gold annotations or the splitting tool.