stanfordnlp / stanza-train

Model training tutorials for the Stanza Python NLP Library
https://stanfordnlp.github.io/stanza/

Training POS with word vectors #18

Closed dmetola closed 8 months ago

dmetola commented 9 months ago

Hi,

I am in the process of training the POS model, and I think I have the correct word vectors and word embeddings for this task.

Currently, I have:

What I'm struggling with is where to place the .pt file. I have followed the same directory structure in stanza_resources as for English, and the .pt file is located in the pretrain subfolder of that directory.
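For reference, the layout the training scripts look in (matching the pretrain path that appears in the log below) is roughly:

```
stanza_resources/
└── ang/
    └── pretrain/
        └── ang_embeddings.pt
```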

When I run this command

python3 -m stanza.utils.training.run_pos UD_Old_English-TEST --max_steps 5000

I'm getting the following error


/Users/dario/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
2024-01-30 14:26:59 INFO: Training program called with:
/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py UD_Old_English-TEST --max_steps 5000
2024-01-30 14:26:59 DEBUG: UD_Old_English-TEST: ang_test
2024-01-30 14:26:59 INFO: Using pretrain found in /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-01-30 14:26:59 INFO: UD_Old_English-TEST: saved_models/pos/ang_test_nocharlm_tagger.pt does not exist, training new model
2024-01-30 14:26:59 INFO: Using pretrain found in /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-01-30 14:26:59 INFO: Running train POS for UD_Old_English-TEST with args ['--wordvec_dir', '../data/wordvec', '--train_file', '../data/processed/pos/ang_test.train.in.conllu', '--output_file', '/var/folders/4c/zqzwjpjn42xbggh1sw1255nh0000gn/T/tmpj0p055j0', '--lang', 'ang', '--shorthand', 'ang_test', '--mode', 'train', '--eval_file', '../data/processed/pos/ang_test.dev.in.conllu', '--wordvec_pretrain_file', '/Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt', '--max_steps', '5000']
2024-01-30 14:26:59 INFO: Running tagger in train mode
2024-01-30 14:26:59 INFO: Loading data with batch size 250...
2024-01-30 14:26:59 INFO: Reading ../data/processed/pos/ang_test.train.in.conllu
2024-01-30 14:27:00 INFO: Train File ../data/processed/pos/ang_test.train.in.conllu, Data Size: 1287
2024-01-30 14:27:00 WARNING: ang_test is not a known dataset.  Examining the data to choose which xpos vocab to use
2024-01-30 14:27:00 INFO: Original length = 1287
2024-01-30 14:27:00 INFO: Filtered length = 1287
2024-01-30 14:27:00 WARNING: Chose XPOSDescription(xpos_type=<XPOSType.WORD: 2>, sep=None) for the xpos factory for ang_test
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py", line 135, in <module>
    main()
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py", line 132, in main
    common.main(run_treebank, "pos", "tagger", add_pos_args, tagger.build_argparse(), build_model_filename=build_model_filename, choose_charlm_method=choose_pos_charlm)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/common.py", line 183, in main
    run_treebank(mode, paths, treebank, short_name,
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py", line 96, in run_treebank
    tagger.main(train_args)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/tagger.py", line 143, in main
    train(args)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/tagger.py", line 234, in train
    vocab, train_data, train_batches = load_training_data(args, pretrain)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/tagger.py", line 191, in load_training_data
    vocab = Dataset.init_vocab(train_docs, args)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/pos/data.py", line 70, in init_vocab
    featsvocab = FeatureVocab(data, args['shorthand'], idx=3)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/pos/vocab.py", line 42, in __init__
    super().__init__(data, lang, idx=idx, sep=sep, keyed=keyed)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/common/vocab.py", line 111, in __init__
    super().__init__(data, lang, idx=idx)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/common/vocab.py", line 28, in __init__
    self.build_vocab()
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/common/vocab.py", line 166, in build_vocab
    parts = self.unit2parts(u)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/common/vocab.py", line 125, in unit2parts
    raise ValueError('Received "%s" for a dictionary which is supposed to be keyed, eg the entries should all be of the form key=value and separated by %s' % (unit, self.sep))
ValueError: Received "passivevoice" for a dictionary which is supposed to be keyed, eg the entries should all be of the form key=value and separated by |

My questions about that issue are the following:

In what file should I be checking that error?
Is the .pt file in the correct directory? Should I store it in further subfolders when I start training the depparser and ner?

Thanks! 
AngledLuffa commented 9 months ago

This isn't a problem that will show up in the .pt file, I would say.  It's in the raw data, and you can search for it somewhere in data/pos/ang_test.train... Something in the features is in the wrong format.

File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/pos/data.py", line 70, in init_vocab
    featsvocab = FeatureVocab(data, args['shorthand'], idx=3)

I agree the error is extremely opaque, though.  I could easily change it to tell you what sentence number to look in, but I'm not sure how much that would help. What would actually help would be the line number, but that's kind of lost by that step of processing the training data.

Basically there's a POS feature which is just passivevoice instead of the proper UD format of Voice=Pass. That'll be in the 6th column of your conllu training file.
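A quick way to locate entries like that is to scan the FEATS column for values that aren't `key=value` pairs separated by `|`. A minimal sketch (the sample lines are illustrative, not from the actual treebank):

```python
# Sketch: report CoNLL-U tokens whose FEATS column is not in key=value form.

def find_bad_feats(lines):
    """Yield (line_number, feats) for FEATS entries that are not '_'
    and contain a feature without the key=value form."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) < 10:
            continue  # not a token line
        feats = cols[5]  # FEATS is the 6th column (0-indexed: 5)
        if feats == "_":
            continue
        if any("=" not in feat for feat in feats.split("|")):
            yield lineno, feats

sample = [
    "# sent_id = 1",
    "1\tse\tse\tDET\tDET\tCase=Nom|Gender=Masc\t2\tdet\t_\t_",
    "2\tcyning\tcyning\tNOUN\tNOUN\tpassivevoice\t0\troot\t_\t_",
]
print(list(find_bad_feats(sample)))  # flags line 3: 'passivevoice'
```

Running this over the training file in data/pos/ should point straight at the offending tokens.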

dmetola commented 9 months ago

Thanks for your response. I have sorted that out; luckily there wasn't too much to change.

When running the command line again, I'm getting the following:

/Users/dario/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
2024-01-31 16:29:02 INFO: Training program called with:
/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py UD_Old_English-TEST --max_steps 5000
2024-01-31 16:29:03 DEBUG: UD_Old_English-TEST: ang_test
2024-01-31 16:29:03 INFO: Using pretrain found in /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-01-31 16:29:03 INFO: UD_Old_English-TEST: saved_models/pos/ang_test_nocharlm_tagger.pt does not exist, training new model
2024-01-31 16:29:03 INFO: Using pretrain found in /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-01-31 16:29:03 INFO: Running train POS for UD_Old_English-TEST with args ['--wordvec_dir', '../data/wordvec', '--train_file', '../data/processed/pos/ang_test.train.in.conllu', '--output_file', '/var/folders/4c/zqzwjpjn42xbggh1sw1255nh0000gn/T/tmpbud4i8ec', '--lang', 'ang', '--shorthand', 'ang_test', '--mode', 'train', '--eval_file', '../data/processed/pos/ang_test.dev.in.conllu', '--wordvec_pretrain_file', '/Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt', '--max_steps', '5000']
2024-01-31 16:29:03 INFO: Running tagger in train mode
2024-01-31 16:29:03 INFO: Loading data with batch size 250...
2024-01-31 16:29:03 INFO: Reading ../data/processed/pos/ang_test.train.in.conllu
2024-01-31 16:29:03 INFO: Train File ../data/processed/pos/ang_test.train.in.conllu, Data Size: 1287
2024-01-31 16:29:03 WARNING: ang_test is not a known dataset.  Examining the data to choose which xpos vocab to use
2024-01-31 16:29:03 INFO: Original length = 1287
2024-01-31 16:29:03 INFO: Filtered length = 1287
2024-01-31 16:29:03 WARNING: Chose XPOSDescription(xpos_type=<XPOSType.WORD: 2>, sep=None) for the xpos factory for ang_test
2024-01-31 16:29:04 DEBUG: Loaded pretrain from /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py", line 135, in <module>
    main()
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py", line 132, in main
    common.main(run_treebank, "pos", "tagger", add_pos_args, tagger.build_argparse(), build_model_filename=build_model_filename, choose_charlm_method=choose_pos_charlm)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/common.py", line 183, in main
    run_treebank(mode, paths, treebank, short_name,
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py", line 96, in run_treebank
    tagger.main(train_args)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/tagger.py", line 143, in main
    train(args)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/tagger.py", line 234, in train
    vocab, train_data, train_batches = load_training_data(args, pretrain)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/tagger.py", line 192, in load_training_data
    train_data = [Dataset(i, args, pretrain, vocab=vocab, evaluation=False)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/tagger.py", line 192, in <listcomp>
    train_data = [Dataset(i, args, pretrain, vocab=vocab, evaluation=False)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/pos/data.py", line 47, in __init__
    self.pretrain_vocab = pretrain.vocab
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/common/pretrain.py", line 44, in vocab
    self.load()
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/common/pretrain.py", line 58, in load
    if 'emb' not in data or 'vocab' not in data:
  File "/Users/dario/Library/Python/3.9/lib/python/site-packages/torch/_tensor.py", line 1061, in __contains__
    raise RuntimeError(
RuntimeError: Tensor.__contains__ only supports Tensor or scalar, but you passed in a <class 'str'>.

Could it be that my .pt file is in the wrong format? Or is there anything else I'm missing in the code?

Thanks!

AngledLuffa commented 9 months ago

Yes, I would be suspicious of the embeddings file here. Probably another error that can be made more human readable. Are you able to share that embeddings file?


dmetola commented 9 months ago

Thanks for your response. GitHub won't allow that file format to be attached, so I have uploaded it to WeTransfer. Here's the link:

https://we.tl/t-tYD3iREXMh

Thanks!

AngledLuffa commented 9 months ago

Yes, the issue here is that this is clearly not a PT embedding file as constructed by Stanza. You could use the tool in stanza/models/common/pretrain.py or stanza/models/common/convert_pretrain.py to convert the embeddings you have into a Stanza embedding file. The missing piece is the vocab, without which it's just a big table of numbers :/
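For context, a Stanza pretrain bundles two things: a vocabulary and the embedding matrix, with the word-to-row mapping being exactly what a bare tensor loses. A rough sketch of what the converter has to recover from a word2vec-style text file (the parsing here is illustrative; the real tool is stanza/models/common/convert_pretrain.py):

```python
# Sketch: parse a word2vec-style text file into (vocab, vectors), the two
# pieces a Stanza pretrain .pt stores together. Illustrative only.

def read_word2vec_text(lines):
    """Return (vocab, vectors) from lines of 'word v1 v2 ... vN'.
    A leading 'count dim' header line, if present, is skipped."""
    vocab, vectors = [], []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # header line: vocab size and dimension
        vocab.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return vocab, vectors

lines = [
    "2 3",                 # 2 words, 3 dimensions
    "cyning 0.1 0.2 0.3",
    "se 0.4 0.5 0.6",
]
vocab, vectors = read_word2vec_text(lines)
print(vocab)            # ['cyning', 'se']
print(len(vectors[0]))  # 3
```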

I'll update that error to do more than just barf with a KeyError


dmetola commented 9 months ago

Hi,

Thanks for your message. I have used the pretrain script and it seems to work. However, when running the prepare_depparse_treebank script, the following error appears:

dario@192 VSCode-Projects % python -m stanza.utils.datasets.prepare_depparse_treebank UD_Old_English-TEST
2024-02-07 16:56:34 INFO: Datasets program called with:
/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_depparse_treebank.py UD_Old_English-TEST
2024-02-07 16:56:34 DEBUG: Downloading resource file from https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 62.0MB/s]                                                                                     
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_depparse_treebank.py", line 141, in <module>
    main()
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_depparse_treebank.py", line 137, in main
    common.main(process_treebank, common.ModelType.DEPPARSE, add_specific_args)
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/common.py", line 271, in main
    process_treebank(treebank, model_type, paths, args)
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_depparse_treebank.py", line 99, in process_treebank
    tagger_model = choose_tagger_model(short_language, dataset, args.tagger_model, args)
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_depparse_treebank.py", line 78, in choose_tagger_model
    download(lang=short_language, package=None, processors={"pos": dataset})
  File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/resources/common.py", line 568, in download
    raise UnknownLanguageError(lang)
stanza.resources.common.UnknownLanguageError: Unknown language requested: ang

Does this have something to do with the file, or its naming? If it's not too much trouble, I'm also including the WeTransfer link to the .pt file, just to make sure it is now correct and suitable to be used.

https://we.tl/t-YHjynb15BN

Thank you very much for your help in this!

AngledLuffa commented 9 months ago

I know exactly what the problem is. It's trying to download the tagger and not recognizing that you already have a trained tagger. I should be able to fix that up later today


AngledLuffa commented 9 months ago

Had you saved a previously trained tagger model somewhere other than "saved_models/pos"? If so, you can update the path with the "--save_dir" flag when using prepare_depparse_treebank. You can also give it the "--tagger_model" flag to pick a specific tagger. Alternatively, you can just use the gold tags with "--gold" to not try to predict the tags.
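Sketches of those three options (paths are examples based on the log output earlier in the thread):

```shell
# 1) Point at a tagger saved somewhere other than saved_models/pos:
python -m stanza.utils.datasets.prepare_depparse_treebank UD_Old_English-TEST \
    --save_dir /path/to/my/saved_models

# 2) Name a specific tagger model file:
python -m stanza.utils.datasets.prepare_depparse_treebank UD_Old_English-TEST \
    --tagger_model saved_models/pos/ang_test_nocharlm_tagger.pt

# 3) Skip retagging entirely and keep the gold tags:
python -m stanza.utils.datasets.prepare_depparse_treebank UD_Old_English-TEST --gold
```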

I updated the dev branch to hopefully make the error more clear.


dmetola commented 8 months ago

Hi,

Thanks for your message.

I followed your instructions, and now that issue is sorted.

When preparing the depparser to train, now I'm getting a new error:

dario@192 stanza % python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_Old_English-TEST
/Users/dario/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py", line 23, in <module>
    from stanza.utils.training.run_pos import pos_batch_size, wordvec_args
ImportError: cannot import name 'pos_batch_size' from 'stanza.utils.training.run_pos' (/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py)

Thanks!

AngledLuffa commented 8 months ago

Is the source tree you are using up to date with the current dev branch? I am not seeing that error on the dev branch. Furthermore, the imports are different at line 23 in my git clone
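Updating a local clone to the latest dev branch looks roughly like this (assuming the remote is named origin):

```shell
# Fetch the latest commits and switch to the dev branch
git fetch origin
git checkout dev
git pull --ff-only origin dev
```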


dmetola commented 8 months ago

I have pulled the dev branch again and that error doesn't appear anymore, but now I'm getting the following:

dario@192 stanza % python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_Old_English-TEST
/Users/dario/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
2024-02-14 17:10:12 INFO: Datasets program called with:
/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py UD_Old_English-TEST
2024-02-14 17:10:12 INFO: Using tagger model in saved_models/pos/ang_test_nocharlm_tagger.pt for ang_test
2024-02-14 17:10:12 INFO: Using pretrain found in /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt  To use a different pretrain, specify --wordvec_pretrain_file
Preparing data for UD_Old_English-TEST: ang_test, ang
Reading from ../data/udbase/UD_Old_English-TEST/ang_ewt-ud-train.conllu and writing to /var/folders/4c/zqzwjpjn42xbggh1sw1255nh0000gn/T/tmp9f32g6zu/ang_test.train.gold.conllu
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py", line 143, in <module>
    main()
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py", line 139, in main
    common.main(process_treebank, common.ModelType.DEPPARSE, add_specific_args)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/common.py", line 276, in main
    process_treebank(treebank, model_type, paths, args)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py", line 132, in process_treebank
    prepare_tokenizer_treebank.copy_conllu_treebank(treebank, model_type, paths, paths["DEPPARSE_DATA_DIR"], retag_dataset)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 71, in copy_conllu_treebank
    process_treebank(treebank, model_type, paths, args)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1196, in process_treebank
    process_ud_treebank(treebank, udbase_dir, tokenizer_dir, short_name, short_language, args.augment)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1090, in process_ud_treebank
    prepare_ud_dataset(treebank, udbase_dir, tokenizer_dir, short_name, short_language, "train", augment)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1079, in prepare_ud_dataset
    write_augmented_dataset(input_conllu, output_conllu, augment_punct)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 715, in write_augmented_dataset
    new_sents = augment_function(sents)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 694, in augment_punct
    new_sents = augment_apos(sents)
  File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 415, in augment_apos
    raise ValueError("Cannot find '# text' in sentences %d.  First line: %s" % (sent_idx, sent[0]))
ValueError: Cannot find '# text' in sentences 84.  First line: 1        ne      ne ‘not’        PART    particle        Uninflected=Yes 2       nmod    _       _

I have compared my data with the toy data included in the repository, and the format is the same. I have also tried running the command with the toy data, and it works without issues. Just to double-check: from what the message says, it reads the prepared data stored in processed/tokenize, am I right? If so, that file is also in the same format as the original training data.

AngledLuffa commented 8 months ago

It looks like one of the sentences is missing the "# text" comment line at the start of the sentence. The error message is telling you which sentence has that problem.
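If it helps to track the sentence down, here is a quick standalone sketch (not part of stanza, and the function name is just illustrative) that lists every sentence in a .conllu file without a "# text" comment. It assumes the standard CoNLL-U layout: sentences separated by blank lines, metadata lines starting with "#".

```python
# Sketch of a checker that reports CoNLL-U sentences lacking a "# text"
# comment line.  Assumes blank-line-separated sentence blocks, per the
# CoNLL-U format.
import sys

def find_missing_text(path):
    """Return (0-based sentence index, first line) for each sentence
    block that has no '# text' comment."""
    with open(path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")
    missing = []
    for idx, block in enumerate(blocks):
        lines = [line for line in block.splitlines() if line.strip()]
        if not lines:
            continue  # skip empty blocks from stray extra blank lines
        if not any(line.startswith("# text") for line in lines):
            missing.append((idx, lines[0]))
    return missing

if __name__ == "__main__":
    for idx, first_line in find_missing_text(sys.argv[1]):
        print(f"sentence {idx} has no '# text' line; first line: {first_line}")
```

Note the index it prints is 0-based over the blocks in the file, which may be offset from the number stanza reports, so match on the first token line as well.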


dmetola commented 8 months ago

Seems to be sorted, thanks!