This isn't a problem that will show up in the .pt file, I would say. It's in the raw data, and you can search for it somewhere in data/pos/ang_test.train... Something in the features is in the wrong format.
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/pos/data.py", line 70, in init_vocab
featsvocab = FeatureVocab(data, args['shorthand'], idx=3)
I agree the error is extremely opaque, though. I could easily change it to tell you what sentence number to look in, but I'm not sure how much that would help. What would actually help would be the line number, but that's kind of lost by that step of processing the training data.
Basically, there's a POS feature which is just `passivevoice` instead of the proper UD format `Voice=Pass`. That will be in the 6th column of data in your conllu training file.
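If it helps to track it down, a rough scan of the FEATS column is easy to script. This is just a sketch: the path is illustrative, and the regex only approximates valid UD `Name=Value|Name=Value` strings, so treat anything it flags as a candidate rather than a verdict.

```python
import re

# Approximate pattern for a UD FEATS field: Name=Value pairs joined by '|'
FEAT_RE = re.compile(r'^[A-Za-z0-9\[\]]+=[A-Za-z0-9,]+(\|[A-Za-z0-9\[\]]+=[A-Za-z0-9,]+)*$')

with open("data/pos/ang_test.train.in.conllu", encoding="utf-8") as fin:  # illustrative path
    for lineno, line in enumerate(fin, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) >= 6 and cols[5] != "_" and not FEAT_RE.match(cols[5]):
            print(f"line {lineno}: suspicious FEATS value {cols[5]!r}")
```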
Thanks for your response. I have sorted that out; luckily there wasn't too much to change.
When running the command line again, I'm getting the following:
/Users/dario/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
warnings.warn(
2024-01-31 16:29:02 INFO: Training program called with:
/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py UD_Old_English-TEST --max_steps 5000
2024-01-31 16:29:03 DEBUG: UD_Old_English-TEST: ang_test
2024-01-31 16:29:03 INFO: Using pretrain found in /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt To use a different pretrain, specify --wordvec_pretrain_file
2024-01-31 16:29:03 INFO: UD_Old_English-TEST: saved_models/pos/ang_test_nocharlm_tagger.pt does not exist, training new model
2024-01-31 16:29:03 INFO: Using pretrain found in /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt To use a different pretrain, specify --wordvec_pretrain_file
2024-01-31 16:29:03 INFO: Running train POS for UD_Old_English-TEST with args ['--wordvec_dir', '../data/wordvec', '--train_file', '../data/processed/pos/ang_test.train.in.conllu', '--output_file', '/var/folders/4c/zqzwjpjn42xbggh1sw1255nh0000gn/T/tmpbud4i8ec', '--lang', 'ang', '--shorthand', 'ang_test', '--mode', 'train', '--eval_file', '../data/processed/pos/ang_test.dev.in.conllu', '--wordvec_pretrain_file', '/Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt', '--max_steps', '5000']
2024-01-31 16:29:03 INFO: Running tagger in train mode
2024-01-31 16:29:03 INFO: Loading data with batch size 250...
2024-01-31 16:29:03 INFO: Reading ../data/processed/pos/ang_test.train.in.conllu
2024-01-31 16:29:03 INFO: Train File ../data/processed/pos/ang_test.train.in.conllu, Data Size: 1287
2024-01-31 16:29:03 WARNING: ang_test is not a known dataset. Examining the data to choose which xpos vocab to use
2024-01-31 16:29:03 INFO: Original length = 1287
2024-01-31 16:29:03 INFO: Filtered length = 1287
2024-01-31 16:29:03 WARNING: Chose XPOSDescription(xpos_type=<XPOSType.WORD: 2>, sep=None) for the xpos factory for ang_test
2024-01-31 16:29:04 DEBUG: Loaded pretrain from /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt
Traceback (most recent call last):
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py", line 135, in <module>
main()
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py", line 132, in main
common.main(run_treebank, "pos", "tagger", add_pos_args, tagger.build_argparse(), build_model_filename=build_model_filename, choose_charlm_method=choose_pos_charlm)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/common.py", line 183, in main
run_treebank(mode, paths, treebank, short_name,
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py", line 96, in run_treebank
tagger.main(train_args)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/tagger.py", line 143, in main
train(args)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/tagger.py", line 234, in train
vocab, train_data, train_batches = load_training_data(args, pretrain)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/tagger.py", line 192, in load_training_data
train_data = [Dataset(i, args, pretrain, vocab=vocab, evaluation=False)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/tagger.py", line 192, in <listcomp>
train_data = [Dataset(i, args, pretrain, vocab=vocab, evaluation=False)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/pos/data.py", line 47, in __init__
self.pretrain_vocab = pretrain.vocab
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/common/pretrain.py", line 44, in vocab
self.load()
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/models/common/pretrain.py", line 58, in load
if 'emb' not in data or 'vocab' not in data:
File "/Users/dario/Library/Python/3.9/lib/python/site-packages/torch/_tensor.py", line 1061, in __contains__
raise RuntimeError(
RuntimeError: Tensor.__contains__ only supports Tensor or scalar, but you passed in a <class 'str'>.
Could it be that my .pt file is in the wrong format? Or is there anything else I'm missing in the code?
Thanks!
Yes, I would be suspicious of the embeddings file here. Probably another error that can be made more human readable. Are you able to share that embeddings file?
Thanks for your response. GitHub won't allow that file format to be attached, so I have uploaded it to WeTransfer. Here's the link:
Thanks!
Yes, the issue here is that this is clearly not a .pt embedding file as constructed by Stanza. You could use the tool in stanza/models/common/pretrain.py or stanza/models/common/convert_pretrain.py to convert the embeddings you have into a Stanza embedding file. The thing missing is the vocab, without which it's just a big table of numbers :/
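To see what's actually in the file, a quick inspection like this can confirm it. Per the `'emb' not in data or 'vocab' not in data` check in the traceback, the loader expects a dict with those two keys; the `Tensor.__contains__` error means your file held a bare tensor instead. (A sketch; the filename is illustrative.)

```python
import torch

# Load on CPU and report the top-level structure of the pretrain file
data = torch.load("ang_embeddings.pt", map_location="cpu")
print(type(data))
if isinstance(data, dict):
    # a Stanza pretrain should expose 'emb' and 'vocab' here
    print(sorted(data.keys()))
```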
I'll update that error to do more than just barf with a KeyError
Hi,
Thanks for your message. I have used the pretrain script and it seems to work. However, when running the prepare_depparse_treebank script, the following error appears:
dario@192 VSCode-Projects % python -m stanza.utils.datasets.prepare_depparse_treebank UD_Old_English-TEST
2024-02-07 16:56:34 INFO: Datasets program called with:
/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_depparse_treebank.py UD_Old_English-TEST
2024-02-07 16:56:34 DEBUG: Downloading resource file from https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 62.0MB/s]
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_depparse_treebank.py", line 141, in <module>
main()
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_depparse_treebank.py", line 137, in main
common.main(process_treebank, common.ModelType.DEPPARSE, add_specific_args)
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/common.py", line 271, in main
process_treebank(treebank, model_type, paths, args)
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_depparse_treebank.py", line 99, in process_treebank
tagger_model = choose_tagger_model(short_language, dataset, args.tagger_model, args)
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_depparse_treebank.py", line 78, in choose_tagger_model
download(lang=short_language, package=None, processors={"pos": dataset})
File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/resources/common.py", line 568, in download
raise UnknownLanguageError(lang)
stanza.resources.common.UnknownLanguageError: Unknown language requested: ang
Does this have something to do with the file, or its naming? If it's not too much trouble, I'm also including the WeTransfer link to the .pt file, just to make sure that it is now correct and suitable to be used.
Thank you very much for your help in this!
I know exactly what the problem is. It's trying to download the tagger and not recognizing that you already have a trained tagger. I should be able to fix that up later today
Had you saved a previously trained tagger model somewhere other than "saved_models/pos"? If so, you can update the path with the "--save_dir" flag when using prepare_depparse_treebank. You can also give it the "--tagger_model" flag to pick a specific tagger. Alternatively, you can just use the gold tags with "--gold" to not try to predict the tags.
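For example, any of these should work (paths illustrative):

```
# use the gold tags instead of retagging
python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_Old_English-TEST --gold

# or point it at taggers saved somewhere other than saved_models/pos
python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_Old_English-TEST --save_dir my_models/pos
```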
I updated the dev branch to hopefully make the error more clear.
Hi,
Thanks for your message.
I followed your instructions, and now that issue is sorted.
When preparing the depparse data for training, I'm now getting a new error:
dario@192 stanza % python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_Old_English-TEST
/Users/dario/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
warnings.warn(
Traceback (most recent call last):
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py", line 23, in <module>
from stanza.utils.training.run_pos import pos_batch_size, wordvec_args
ImportError: cannot import name 'pos_batch_size' from 'stanza.utils.training.run_pos' (/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/training/run_pos.py)
Thanks!
Is the source tree you are using up to date with the current dev branch? I am not seeing that error on the dev branch. Furthermore, the imports at line 23 are different in my git clone.
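In case it's useful, syncing the clone would be something like:

```
cd /Users/dario/VSCode-Projects/stanza-train/stanza
git checkout dev
git pull
```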
I have pulled the dev branch again; that error doesn't appear anymore, but I'm getting the following:
dario@192 stanza % python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_Old_English-TEST
/Users/dario/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
warnings.warn(
2024-02-14 17:10:12 INFO: Datasets program called with:
/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py UD_Old_English-TEST
2024-02-14 17:10:12 INFO: Using tagger model in saved_models/pos/ang_test_nocharlm_tagger.pt for ang_test
2024-02-14 17:10:12 INFO: Using pretrain found in /Users/dario/stanza_resources/ang/pretrain/ang_embeddings.pt To use a different pretrain, specify --wordvec_pretrain_file
Preparing data for UD_Old_English-TEST: ang_test, ang
Reading from ../data/udbase/UD_Old_English-TEST/ang_ewt-ud-train.conllu and writing to /var/folders/4c/zqzwjpjn42xbggh1sw1255nh0000gn/T/tmp9f32g6zu/ang_test.train.gold.conllu
Traceback (most recent call last):
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py", line 143, in <module>
main()
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py", line 139, in main
common.main(process_treebank, common.ModelType.DEPPARSE, add_specific_args)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/common.py", line 276, in main
process_treebank(treebank, model_type, paths, args)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py", line 132, in process_treebank
prepare_tokenizer_treebank.copy_conllu_treebank(treebank, model_type, paths, paths["DEPPARSE_DATA_DIR"], retag_dataset)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 71, in copy_conllu_treebank
process_treebank(treebank, model_type, paths, args)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1196, in process_treebank
process_ud_treebank(treebank, udbase_dir, tokenizer_dir, short_name, short_language, args.augment)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1090, in process_ud_treebank
prepare_ud_dataset(treebank, udbase_dir, tokenizer_dir, short_name, short_language, "train", augment)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1079, in prepare_ud_dataset
write_augmented_dataset(input_conllu, output_conllu, augment_punct)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 715, in write_augmented_dataset
new_sents = augment_function(sents)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 694, in augment_punct
new_sents = augment_apos(sents)
File "/Users/dario/VSCode-Projects/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 415, in augment_apos
raise ValueError("Cannot find '# text' in sentences %d. First line: %s" % (sent_idx, sent[0]))
ValueError: Cannot find '# text' in sentences 84. First line: 1 ne ne ‘not’ PART particle Uninflected=Yes 2 nmod _ _
I have compared my data with the toy data included in the repository, and it is in the same format. I have also tried running that command with the toy data, and it works without issues. Just to double-check: according to the output, it reads the prepared data stored in processed/tokenize, am I right? If that's the case, the format is also the same as with the original training data.
It looks like one of the sentences is missing the "# text" line at the start of the sentence. It's trying to tell you which sentence has that problem.
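If the sentence number alone isn't enough, a small script can list every sentence without one. A sketch, assuming sentences in the file are separated by blank lines; the path is the one from your log:

```python
def check_text_comments(path):
    # CoNLL-U sentences are blocks separated by blank lines
    with open(path, encoding="utf-8") as fin:
        sentences = fin.read().strip().split("\n\n")
    for idx, sent in enumerate(sentences):
        lines = sent.split("\n")
        if not any(line.startswith("# text") for line in lines):
            print(f"sentence {idx}: no '# text' comment; first line: {lines[0]}")

check_text_comments("../data/udbase/UD_Old_English-TEST/ang_ewt-ud-train.conllu")
```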
Seems to be sorted, thanks!
Hi,
I am in the process of training the POS model, and I think I have the correct word vectors and word embeddings for this task.
Currently, I have:
What I'm struggling with is placing the .pt file in the corresponding directory. I have followed the same directory structure in stanza_resources as for English, and I have the .pt located in the pretrain subfolder in that directory.
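That is, the layout matches the path the trainer logged earlier:

```
stanza_resources/
└── ang/
    └── pretrain/
        └── ang_embeddings.pt
```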
When I run this command
python3 -m stanza.utils.training.run_pos UD_English-TEST --max_steps 500
I'm getting the following error