stanfordnlp / stanza-train

Model training tutorials for the Stanza Python NLP Library
https://stanfordnlp.github.io/stanza/

Assert len(splits) == 2, "Unable to process %s" % treebank #16

Open Shanmathi2002 opened 1 year ago

Shanmathi2002 commented 1 year ago

Hi,

I'm starting with the data included in Stanza's packages to learn how Stanza works before attempting to train my own language model on different data.

When I run the command,

(base) E:\stanza_model_try>python -m stanza.utils.datasets.prepare_tokenizer_treebank E:\stanza_model_try\stanza-train\data\udbase\UD_English-TEST

the following error appears:

2023-09-22 20:10:11 INFO: Datasets program called with:
C:\Users\dell\anaconda3\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py E:\stanza_model_try\stanza-train\data\udbase\UD_English-TEST
Traceback (most recent call last):
  File "C:\Users\dell\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\dell\anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\dell\anaconda3\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1217, in <module>
    main()
  File "C:\Users\dell\anaconda3\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1214, in main
    common.main(process_treebank, common.ModelType.TOKENIZER, add_specific_args)
  File "C:\Users\dell\anaconda3\lib\site-packages\stanza\utils\datasets\common.py", line 271, in main
    process_treebank(treebank, model_type, paths, args)
  File "C:\Users\dell\anaconda3\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1168, in process_treebank
    short_name = treebank_to_short_name(treebank)
  File "C:\Users\dell\anaconda3\lib\site-packages\stanza\models\common\constant.py", line 493, in treebank_to_short_name
    assert len(splits) == 2, "Unable to process %s" % treebank
AssertionError: Unable to process E:\stanza_model_try\stanza-train\data\udbase\UD_English-TEST

I looked through many issues related to this one and tried the suggested changes, but nothing worked. I'm also a beginner at this. I tried the same with the UD_Tamil-TTB treebank and got exactly the same error.

  File "C:\Users\dell\anaconda3\lib\site-packages\stanza\models\common\constant.py", line 493, in treebank_to_short_name
    assert len(splits) == 2, "Unable to process %s" % treebank
AssertionError: Unable to process E:\stanza_model_try\stanza-train\data\udbase\UD_Tamil-TTB

Below is my config file:

export DATA_ROOT="/e/stanza_model_try/stanza-train/data/processed"
export TOKENIZE_DATA_DIR=$DATA_ROOT/tokenize
export MWT_DATA_DIR=$DATA_ROOT/mwt
export LEMMA_DATA_DIR=$DATA_ROOT/lemma
export POS_DATA_DIR=$DATA_ROOT/poscd 
export DEPPARSE_DATA_DIR=$DATA_ROOT/depparse
export ETE_DATA_DIR=$DATA_ROOT/ete

I'm currently not using NER or any word vectors, so I commented that part out.

Below is my directory...

data
├── nerbase
├── processed
├── tokenize
├── udbase
│   ├── UD_English-TEST
│   │   ├── en_test-ud-dev.conllu
│   │   ├── en_test-ud-dev.txt
│   │   ├── en_test-ud-test.conllu
│   │   ├── en_test-ud-test.txt
│   │   ├── en_test-ud-train.conllu
│   │   └── en_test-ud-train.txt
│   └── UD_Tamil-TTB
│       ├── ta_ttb-ud-dev.conllu
│       ├── ta_ttb-ud-dev.txt
│       ├── ta_ttb-ud-test.conllu
│       ├── ta_ttb-ud-test.txt
│       ├── ta_ttb-ud-train.conllu
│       └── ta_ttb-ud-train.txt
└── wordvec
    └── word2vec
        └── English
            ├── en.vectors.txt
            └── en.vectors.zip

So I need help resolving this error so that I can train on my own data. Please excuse any silly mistakes, and sorry for opening this kind of issue again even though there are many closed issues for the same problem. Any suggestions or help would be appreciated.

Thanks :)

AngledLuffa commented 1 year ago

Well, it's hard to be angry over such a detailed and politely worded error message.

The issue is that the script doesn't expect a complete path to the treebank. This works for me:

[john@localhost stanza]$ ls /home/john/stanza-train/data/udbase
UD_English-TEST
[john@localhost stanza]$ export UDBASE=/home/john/stanza-train/data/udbase
[john@localhost stanza]$ python3 stanza/utils/datasets/prepare_tokenizer_treebank.py UD_English-TEST

whereas you are giving prepare_tokenizer_treebank a complete path.

I may look into making that a legal thing to do, since it seems like a natural enough option, but for now I suggest just setting your UDBASE correctly and then running with UD_English-TEST instead of the path.
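To see why a full path trips the assertion in the traceback, here is a minimal sketch. It assumes the check amounts to stripping a leading `UD_` and splitting the rest of the name on `-`; `treebank_parts` is a hypothetical stand-in written for illustration, not Stanza's actual code.

```python
def treebank_parts(treebank):
    """Hypothetical stand-in for the name splitting behind the assertion."""
    # Assumption: a leading "UD_" is dropped, then the name is split on "-".
    name = treebank[3:] if treebank.startswith("UD_") else treebank
    return name.split("-")


candidates = (
    "UD_English-TEST",
    r"E:\stanza_model_try\stanza-train\data\udbase\UD_English-TEST",
)
for candidate in candidates:
    parts = treebank_parts(candidate)
    # The assertion in the traceback expects exactly two parts.
    verdict = "ok" if len(parts) == 2 else "would fail the assertion"
    print(len(parts), verdict)
```

Under that assumption the bare treebank name gives two parts, while the full path gives three, because the hyphen in `stanza-train` adds an extra split.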

Similarly, I have no difficulty with the Tamil dataset:

[john@localhost stanza]$ echo $UDBASE
/home/john/extern_data/ud2/ud-treebanks-v2.12
[john@localhost stanza]$ python3 stanza/utils/datasets/prepare_tokenizer_treebank.py UD_Tamil-TTB
2023-09-22 09:33:28 INFO: Datasets program called with:
stanza/utils/datasets/prepare_tokenizer_treebank.py UD_Tamil-TTB
Preparing data for UD_Tamil-TTB: ta_ttb, ta

Shanmathi2002 commented 1 year ago

Thank you for your immediate response :)

[john@localhost stanza]$ ls /home/john/stanza-train/data/udbase
UD_English-TEST
[john@localhost stanza]$ export UDBASE=/home/john/stanza-train/data/udbase
[john@localhost stanza]$ python3 stanza/utils/datasets/prepare_tokenizer_treebank.py UD_English-TEST

I tried your suggestion above, but got another set of errors :(

Shrishanmathi@DESKTOP-H09P49G MINGW64 /e/stanza_model_try/stanza-train (master)
$ export UDBASE=/e/stanza_model_try/stanza-train/data/udbase

Shrishanmathi@DESKTOP-H09P49G MINGW64 /e/stanza_model_try/stanza-train (master)
$ echo $UDBASE
/e/stanza_model_try/stanza-train/data/udbase

Shrishanmathi@DESKTOP-H09P49G MINGW64 /e/stanza_model_try/stanza-train (master)
$ python -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-TEST
2023-09-23 00:08:56 INFO: Datasets program called with:
C:\Users\dell\anaconda3\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py UD_English-TEST
Traceback (most recent call last):
  File "C:\Users\dell\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\dell\anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\dell\anaconda3\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1217, in <module>
    main()
  File "C:\Users\dell\anaconda3\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1214, in main
    common.main(process_treebank, common.ModelType.TOKENIZER, add_specific_args)
  File "C:\Users\dell\anaconda3\lib\site-packages\stanza\utils\datasets\common.py", line 271, in main
    process_treebank(treebank, model_type, paths, args)
  File "C:\Users\dell\anaconda3\lib\site-packages\stanza\utils\datasets\prepare_tokenizer_treebank.py", line 1197, in process_treebank
    train_conllu_file = common.find_treebank_dataset_file(treebank, udbase_dir, "train", "conllu", fail=True)
  File "C:\Users\dell\anaconda3\lib\site-packages\stanza\utils\datasets\common.py", line 170, in find_treebank_dataset_file
    raise FileNotFoundError("Could not find any treebank files which matched {}".format(filename))
FileNotFoundError: Could not find any treebank files which matched /e/stanza_model_try/stanza-train/data/udbase\UD_English-TEST\*-ud-train.conllu

I guess it has something to do with my paths. I will check them again; any comments or suggestions from your side would also be helpful.

Thanks !

AngledLuffa commented 1 year ago

I don't know anything about mingw, but on regular windows, I would expect the path to be e:\...

So for example, on my Windows machine with Java installed, I can do:

C:\Users\horat>echo %JAVA_HOME%
C:\Program Files\Java\jdk-14.0.1

C:\Users\horat>dir "%JAVA_HOME%"
 Volume in drive C is Windows
 Volume Serial Number is 98CE-4900

 Directory of C:\Program Files\Java\jdk-14.0.1

2020-05-25  04:14 PM    <DIR>          .
....

Shanmathi2002 commented 1 year ago

I'm really sorry for the delayed response from my side,

I don't know anything about mingw, but on regular windows, I would expect the path to be e:\...

I'm on Windows and use Git Bash to run those Linux commands on my PC. "MINGW64" in the terminal prompt just indicates that I'm using the Git Bash shell on a Windows system. I went through all the steps once again, checked my config.sh file, and made the required path changes. Then I executed the following command:

/e/stanza_model_try/stanza-train/stanza (main) $ python -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-TEST

Then I get the following error:

  File "E:\stanza_model_try\stanza-train\stanza\stanza\utils\datasets\common.py", line 170, in find_treebank_dataset_file
    raise FileNotFoundError("Could not find any treebank files which matched {}".format(filename))
FileNotFoundError: Could not find any treebank files which matched /e/stanza_model_try/stanza-train/data/udbase\UD_English-TEST\*-ud-train.conllu

But when I give the full path instead of UD_English-TEST, I get the error I mentioned in my first comment [https://github.com/stanfordnlp/stanza-train/issues/16#issue-1909166003].

So now it seems I'm stuck between these two errors. Is there anything I'm missing?

Thanks in Advance !

AngledLuffa commented 1 year ago

As I mentioned above, based on what I know of Windows paths, your UDBASE path almost certainly needs to start with e:\stanza_model_try\stanza-train
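For anyone hitting the same thing from Git Bash, here is a small sketch of the conversion. It assumes the shell shows drives as `/e/...` while the Windows build of Python needs a drive-letter path such as `E:\...`; `msys_to_windows` is a hypothetical helper written for illustration, not part of Stanza.

```python
import re
from pathlib import PureWindowsPath


def msys_to_windows(path):
    """Turn a Git Bash style path like /e/dir into E:\\dir (hypothetical helper)."""
    # Assumption: a single letter right after the leading slash is a drive letter.
    match = re.match(r"^/([A-Za-z])(/.*)?$", path)
    if match is None:
        return path  # not in /<drive>/... form, leave it untouched
    drive = match.group(1).upper()
    rest = match.group(2) or "/"
    # PureWindowsPath formats with backslashes on any operating system.
    return str(PureWindowsPath(drive + ":" + rest))


print(msys_to_windows("/e/stanza_model_try/stanza-train/data/udbase"))
```

The printed value is the form UDBASE would need to take for the Windows Python interpreter to find the treebank files.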

Shanmathi2002 commented 1 year ago

Hey, thanks for your reply. It worked!

Now I'm facing another issue when executing the next command:

$ python -m stanza.utils.training.run_tokenizer UD_English-TEST --step 500

2023-09-24 18:42:37 INFO: Training program called with:
E:\stanza_model_try\stanza-train\stanza\stanza\utils\training\run_tokenizer.py UD_English-TEST --step 500
2023-09-24 18:42:37 DEBUG: UD_English-TEST: en_test
2023-09-24 18:42:37 INFO: Save file for en_test model: en_test_tokenizer.pt
2023-09-24 18:42:37 INFO: UD_English-TEST: saved_models\tokenize\en_test_tokenizer.pt does not exist, training new model
2023-09-24 18:42:37 INFO: Running train step with args: ['--label_file', 'E:\\stanza_model_try\\stanza-train\\data\\processedtokenize/en_test-ud-train.toklabels', '--txt_file', 'E:\\stanza_model_try\\stanza-train\\data\\processedtokenize/en_test.train.txt', '--lang', 'en', '--max_seqlen', '100', '--mwt_json_file', 'E:\\stanza_model_try\\stanza-train\\data\\processedtokenize/en_test-ud-dev-mwt.json', '--dev_txt_file', 'E:\\stanza_model_try\\stanza-train\\data\\processedtokenize/en_test.dev.txt', '--dev_label_file', 'E:\\stanza_model_try\\stanza-train\\data\\processedtokenize/en_test-ud-dev.toklabels', '--dev_conll_gold', 'E:\\stanza_model_try\\stanza-train\\data\\processedtokenize/en_test.dev.gold.conllu', '--conll_file', 'C:\\Users\\dell\\AppData\\Local\\Temp\\tmpb9uwiz7b', '--shorthand', 'en_test', '--step', '500', '--save_name', 'en_test_tokenizer.pt', '--save_dir', 'saved_models\\tokenize']
2023-09-24 18:42:37 INFO: Running tokenizer in train mode
2023-09-24 18:42:37 DEBUG: 2 sentences loaded.
2023-09-24 18:42:37 INFO: Found no mwts in the training data.  Setting use_mwt to False
2023-09-24 18:42:40 INFO: Step     20/   500 Loss: 0.541
2023-09-24 18:42:41 INFO: Step     40/   500 Loss: 0.437
2023-09-24 18:42:43 INFO: Step     60/   500 Loss: 0.188
2023-09-24 18:42:45 INFO: Step     80/   500 Loss: 0.094
2023-09-24 18:42:47 INFO: Step    100/   500 Loss: 0.064
2023-09-24 18:42:48 INFO: Step    120/   500 Loss: 0.056
2023-09-24 18:42:50 INFO: Step    140/   500 Loss: 0.032
2023-09-24 18:42:52 INFO: Step    160/   500 Loss: 0.015
2023-09-24 18:42:54 INFO: Step    180/   500 Loss: 0.021
2023-09-24 18:42:55 INFO: Step    200/   500 Loss: 0.004
Traceback (most recent call last):
  File "C:\Users\dell\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\dell\anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "E:\stanza_model_try\stanza-train\stanza\stanza\utils\training\run_tokenizer.py", line 109, in <module>
    main()
  File "E:\stanza_model_try\stanza-train\stanza\stanza\utils\training\run_tokenizer.py", line 106, in main
    common.main(run_treebank, "tokenize", "tokenizer", sub_argparse=tokenizer.build_argparse())
  File "E:\stanza_model_try\stanza-train\stanza\stanza\utils\training\common.py", line 183, in main
    run_treebank(mode, paths, treebank, short_name,
  File "E:\stanza_model_try\stanza-train\stanza\stanza\utils\training\run_tokenizer.py", line 80, in run_treebank
    tokenizer.main(train_args)
  File "E:\stanza_model_try\stanza-train\stanza\stanza\models\tokenizer.py", line 128, in main
    train(args)
  File "E:\stanza_model_try\stanza-train\stanza\stanza\models\tokenizer.py", line 202, in train
    dev_score = eval_model(args, trainer, dev_batches, vocab, mwt_dict)
  File "E:\stanza_model_try\stanza-train\stanza\stanza\models\tokenization\utils.py", line 446, in eval_model
    oov_count, N, all_preds, doc = output_predictions(args['conll_file'], trainer, batches, vocab, mwt_dict, args['max_seqlen'])
  File "E:\stanza_model_try\stanza-train\stanza\stanza\models\tokenization\utils.py", line 329, in output_predictions
    if output_file: CoNLL.dict2conll(doc, output_file)
  File "E:\stanza_model_try\stanza-train\stanza\stanza\utils\conll.py", line 153, in dict2conll
    CoNLL.write_doc2conll(doc, filename)
  File "E:\stanza_model_try\stanza-train\stanza\stanza\utils\conll.py", line 197, in write_doc2conll
    with open(filename, mode='w', encoding=encoding) as outfile:
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\dell\\AppData\\Local\\Temp\\tmpb9uwiz7b'

Since the error is related to access permissions, I ran it with "Run as administrator", but that didn't work. I also searched for the error on Stack Overflow and checked the permissions on the folder; all permissions are already granted. The file mentioned in the error is created in the directory, but by the end it is no longer there. I have attached the images below for clarification.

[three attached images]

If there is anything you can help with, I would be grateful. I would also like to know if there are any alternative solutions for this, and sorry again for troubling you.

Thanks !

AngledLuffa commented 1 year ago

It looks like your temp folder is not accessible, for whatever reason. I would encourage you not to use admin privileges to work around issues like this, but rather fix the underlying issue.

One option would be to change your TEMP path to something you can access. For example, mine is set to

TEMP=C:\Users\horat\AppData\Local\Temp

Another would be to somehow make it accessible. I found this, but I'm not a Windows expert, so I encourage you to figure it out yourself:

https://community.spiceworks.com/topic/2300942-windows-10-temp-folder-access-denied
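Before changing anything, it may help to confirm which temp directory Python actually picks up and whether a file can be created there. This is a minimal sketch using only the standard library; `temp_dir_is_writable` is a hypothetical helper name, and the check only covers creating, writing, and deleting a fresh file, so it may not reproduce every way the training script touches its temp file.

```python
import os
import tempfile


def temp_dir_is_writable(directory=None):
    """Return True if a file can be created, written, and removed (hypothetical helper)."""
    # tempfile.gettempdir() consults the TMPDIR, TEMP, and TMP environment variables.
    directory = directory or tempfile.gettempdir()
    try:
        fd, path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as handle:
            handle.write("probe")
        os.remove(path)
    except OSError:
        # PermissionError and FileNotFoundError are both subclasses of OSError.
        return False
    return True


print(tempfile.gettempdir(), temp_dir_is_writable())
```

If this prints `False`, pointing TEMP at a directory you own and rerunning the check is a quick way to test the first option above.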