prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit
MIT License

Getting error when pretraining with new languages sanskrit #34

Open Aniruddha-JU opened 2 years ago

Aniruddha-JU commented 2 years ago

We are trying to pre-train a model initialized from IndicBART. We use the command below:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs sa --mono_src examples/data/train.sa --batch_size 8 --batch_size_indicates_lines --shard_files --model_path ai4bharat/IndicBART

We are getting the error below:

Calling AlbertTokenizer.from_pretrained() with the path to a single file or url is deprecated

Traceback (most recent call last):
  File "pretrain_nmt.py", line 968, in <module>
    run_demo()
  File "pretrain_nmt.py", line 965, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args, files, train_files,))
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/aniruddha/machine_translation/yanmtt/pretrain_nmt.py", line 85, in model_create_load_run_save
    tok = AlbertTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)
  File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1789, in from_pretrained
    resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
  File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1860, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/models/albert/tokenization_albert.py", line 153, in __init__
    self.sp_model.Load(vocab_file)
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)

Aniruddha-JU commented 2 years ago

    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: /sentencepiece/python/bundled/sentencepiece/src/sentencepiece_processor.cc(848) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

prajdabre commented 2 years ago

Hi,

I think you are not using the version of transformers provided with the toolkit. Either that, or your sentencepiece version is not the one in the requirements.txt file.

Kindly uninstall any existing version of transformers with "pip uninstall transformers", then install the version I have provided in the transformers folder with "cd transformers && python setup.py install".

Also, your command needs some fixing.

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs XX --mono_src examples/data/train.sa --batch_size 8 --batch_size_indicates_lines --shard_files --model_path <local path like /home/raj/model_folder/model>

XX should be one of the 11 language tokens that the model supports. Currently, I have not yet included a method to specify new languages, so the way to bypass this is to use one of the existing tokens: as, bn, gu, hi, kn, ml, mr, or, pa, ta, te. Typically, choose a token you don't plan to use in any fine-tuning experiments.

Aniruddha-JU commented 2 years ago

Hi, thanks for your reply. I am getting the error when I use this command:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi --mono_src /home/aniruddha/sanjana/train.hi --batch_size 8 --batch_size_indicates_lines --shard_files --model_path /home/aniruddha/IndicBART.ckpt --port 8080


Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1860, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/models/mbart/tokenization_mbart.py", line 97, in __init__
    super().__init__(*args, tokenizer_file=tokenizer_file, **kwargs)
  File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 135, in __init__
    self.sp_model.Load(str(vocab_file))
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "None": No such file or directory Error #2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/aniruddha/machine_translation/yanmtt/pretrain_nmt.py", line 85, in model_create_load_run_save
    tok = MBartTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)
  File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1789, in from_pretrained
    resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
  File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1863, in _from_pretrained
    "Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.


Aniruddha-JU commented 2 years ago

But when I set --model_path to an empty folder, the code runs:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi --mono_src /home/aniruddha/sanjana/train.hi --batch_size 8 --batch_size_indicates_lines --shard_files --model_path IndicBART --port 8080

prajdabre commented 2 years ago

Hi,

The error made me realize that there was a tiny bug.

elif "IndicBART" in args.pretrained_model:
    tok = MBartTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)

Should be:

elif "IndicBART" in args.pretrained_model:
    tok = AlbertTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)

I'm surprised that it actually worked; it should have thrown an error.

Also, the way you specify --model_path should be /home/aniruddha/IndicBART.ckpt/model.

It is actually path + "/" + prefix, where path = /home/aniruddha/IndicBART.ckpt and prefix = model.

That's something I should clarify better in the documentation.
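A minimal sketch of that path convention (illustrative only; the function name is made up, the behavior just mirrors the path + "/" + prefix rule described here):

```python
def make_model_path(save_dir: str, prefix: str) -> str:
    """Join the save directory and the checkpoint prefix.

    yanmtt treats --model_path as path + "/" + prefix, so checkpoint files
    end up as prefix.* inside the given directory.
    """
    return save_dir + "/" + prefix

print(make_model_path("/home/aniruddha/IndicBART.ckpt", "model"))
# -> /home/aniruddha/IndicBART.ckpt/model
```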

Please pull the latest code after 15 mins.

Aniruddha-JU commented 2 years ago

Hi, I realized that and changed it earlier. I have one query as well: the --model_path argument in the above command is not used to initialize any model when --use_official_pretrained and --pretrained_model are given. Am I right? Can you please verify?

prajdabre commented 2 years ago

Model path is the place where the model is saved. Pretrained model is where the parameters are loaded from.

Aniruddha-JU commented 2 years ago

So, we should not give any existing model path, right? Rather, I am giving a new path where the new pre-trained model will be saved. Am I right? Please confirm it once. --model_path ai4bhart/IndicBART -- this ai4bhart/IndicBART is a new directory.

Aniruddha-JU commented 2 years ago

Since we are using args.use_official_pretrained, we don't need to give any existing model path, because in your code model_path is used to store the model, config, and tokenizer. Am I right?

prajdabre commented 2 years ago

Both paths are needed: one is for loading, one is for saving. If you don't use a pretrained model, then just use --model_path.

If you don't specify --model_path, then the model will be saved under the argument's default value (please check the code).

model_path should be a local path. I think there is some confusion.

  1. ai4bhart/IndicBART is not a local path. It is an identifier for the Hugging Face hub.
  2. Since it is a pretrained model, it should be passed to --pretrained_model.
  3. Since this is an official model on the Hugging Face hub, you need to specify an additional flag: --use_official_pretrained

In my fixed version of the code, if --use_official_pretrained is used, then the config and model are loaded from --pretrained_model and the tokenizer is loaded from --tokenizer_name_or_path.
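A hedged sketch of that loading rule (the helper and its return shape are hypothetical; only the argument names come from the toolkit):

```python
from types import SimpleNamespace

def resolve_load_sources(args):
    """Describe where the config/model and the tokenizer are loaded from.

    With --use_official_pretrained, both strings are Hugging Face hub
    identifiers; otherwise they must be local paths.
    """
    source = "huggingface-hub" if args.use_official_pretrained else "local-path"
    return {"config_and_model": (source, args.pretrained_model),
            "tokenizer": (source, args.tokenizer_name_or_path)}

args = SimpleNamespace(use_official_pretrained=True,
                       pretrained_model="ai4bharat/IndicBART",
                       tokenizer_name_or_path="ai4bharat/IndicBART")
print(resolve_load_sources(args))
```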

Your use case is simple: fine-tune IndicBART on your own monolingual data. The following command is sufficient:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi --mono_src ../data/hi/hi.txt.00 --batch_size 8 --batch_size_indicates_lines --shard_files --model_path /tmp/model --port 8080

--pretrained_model ai4bharat/IndicBART because you want to load the official IndicBART model from the HF hub. If you had instead downloaded the IndicBART model from "https://github.com/AI4Bharat/indic-bart", you would first have to download the model checkpoint and tokenizer locally and then pass their paths to --pretrained_model and --tokenizer_name_or_path.

--use_official_pretrained because you are loading the official IndicBART model from HF hub.

--model_path /tmp/model because you want to save your model in the /tmp folder. Model files will have several suffixes depending on their use; you will only need the file model.pure_model.

Aniruddha-JU commented 2 years ago

Hi, thank you for your reply. Yes, model_path should be a local path. I actually created it as ai4bhart/IndicBART, like a Hugging Face model name, and I have verified that the model is saved to this path. Thank you.

Aniruddha-JU commented 2 years ago

Hi, I am noticing one point: your code only works when I use the .hi extension; otherwise it errors. For example, when I pass train.kn it errors, and when I rename the file to train.hi it works.
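The constraint being observed here can be sketched as follows (the rule is inferred from this thread, not from the toolkit's documentation: the monolingual file's extension must match the value passed to --langs):

```python
import os

def mono_src_matches_langs(langs: str, mono_src: str) -> bool:
    """Check the observed constraint: the training file's extension
    must equal the --langs token."""
    ext = os.path.splitext(mono_src)[1].lstrip(".")
    return ext == langs

print(mono_src_matches_langs("hi", "/home/aniruddha/sanjana/train.hi"))  # True
print(mono_src_matches_langs("hi", "/home/aniruddha/sanjana/train.kn"))  # False
```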
