huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

run_t5_mlm_flax.py #14199

Closed Arij-Aladel closed 2 years ago

Arij-Aladel commented 3 years ago

Hi @patil-suraj! Which arg should I use with run_t5_mlm_flax.py to run it on multiple GPUs?

patil-suraj commented 3 years ago

I haven't tested it, but it should run as-is on a single host with multiple GPUs. You just need to install a JAX version compatible with your CUDA installation; you can find the instructions here.
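
As a quick sanity check (a suggestion, not something from the script itself): once a CUDA-enabled jaxlib is installed, JAX should report every GPU, and the script then shards each batch across all visible devices on its own.

```python
# Quick check that JAX actually sees the GPUs; with the CPU-only wheel
# installed these print "cpu" and 1 instead.
import jax

print(jax.default_backend())     # expect "gpu"
print(jax.local_device_count())  # expect the number of visible GPUs
print(jax.devices())
```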

Arij-Aladel commented 2 years ago

Using the 't5-small' tokenizer is wrong; we should use a tokenizer pretrained on the target data. I tried it and got very low accuracy. Using wikitext/wikitext-103-raw-v1 as the dataset for 10 epochs and pretraining the tokenizer on wikitext-103-raw-v1, I got 0.5838 accuracy, but with the t5-small tokenizer the accuracy was very low, and no masking was happening, just token deletion.
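
For reference, a rough sketch of one way to train a tokenizer on the wikitext-103-raw-v1 train split. This is illustrative only, not necessarily the exact procedure I followed: it uses `train_new_from_iterator`, and the vocab size and batch size are assumptions; the output path matches the `--tokenizer_name` used in the commands below.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the wikitext train split in batches of raw text.
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Reuse t5-small's tokenization pipeline but learn a new vocabulary on wikitext.
base = AutoTokenizer.from_pretrained("t5-small")
tokenizer = base.train_new_from_iterator(batch_iterator(), vocab_size=32000)
tokenizer.save_pretrained("./wikitext-103-raw-v1")  # same path passed to --tokenizer_name
```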

Arij-Aladel commented 2 years ago

@patil-suraj this is my attempt to run the same steps using PyTorch. I tried the t5-small tokenizer, and I also trained the tokenizer from this repo on wikitext to compare.

The results are not the same, which seems strange. Training for 10 epochs using:

  1. if the tokenizer is trained on wiki: export CUDA_VISIBLE_DEVICES=0,1,2,3; python3 run_t5_mlm_flax.py --output_dir="./MLM-128wiki/wikitokenizer" --model_type="t5" --config_name="./wikitext-103-raw-v1" --tokenizer_name="./wikitext-103-raw-v1" --dataset_name="wikitext" --dataset_config_name="wikitext-103-raw-v1" --max_seq_length="128" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --adafactor --learning_rate="0.005" --weight_decay="0.001" --warmup_steps="2000" --overwrite_output_dir --logging_steps="500" --save_steps="10000" --eval_steps="500" --num_train_epochs=10

  2. if the tokenizer is the t5-small tokenizer: export CUDA_VISIBLE_DEVICES=0,1,2,3; python3 run_t5_mlm_flax.py --output_dir="./MLM-128wiki/t5-tokenizer" --model_type="t5" --config_name="./wikitext-103-raw-v1" --tokenizer_name="t5-small" --dataset_name="wikitext" --dataset_config_name="wikitext-103-raw-v1" --max_seq_length="128" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --adafactor --learning_rate="0.005" --weight_decay="0.001" --warmup_steps="2000" --overwrite_output_dir --logging_steps="500" --save_steps="10000" --eval_steps="500" --num_train_epochs=10

results:

|            | T5 tokenizer | tokenizer trained on wiki |
|------------|--------------|---------------------------|
| train loss | 2.307        | 2.074                     |
| eval loss  | 2.254        | 1.959                     |

Using my code, as follows:

1. if the tokenizer is trained on wiki:

export CUDA_VISIBLE_DEVICES=0,1,2,3; python3 rum_mlm_torch.py --output_dir="./torch/wiki" --model_type="t5" --config_name="./wikitext-103-raw-v1" --tokenizer_name="./wikitext-103-raw-v1" --dataset_name="wikitext" --dataset_config_name="wikitext-103-raw-v1" --max_seq_length="128" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --adafactor --learning_rate="0.005" --weight_decay="0.001" --warmup_steps="2000" --logging_steps="500" --save_steps="10000" --eval_steps="1000" --do_train --do_eval --do_predict --overwrite_output_dir --report_to='wandb' --num_train_epochs=10 --evaluation_strategy steps

2. if the tokenizer is the t5-small tokenizer:

export CUDA_VISIBLE_DEVICES=0,1,2,3; python3 rum_mlm_torch.py --output_dir="./torch/t5tokenizer" --model_type="t5" --config_name="./wikitext-103-raw-v1" --tokenizer_name="t5-small" --dataset_name="wikitext" --dataset_config_name="wikitext-103-raw-v1" --max_seq_length="128" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --adafactor --learning_rate="0.005" --weight_decay="0.001" --warmup_steps="2000" --logging_steps="500" --save_steps="10000" --eval_steps="1000" --do_train --do_eval --do_predict --overwrite_output_dir --report_to='wandb' --num_train_epochs=10 --evaluation_strategy steps

results:

|            | T5 tokenizer | tokenizer trained on wiki |
|------------|--------------|---------------------------|
| train loss | 4.675        | 3.961                     |
| eval loss  | 4.562        | 3.8                       |

@patil-suraj @patrickvonplaten, is there any explanation for why Flax gives much better results than PyTorch here?

patrickvonplaten commented 2 years ago

@Arij-Aladel - could you specify your question here a bit? What exactly is the issue?

Arij-Aladel commented 2 years ago

@patrickvonplaten I need to train T5 from Hugging Face from scratch on the MLM task using PyTorch. To my knowledge, there is no example in your repo that does this. The main issue is that the same dataset preprocessing with the same T5 model, but with two different frameworks (Flax and PyTorch), gave me different results. I did not change anything in the original run_t5_mlm_flax.py code; I just tried to use PyTorch and Trainer instead. Everything else is still as in the original code, so why am I getting different results? I need a torch version because I have already built my own model based on T5 from Hugging Face, and I also need to train that model on the MLM task and compare it with T5 from Hugging Face. That is why I started with T5 as a baseline. As a first step, I decided to use the wikitext-103-raw-v1 dataset for pretraining. The first question on my mind was which tokenizer to use, so I tried the t5-small tokenizer for pretraining with the original script, and then I trained the tokenizer on the train split of wikitext-103-raw-v1.

  1. The first issue is that using a tokenizer pretrained on the wikitext-103-raw-v1 dataset gave better results, and this raises another question: if I need to pretrain the model on the MLM task and then fine-tune it on another task, which tokenizer should I use? Do I need to retrain the tokenizer every time I use a new dataset, simply use the t5-small tokenizer everywhere, or decide up front which datasets my experiments will use, train the tokenizer on all their train splits, and then do the pretraining and fine-tuning?
  2. The second issue is that trying to mimic run_t5_mlm_flax.py with the torch Trainer, keeping the dataset preprocessing and the collator class unchanged, gave unsatisfactory results even when I trained for 100 epochs; 10 epochs with the original script still gives better results. Can you please guide me toward the reason? I do not need the Flax version; I need a torch pipeline to train T5 on the MLM task from scratch. It seems my attempt was not good (a simplified sketch of the collator I mean is below).
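
To make the comparison concrete, here is a simplified, illustrative sketch of the kind of span-masking collator I mean for the PyTorch Trainer. It is not my exact code: the class name is made up, it samples noise positions per token instead of spans with a mean length of 3 like the original Flax collator, and it assumes the tokenizer defines the `<extra_id_*>` sentinel tokens.

```python
import numpy as np


class SimpleT5SpanMaskCollator:
    """Illustrative T5 span-corruption collator for the PyTorch Trainer."""

    def __init__(self, tokenizer, noise_density=0.15):
        self.tokenizer = tokenizer
        self.noise_density = noise_density
        # In T5 tokenizers <extra_id_0> has the highest id; later sentinels count down.
        self.first_sentinel = tokenizer.convert_tokens_to_ids("<extra_id_0>")

    def __call__(self, examples):
        inputs, labels = [], []
        for ex in examples:
            ids = np.asarray(ex["input_ids"])
            is_noise = np.random.rand(len(ids)) < self.noise_density
            inp, lab = self._corrupt(ids, is_noise)
            inputs.append(inp)
            labels.append(lab)
        batch = self.tokenizer.pad({"input_ids": inputs}, return_tensors="pt")
        lab = self.tokenizer.pad({"input_ids": labels}, return_tensors="pt")["input_ids"]
        lab[lab == self.tokenizer.pad_token_id] = -100  # ignore padding in the loss
        batch["labels"] = lab
        return dict(batch)

    def _corrupt(self, ids, is_noise):
        # Replace each contiguous noised span with one sentinel in the input;
        # labels are "<sentinel> original tokens" for every span, plus EOS.
        sentinel = self.first_sentinel
        inp, lab = [], []
        i = 0
        while i < len(ids):
            if is_noise[i]:
                inp.append(sentinel)
                lab.append(sentinel)
                while i < len(ids) and is_noise[i]:
                    lab.append(int(ids[i]))
                    i += 1
                sentinel -= 1  # next span gets <extra_id_1>, <extra_id_2>, ...
            else:
                inp.append(int(ids[i]))
                i += 1
        lab.append(self.tokenizer.eos_token_id)
        return inp, lab
```

The collator is then passed to the Trainer as data_collator=SimpleT5SpanMaskCollator(tokenizer) together with a T5ForConditionalGeneration initialized from the config.
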
patrickvonplaten commented 2 years ago

Hey @Arij-Aladel,

We currently don't have any support to pre-train T5 from scratch in PyTorch. We only have a script in Flax and we recommend https://github.com/google-research/text-to-text-transfer-transformer for training in TF.

Could you maybe instead try whether you can find support for pretraining in PyTorch on the forum: https://discuss.huggingface.co/ ?

Thanks!

Arij-Aladel commented 2 years ago

@patrickvonplaten I know; that is why I tried it myself, and the performance of the PyTorch version is not satisfactory even though the masking pipeline is the same (I have traced it). That is why I asked why the performance of Flax T5 differs from that of PyTorch T5.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ToluClassics commented 2 years ago

@patrickvonplaten When I log jax.local_device_count() and jax.device_count() while running the run_t5_mlm_flax.py script, they return 1 even though I'm training with multiple GPUs on a single host. Any ideas how I can fix this?

patrickvonplaten commented 2 years ago

Hey @ToluClassics - this issue seems to be related to JAX rather than the Transformers library - could you try to open an issue there? :-)

Eurus-Holmes commented 2 years ago

@Arij-Aladel Hi, do you have any updates on this issue? I'm also trying to pre-train T5 from scratch in PyTorch; can you share your scripts based on run_t5_mlm_flax.py? I just converted the parameters trained with Flax to PyTorch:

from transformers import FlaxT5ForConditionalGeneration, T5ForConditionalGeneration

model = FlaxT5ForConditionalGeneration.from_pretrained(pretrained_path)  # load the Flax checkpoint
pt_model = T5ForConditionalGeneration.from_pretrained(tmp_path, from_flax=True)  # from_flax expects flax_model.msgpack (+ config.json) in tmp_path

but it doesn't seem to work.
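
One way to narrow this down (a sketch with placeholder paths, not a verified fix) is to load the same checkpoint in both frameworks and compare the logits for a single batch; if they already diverge here, the weight conversion itself is the problem rather than the training setup.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration, T5ForConditionalGeneration

path = "path/to/flax_checkpoint"  # directory with config.json, flax_model.msgpack, and the tokenizer files

tokenizer = AutoTokenizer.from_pretrained(path)
flax_model = FlaxT5ForConditionalGeneration.from_pretrained(path)
pt_model = T5ForConditionalGeneration.from_pretrained(path, from_flax=True)
pt_model.eval()

enc = tokenizer("a small test sentence", return_tensors="np")
dec = np.array([[pt_model.config.decoder_start_token_id]])

flax_logits = flax_model(input_ids=enc["input_ids"], decoder_input_ids=dec).logits
with torch.no_grad():
    pt_logits = pt_model(
        input_ids=torch.from_numpy(enc["input_ids"]).long(),
        decoder_input_ids=torch.from_numpy(dec).long(),
    ).logits

print(np.abs(np.asarray(flax_logits) - pt_logits.numpy()).max())  # should be small, e.g. < 1e-3
pt_model.save_pretrained("path/to/pytorch_checkpoint")  # persist the converted weights
```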

And for your first issue, I think we need to retrain the tokenizer every time we use a new dataset.

Arij-Aladel commented 2 years ago

@Eurus-Holmes Hi! Yes, of course. You can find an example here: https://github.com/Arij-Aladel/T5-Tasks