bestpredicts closed this issue 1 year ago
Hi @bestpredicts, thanks for raising this issue.
I can confirm that I see the same error with the most recent version of transformers and pytorch 2. I wasn't able to replicate the issue with pytorch 1.13.1 and the same transformers version.
Following the messages in the shared error output, if I set LOCAL_RANK in my environment and pass in --use-env, I am able to run on pytorch 2.
LOCAL_RANK=0,1 CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node 2 --use-env examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
Also note that torch.distributed.launch is deprecated and torchrun is preferred in PyTorch 2.0.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Has anyone solved this problem? I get the same problem when using torchrun or torch.distributed.launch: self.local_rank is -1. My env is pytorch==2.0.0 and transformers==4.30.1.
You might try migrating to torchrun? i.e.:
torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
for reference on migrating: https://pytorch.org/docs/stable/elastic/run.html
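For reference, torchrun passes rank information through environment variables rather than a --local_rank argument. A minimal sketch of reading it inside a training script (the -1 fallback mirrors the usual single-process default):

```python
import os

# torchrun exports LOCAL_RANK (along with RANK and WORLD_SIZE) for every
# worker process, so the script no longer needs a --local_rank CLI flag.
local_rank = int(os.environ.get("LOCAL_RANK", -1))

if local_rank == -1:
    print("not launched by torchrun; running single-process")
else:
    print(f"worker local rank: {local_rank}")
```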
Have you solved your problem? I ran into the same error when using deepspeed. The solutions provided above didn't work at all. :(
Also note that torch.distributed.launch is deprecated and torchrun is preferred in PyTorch 2.0.
Thanks for this tip.
watching
Print from sys.argv:
['train.py', '--local-rank=0', '--model_name_or_path', './checkpoints/vicuna-7b-v1.5', ...]
The other arguments have the format 'key', 'value', but local_rank is not properly parsed: in the above example, --local-rank=0 is treated as a single token. I think this may be something wrong with torch.distributed.launch, since it appends --local-rank=0 to the argument list, but the appended argument cannot be properly parsed by HfArgumentParser.
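The mismatch can be reproduced with plain argparse, which HfArgumentParser is built on: the hyphenated spelling the launcher appends does not match the underscore spelling the parser declares.

```python
import argparse

# The parser declares --local_rank (underscore), as HfArgumentParser
# does for TrainingArguments.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)

# torch.distributed.launch appends --local-rank=0 (hyphen), which this
# parser does not recognize, so the default of -1 is kept.
args, unknown = parser.parse_known_args(["--local-rank=0"])
print(args.local_rank, unknown)   # -1 ['--local-rank=0']

# The underscore spelling parses as intended.
args2, _ = parser.parse_known_args(["--local_rank=0"])
print(args2.local_rank)           # 0
```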
So an optional solution is to use torchrun, for which the --use-env behavior is the default: the worker receives the environment variable LOCAL_RANK instead of a --local_rank argument.
A hack fix is to add this before parse_args_into_dataclasses():
import sys

# Rewrite the '--local-rank=0' form appended by the launcher into the
# '--local_rank 0' pair that HfArgumentParser understands.
for arg in list(sys.argv):  # iterate over a copy while mutating sys.argv
    if arg.startswith("--local-rank="):
        rank = arg.split("=")[1]
        sys.argv.remove(arg)
        sys.argv.append("--local_rank")
        sys.argv.append(rank)
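Applied to a sample argument list (the model name here is just illustrative), the rewrite turns the hyphenated token into the pair HfArgumentParser expects:

```python
import sys

# Simulate argv as produced by torch.distributed.launch.
sys.argv = ["train.py", "--local-rank=0", "--model_name_or_path", "gpt2"]

for arg in list(sys.argv):  # iterate over a copy while mutating sys.argv
    if arg.startswith("--local-rank="):
        rank = arg.split("=")[1]
        sys.argv.remove(arg)
        sys.argv.append("--local_rank")
        sys.argv.append(rank)

print(sys.argv)
# ['train.py', '--model_name_or_path', 'gpt2', '--local_rank', '0']
```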
i have this problem
ValueError: Some specified arguments are not used by the HfArgumentParser: ['-f', '/root/.local/share/jupyter/runtime/kernel-8d0db21b-3ec1-4b17-987c-be497d81b3c5.json']
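That particular error comes from Jupyter injecting '-f <kernel connection file>' into sys.argv. Since HfArgumentParser is built on argparse, one workaround is to parse only known arguments and ignore the rest; a sketch using plain argparse to illustrate (the option and path are illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--output_dir")

# In a notebook, Jupyter appends '-f <kernel json>'; parse_known_args
# collects it in the 'unknown' list instead of raising an error.
args, unknown = parser.parse_known_args(
    ["--output_dir", "/tmp/out", "-f", "/root/.local/share/jupyter/runtime/kernel.json"]
)
print(args.output_dir, unknown)
```

With HfArgumentParser itself, passing return_remaining_strings=True to parse_args_into_dataclasses() should have the same effect of tolerating the extra arguments.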
You might try migrating to torchrun? i.e.:
torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
for reference on migrating: https://pytorch.org/docs/stable/elastic/run.html
Thanks, it works for me.
Can it run on Colab? I can't get it to work there.
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--only_optimize_lora']
I can run the following command in CMD without issues:
python run_show.py --output_dir output20241021 --model_name_or_path show_model/model001 --train_type use_lora --data_path data/AS_2022_train+test --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --num_train_epochs 5
However, when I try to debug in the VSCODE IDE, I encounter the following error:
ValueError: Some specified arguments are not used by the HfArgumentParser: ['model_name_or_path', 'show_model/model001', 'train_type', 'use_lora', 'data_path', 'data/AS_2022_train+test', 'per_device_train_batch_size', '1', 'per_device_eval_batch_size', '1', 'num_train_epochs', '5']
My JSON settings are as follows:
{
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "Python Debugger: Current File",
"type": "debugpy",
"request": "launch",
"program": "${file}",
"console": "integratedTerminal",
"justMyCode": false,
"args": [
"--output_dir", "output20241021",
"model_name_or_path", "show_model/model001",
"train_type", "use_lora",
"data_path", "data/AS_2022_train+test",
"per_device_train_batch_size", "1",
"per_device_eval_batch_size", "1",
"num_train_epochs", "5"
]
}
]
}
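It looks like every option name in the args list except --output_dir is missing its leading --, which is why HfArgumentParser reports them as unused strings. A corrected args block, matching the working CMD invocation, would be:

```json
"args": [
    "--output_dir", "output20241021",
    "--model_name_or_path", "show_model/model001",
    "--train_type", "use_lora",
    "--data_path", "data/AS_2022_train+test",
    "--per_device_train_batch_size", "1",
    "--per_device_eval_batch_size", "1",
    "--num_train_epochs", "5"
]
```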
System Info
transformers 4.7, pytorch 2.0, python 3.9
Run the example code in the transformers documentation.
error info
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. Install the following configuration environment: python 3.9, pytorch 2.1 dev, transformers 4.7
Expected behavior
1. Install the following configuration environment: python 3.9, pytorch 2.1 dev, transformers 4.7