Closed: nathan-az closed this 11 months ago.
Closing this, although frankly I have no idea what I did to fix it; I think the problem was on the transformers/accelerate side.
Hi Nathan, may I know how you fixed this issue on your side eventually? Got the same bug here... thanks!
Hey @ziyi-yang, unfortunately I never worked out what the issue was. I swapped to using `accelerate` to launch my training job, while using `deepspeed` with the `pdsh` launcher, which should work very similarly (if not identically).

It's possible that I was incorrectly launching the job, and `accelerate` abstracted away something I was doing wrong. If you're interested in trying to use `accelerate` to launch the job, the relevant config snippet that I'm using is:
```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: pdsh
  deepspeed_hostfile: {ds_hostfile_path}
  deepspeed_config_file: {ds_config_path}
  zero3_init_flag: true
distributed_type: DEEPSPEED
```
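With that saved to a file, the launch itself is just the standard `accelerate` entry point; something like the following, where the config path, script name, and arguments are placeholders for whatever you use:

```sh
# Run once on the main node; the pdsh launcher fans the job out over SSH
# to the hosts listed in the hostfile.
accelerate launch --config_file accelerate_config.yaml scripts/run_sft.py <script_args>
```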
Note that in my case since I'm using PDSH, I only run this command on the main process. Let me know if you have any specific questions about my setup, and sorry I can't be of more help!
Not at all, really appreciate your quick and detailed reply.
Hi, I also encountered this batch number issue when using DeepSpeed ZeRO-3 to launch the job, while ZeRO-2 works fine. Could you please share the config file and the launch script using accelerate? Thanks in advance!
Hey @Elenore1997, I don't have my versions of those files anymore.
However the following links may be useful for you, especially if you're interested in training qlora with limited hardware: deepspeed, fsdp.
Hope those are helpful - they both provide example scripts with accelerate configs. If you need a more customised deepspeed setup you can also choose to link a deepspeed json config in your accelerate yaml, rather than using the limited options available directly on the accelerate yaml.
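For illustration only (this is not my original file), a minimal ZeRO-3 JSON that defers the batch-size fields to the Trainer via "auto" values looks something like:

```json
{
  "zero_optimization": { "stage": 3 },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto"
}
```

The Trainer fills in the "auto" values from its own arguments, which avoids the kind of batch-size conflict this issue is about.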
Note that I saw slightly lower memory usage using FSDP rather than deepspeed, and was able to finetune a 70B qlora on 4x 24GB GPUs without any CPU offloading (although surprisingly, enabling CPU offloading yielded faster training).
Sorry I can't provide my specific examples, but let me know if you have further questions.
That's nice of you! Thanks for your quick reply; I will try the links you provided.
I am using the deepspeed launcher with the HuggingFace Trainer in my script. My script makes zero mention of deepspeed or accelerate. My understanding was that the Trainer takes care of this.
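For context, the script follows the plain Trainer pattern; a stripped-down sketch (model and dataset wiring omitted, names illustrative) of how the deepspeed config reaches the Trainer:

```python
from transformers import HfArgumentParser, TrainingArguments

# The script never imports deepspeed or accelerate directly:
# --deepspeed=<path/to/ds_config.json> is consumed by TrainingArguments,
# and the Trainer wires DeepSpeed up from there.
parser = HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()

# The global batch size the transformers integration hands to DeepSpeed is
# per-device batch size * gradient accumulation steps * args.world_size.
print(
    training_args.world_size,
    training_args.per_device_train_batch_size
    * training_args.gradient_accumulation_steps
    * training_args.world_size,
)

# ...build the model and dataset, then pass training_args into Trainer(...) as usual.
```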
I am running my job using `deepspeed --hostfile ... scripts/run_sft.py <script_args>`. I'm running on a 3 node cluster with passwordless SSH, each with 4 GPUs. The deepspeed config is being passed via `--deepspeed=`, and the `HfArgumentParser` is used to process the training args, passing the deepspeed config into the trainer.

The gist of my issue is this:
Note how the batch size set by `transformers` is 192 == 8 (per-device batch size) * 2 (gradient accumulation) * 12 (`args.world_size`). But the `deepspeed.runtime.config` object has `world_size` 1.

It seems that `train_batch_size` is set by the `transformers` integration using `args.world_size`. This is causing the deepspeed assertion to fail, presumably because with a world size of 1 it expects 8 * 2 * 1 == 16 rather than 192.

After some painstaking debugging I think that my world size inside `deepspeed.runtime.config` is being set in the except block in `deepspeed`, because the `dist` variable is `None`, so the world_size is being set to 1, causing the mismatch. I believe that maybe the `cdb` isn't set yet? I believe it's `NoneType`, which is what's triggering the exception. I don't know if this is correct and world size should stay as 1, or if the problem is in `transformers` and `args.world_size` (set by the deepspeed launcher, I believe) should not be respected in this context.

My deepspeed config does not specify any batch sizes, letting the Trainer handle these to avoid conflict:
`ds_report` output:

System info (please complete the following information):
- g5.12xlarge instances (3 nodes, 4 A10 GPUs each)
- Passwordless SSH set up via `authorized_keys` (they do not all have each other's keys though, I do not know if that is critical)

Launcher context: deepspeed launcher with hostfile

The launcher appears to be launching and distributing successfully: