Pier297 opened this issue 1 year ago
Hi,
Thanks for following up. Yeah, the engineering specifics are perhaps somewhat hairy, so I'll mostly comment on the high-level ideas for now.
If your goal is to fine-tune a large model that's still small enough to fit on a single GPU, I think plain data parallelism is sufficient. FWIW, it's also simpler and avoids the complexity of model/pipeline/fully-sharded parallelism.
If your model can't fit on a single GPU, then you'd strictly need some of the features in DeepSpeed. But as an alternative to DeepSpeed, I think FSDP might be a better option. I've personally given it a try with DP, and it seems workable.
FSDP essentially shards the optimizer state and the model weights, so you'll be able to optimize models that can't fit on a single accelerator. The central ideas of FSDP and DeepSpeed are pretty much the same.
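To make the data-parallel suggestion concrete, here's a minimal sketch of what I mean, using standard PyTorch `DistributedDataParallel` (plain `nn.DataParallel` would look similar). The tiny `Linear` model is just a stand-in for your own, and I'm leaving the Opacus wiring out, since that's exactly the part under discussion:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU, launched e.g. with torchrun.
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()

# Any model that fits on a single GPU; a small stand-in here.
model = torch.nn.Linear(512, 512).to(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
# Opacus would then attach to this optimizer, as in the snippet further down.
```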
Hi and thank you again for your help!
I tried FSDP since my model doesn't fit on a single GPU, but I'm not sure how to proceed: when I call privacy_engine.attach(optimizer)
I get the following error:
```
ValueError: Model type <class 'torch.distributed.fsdp.fully_sharded_data_parallel.FullyShardedDataParallel'> is not supported
```
which is more or less the same error I got when I tried Opacus with DeepSpeed.
In case it helps, the model I'm testing with is GPT-2, and my code is more or less this:
```python
import functools
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
from opacus import PrivacyEngine

# Shard submodules with at least 20k parameters and offload parameters to CPU.
my_auto_wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=20000)
model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy, cpu_offload=CPUOffload(offload_params=True))
privacy_engine = PrivacyEngine(model, ...)  # other arguments elided
privacy_engine.attach(optimizer)  # <- raises the ValueError above
```
Sorry for taking up your time, and thanks a lot for any help you can give me :)
Hi all,
I wanted to try to add support for multi-GPU training to allow fine-tuning of LLMs. I already opened an issue a few weeks ago; thanks a lot for the fast response :)
I was trying to understand how we could use DeepSpeed (to get ZeRO stage 3), and I saw that in your library the gradient is computed here by calling `backward`. In DeepSpeed the backward pass is handled by the DeepSpeedEngine here, but if I'm not mistaken it's no different from calling `backward` as you do; what changes is the model parameter update done by the `step` function. More or less, my idea would be (a rough sketch follows below):

1. Call `privacy_engine.virtual_step(micro_batch_loss)`; this would then call `_accumulate_summed_grad`, which computes the gradient for that micro-batch.
2. Call `optimizer.step` as usual.

I'm not really an expert with DeepSpeed, so I don't know whether this would be the correct solution; any suggestion you could give would be much appreciated :)
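To make the idea concrete, here's a rough sketch of the loop I have in mind. `virtual_step(loss)` with this signature is only my proposal, not an existing Opacus API; `split_micro_batches` is a hypothetical helper; and `model`, `optimizer`, `dataloader`, and `privacy_engine` are assumed to be set up as in the snippets above:

```python
# Sketch only: virtual_step(loss) with this signature is my proposal,
# not an existing Opacus API.
for batch in dataloader:
    optimizer.zero_grad()
    for micro_batch in split_micro_batches(batch):  # hypothetical helper for micro-batching
        micro_batch_loss = model(**micro_batch).loss  # assumes the HF GPT-2 forward returns a loss
        # Proposed: internally call _accumulate_summed_grad, which computes and
        # accumulates the gradient for this micro-batch.
        privacy_engine.virtual_step(micro_batch_loss)
    optimizer.step()  # parameter update as usual
```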
If you prefer, you can contact me via email at p.tarasco@studenti.unipi.it. Thanks a lot!