lxuechen / private-transformers

A codebase that makes differentially private training of transformers easy.
https://arxiv.org/abs/2110.05679
Apache License 2.0

Support for multi-gpu private fine-tuning #32

Open Pier297 opened 1 year ago

Pier297 commented 1 year ago

Hi all,

I wanted to try to add support for multi-GPU training to allow fine-tuning of LLMs. I already opened an issue a few weeks ago, and thanks a lot for the fast response :)

I was trying to understand how we could use DeepSpeed (to use ZeRO stage 3), and I saw that in your library the gradient is computed here by calling backward.

In DeepSpeed the backward pass is handled by the DeepSpeedEngine here, but if I'm not mistaken it's no different from calling backward as you do; what changes is the model parameter update done by the step function.
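Concretely, the usual DeepSpeed loop looks roughly like this (just a sketch: the config file name is a placeholder and I'm assuming the model returns a loss):

    # Sketch of a standard DeepSpeed training loop, for contrast with plain
    # loss.backward(); "ds_config.json" and the loss-returning model are assumptions.
    import deepspeed

    def train_with_deepspeed(model, dataloader):
        # deepspeed.initialize wraps the model in a DeepSpeedEngine (the ZeRO
        # stage is set in the JSON config) and returns a matching optimizer.
        model_engine, optimizer, _, _ = deepspeed.initialize(
            model=model,
            model_parameters=model.parameters(),
            config="ds_config.json",
        )
        for batch in dataloader:
            loss = model_engine(**batch).loss  # forward on this rank's micro-batch
            model_engine.backward(loss)        # backward, much like loss.backward()
            model_engine.step()                # the actual (sharded) parameter update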

More or less my idea would be the following (a rough code sketch follows the list):

  1. As in data parallelism, each GPU computes the loss on its own micro-batch (would this also be combined with model parallelism by DeepSpeed ZeRO 3? I haven't fully understood this part)
  2. Each GPU then calls privacy_engine.virtual_step(micro_batch_loss), which in turn calls _accumulate_summed_grad to compute the gradient for that micro-batch
  3. We now have to synchronize the gradients by summing them across the GPUs (note: this has to sum param.summed_grad)
  4. We can then call privacy_engine._create_noisy_clipped_gradient() to privatize the gradient
  5. Perform optimizer.step as usual
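
To make the idea concrete, here is a rough sketch of steps 2-5 (compute_loss is a hypothetical helper, and I'm going by the names above, so the real signatures may differ):

    # Rough sketch of the proposed flow; virtual_step, param.summed_grad and
    # _create_noisy_clipped_gradient are used as described above (the real
    # signatures may differ), and compute_loss is a hypothetical helper.
    import torch.distributed as dist

    def private_distributed_step(model, optimizer, privacy_engine, micro_batches):
        # Steps 1-2: each rank accumulates summed (clipped) gradients locally.
        for micro_batch in micro_batches:
            micro_batch_loss = compute_loss(model, micro_batch)
            privacy_engine.virtual_step(micro_batch_loss)

        # Step 3: synchronize by summing the accumulated gradients across GPUs.
        for param in model.parameters():
            if getattr(param, "summed_grad", None) is not None:
                dist.all_reduce(param.summed_grad, op=dist.ReduceOp.SUM)

        # Step 4: privatize; the noise would need to be identical on every rank
        # (e.g. via a shared seed), otherwise the replicas drift apart.
        privacy_engine._create_noisy_clipped_gradient()

        # Step 5: ordinary parameter update.
        optimizer.step()
        optimizer.zero_grad()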

I'm not really an expert with DeepSpeed, so I don't know if this would be the correct solution; any suggestion you could give would be much appreciated :)

If you prefer, you can contact me via email at p.tarasco@studenti.unipi.it. Thanks a lot!

lxuechen commented 1 year ago

Hi,

Thanks for following up. Yeah, the engineering specifics are perhaps somewhat hairy, so I'll mostly comment on the high-level ideas for now.

If your goal is to fine-tune a large model that's still small enough to fit on a single GPU, I think plain data parallelism is sufficient. FWIW, it's also simpler and doesn't require dealing with the complexity of model/pipeline/FS parallelism.

If your model can't fit on a single GPU, then you'd strictly need some of the features in DeepSpeed. But as an alternative to DeepSpeed, I think FSDP might be a better option. I've personally given it a try with DP, and it seems workable.

FSDP essentially enables optimizer and weight sharding, so you'll be able to optimize models that can't fit on a single accelerator. The central ideas of FSDP and DeepSpeed are pretty much the same.
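
For reference, the vanilla FSDP wrapping looks roughly like this (standard torch.distributed.fsdp usage, nothing specific to this codebase; GPT-2 is only an example and a torchrun-style launch is assumed):

    # Vanilla PyTorch FSDP wrapping (no privacy engine involved); GPT-2 is only
    # an example model, and this assumes a torchrun-style launch.
    import os

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from transformers import GPT2LMHeadModel

    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = GPT2LMHeadModel.from_pretrained("gpt2").cuda()
    model = FSDP(model)  # shards parameters, gradients, and optimizer state across ranks
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)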

Pier297 commented 1 year ago

Hi and thank you again for your help!

I tried FSDP since my model doesn't fit on a single GPU, but I'm not sure how to proceed because when I call privacy_engine.attach(optimizer) I get the following error:

ValueError: Model type <class 'torch.distributed.fsdp.fully_sharded_data_parallel.FullyShardedDataParallel'> is not supported

which is also, more or less, the error I got when I tried Opacus with DeepSpeed.

If it helps, the model I'm testing with is GPT-2 and my code is more or less this:

    import functools
    from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
    from private_transformers import PrivacyEngine

    my_auto_wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=20000)
    model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy, cpu_offload=CPUOffload(offload_params=True))
    privacy_engine = PrivacyEngine(model, ...)
    privacy_engine.attach(optimizer)

Sorry for taking up your time, and thanks a lot for any help you can give me :)