huseyinatahaninan / Differentially-Private-Fine-tuning-of-Language-Models


A question about the implementation #1

Closed: Chun-wei-Ho closed this issue 2 years ago

Chun-wei-Ho commented 2 years ago

Hello, I have a question about the implementation of DPSGD.

In the following code, it looks like $N$ Gaussian random variables are added to `p.grad`, where $N = \text{batch size}$. I'm wondering why the standard deviation is set to `sigma/batch_size` rather than `sigma/np.sqrt(batch_size)`.

To be more specific, if $X_1, X_2, \cdots, X_N \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, then $$\frac{1}{N}\sum_{i=1}^N X_i \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{N}\right)$$

Therefore, the standard deviation here should be $\sigma/\sqrt{N}$. I'm not sure whether I'm missing some implementation detail. Could you check it for me? Thanks.

https://github.com/huseyinatahaninan/Differentially-Private-Fine-tuning-of-Language-Models/blob/a9f7259436c824cb613a66b3a441f4410c036ed2/Language-Understanding-RoBERTa/bert_lora/fairseq/trainer.py#L396-L404
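Just to make my reasoning concrete, here is a quick numerical check of the averaging claim above (a standalone snippet using numpy, not code from this repository):

```python
import numpy as np

sigma, N = 2.0, 512
# 100k batches of N i.i.d. N(0, sigma^2) samples.
samples = np.random.normal(0.0, sigma, size=(100_000, N))
print(samples.mean(axis=1).std())  # empirical std of the per-batch mean
print(sigma / np.sqrt(N))          # theoretical value sigma / sqrt(N) ~ 0.0884
```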

dayu11 commented 2 years ago

Thanks for the question. DPSGD adds a single Gaussian noise sample to the sum of the clipped per-example gradients, not a fresh sample to each individual gradient. See Algorithm 1 in https://arxiv.org/pdf/1607.00133.pdf for more details.
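In code, that step looks roughly like this (a minimal sketch of Algorithm 1's noise step, not the exact code in this repository; the names `clipped_grads`, `C`, `sigma`, and `batch_size` are only illustrative):

```python
import torch

def dp_noisy_mean(clipped_grads, C, sigma, batch_size):
    # Sum of per-example gradients that were already clipped to norm <= C.
    summed = torch.stack(clipped_grads).sum(dim=0)
    # One fresh Gaussian draw for the whole sum, with std sigma * C.
    noise = torch.normal(0.0, sigma * C, size=summed.shape)
    # Dividing by the batch size L makes the per-coordinate noise std sigma * C / L.
    return (summed + noise) / batch_size
```

Because only one noise sample is added before dividing by the batch size, the noise std in the averaged gradient is `sigma * C / batch_size`, which is why the code divides by `batch_size` rather than `sqrt(batch_size)`.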

Chun-wei-Ho commented 2 years ago

Thank you for the kind response. I had misread the formula $$\tilde{g}_t \leftarrow \frac{1}{L} \left( \sum_i \bar{g}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 \mathbf{I}) \right)$$ as $$\tilde{g}_t \leftarrow \frac{1}{L} \sum_i \left( \bar{g}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 \mathbf{I}) \right)$$
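For anyone else who stumbles on this: under the first (correct) formula a single noise draw is divided by $L$, so each coordinate of the noise term has standard deviation $$\frac{\sigma C}{L},$$ whereas under the second (misread) formula the average of $L$ independent draws has standard deviation $$\frac{\sigma C}{\sqrt{L}},$$ which is exactly the $\sqrt{L}$ factor I was confused about.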

It was also really helpful of you to point me to the Opacus code on GitHub. Thank you again for helping me out with this; it would have taken me much longer without the hint.

dayu11 commented 2 years ago

It's my pleasure. Thanks again for your question.