Closed Chun-wei-Ho closed 2 years ago
Thanks for the question. DPSGD adds a single draw of Gaussian noise to the sum of per-example gradients, not a fresh sample for each individual gradient. See Algorithm 1 in https://arxiv.org/pdf/1607.00133.pdf for more details.
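For concreteness, here is a minimal NumPy sketch of that step as described in Algorithm 1 (illustrative only, not the Opacus or fairseq implementation; all names are made up):

```python
import numpy as np

def dpsgd_noisy_gradient(per_example_grads, clip_norm_C, sigma, rng):
    """One noisy-gradient step following Algorithm 1 of Abadi et al. (2016).

    per_example_grads: array of shape (L, d), one gradient per example in the lot.
    Returns the clipped, noised, averaged gradient of shape (d,).
    """
    L, d = per_example_grads.shape

    # Clip each per-example gradient to L2 norm at most C.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm_C)

    # Add a SINGLE Gaussian noise vector to the SUM of clipped gradients...
    noise = rng.normal(0.0, sigma * clip_norm_C, size=d)
    noisy_sum = clipped.sum(axis=0) + noise

    # ...and divide by the lot size L once at the end.
    return noisy_sum / L
```

Because only one noise vector is drawn per lot, the noise carried by the averaged gradient has standard deviation $\sigma C / L$.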
Thank you for the kind response. I had misread the formula $$\tilde{g}_t \leftarrow \frac{1}{L} \left( \sum_i \bar{g}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 \mathbf{I}) \right)$$ as $$\tilde{g}_t \leftarrow \frac{1}{L} \sum_i \left( \bar{g}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 \mathbf{I}) \right)$$
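The distinction matters for the noise scale in the averaged gradient: a single draw per lot gives standard deviation $\sigma C / L$, while a fresh draw per example would give $\sigma C / \sqrt{L}$. A quick simulation of the two readings (hypothetical values, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, sigma, C = 256, 100, 1.0, 1.0   # lot size, gradient dim, noise multiplier, clip norm
trials = 200

# Reading 1: one noise vector added to the SUM of clipped gradients, then divide by L.
single_draw = rng.normal(0.0, sigma * C, size=(trials, d)) / L

# Reading 2 (the misreading): a fresh noise vector for EACH example, then average over L.
per_example = rng.normal(0.0, sigma * C, size=(trials, L, d)).mean(axis=1)

print(single_draw.std())   # ~ sigma * C / L        ≈ 0.0039
print(per_example.std())   # ~ sigma * C / sqrt(L)  ≈ 0.0625
```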
It was really helpful of you to point me to the Opacus code on GitHub. Thank you again for helping me out with this; it would have taken me much more time if you hadn't given me the hint.
It's my pleasure. Thanks again for your question.
Hello, I have a question about the implementation of DPSGD.
In the following code, you add $N$ Gaussian random variables to `p.grad`, where $N = \text{batch size}$. But I'm wondering why the standard deviation is set to `sigma/batch_size` rather than `sigma/np.sqrt(batch_size)`.
To be more specific, let $X_1, X_2, \cdots, X_N \sim \text{i.i.d. } \mathcal{N}(0, \sigma^2)$; it's been proven that $$\frac{1}{N}\sum_{i=1}^N X_i \sim \mathcal{N}\left(0, \frac{\sigma^2}{N}\right)$$
Therefore, the standard deviation here should be $\sigma/\sqrt{N}$. I'm not sure if I missed some implementation detail. Can you check it out for me? Thanks.
https://github.com/huseyinatahaninan/Differentially-Private-Fine-tuning-of-Language-Models/blob/a9f7259436c824cb613a66b3a441f4410c036ed2/Language-Understanding-RoBERTa/bert_lora/fairseq/trainer.py#L396-L404
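For reference, the pattern the question describes looks roughly like the hedged sketch below (not the actual `trainer.py` code; `sigma` here plays the role of $\sigma C$ in Algorithm 1). Adding one noise draw scaled by `sigma / batch_size` to each already-averaged gradient is the same as adding a draw scaled by `sigma` to the sum of per-example gradients:

```python
import torch

def add_dp_noise_(params, sigma, batch_size):
    """Add one Gaussian draw per parameter tensor to gradients that have
    already been clipped, summed over the batch, and divided by batch_size."""
    for p in params:
        if p.grad is None:
            continue
        # std = sigma / batch_size on the averaged gradient
        # <=> std = sigma on the summed gradient (one draw per lot, not per example).
        p.grad.add_(torch.randn_like(p.grad) * (sigma / batch_size))
```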