wangyanmeng / FedTAN

PyTorch implementation of FedTAN (a federated learning algorithm tailored for batch normalization) proposed in the paper, Why Batch Normalization Damage Federated Learning on Non-IID Data.
MIT License

Question about $\Delta \mathcal{S}_{\mathcal{D}}$ in the paper. #1

Open ybdai7 opened 1 year ago

ybdai7 commented 1 year ago

Hi. I recently found your paper on arXiv and I am very interested in your solid work. However, there is one question that keeps confusing me, and I would be very grateful if you could explain it.

It is about $\Delta \mathcal{S}_{\mathcal{D}}$ in the paper. In the paper, you say that $\mathcal{S}$ contains the running statistics (batch mean/var) of the BN layer. But HOW can you compute the derivatives with respect to the running mean/var? They are not learnable parameters in a standard BN layer, and I don't think $\nabla_{\mathbf{w}} F$ depends on the running mean/var, because, again, these two quantities are not learnable; they are only updated during the forward propagation stage. Do you modify the standard BN layer to make them learnable? Otherwise, I don't see the reason to include the discussion of $\Delta \mathcal{S}_{\mathcal{D}}$.

I would very much appreciate it if you could reply to my question.

wangyanmeng commented 4 months ago

According to the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," although the gradients of the batch mean and batch variance are not used to update those statistics directly, they are essential for computing the gradient of the loss with respect to the input of the batch normalization layer. This, in turn, influences the gradients of the other network parameters through backpropagation.
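To make this concrete, here is a minimal sketch (not the repository's code) of the BN backward pass over an `(N, C)` input, following the backpropagation equations in Ioffe & Szegedy (2015). The function name `bn_backward` and its signature are illustrative; the point is that `grad_mu` and `grad_var` appear as intermediate quantities that feed into the gradient with respect to the layer input.

```python
import torch

def bn_backward(x, grad_y, gamma, eps=1e-5):
    """Manual BN backward pass over the batch dimension (illustrative sketch).

    x:      (N, C) input of the BN layer
    grad_y: (N, C) gradient of the loss w.r.t. the BN output
    gamma:  (C,) scale parameter
    Returns gradients w.r.t. x, gamma, beta.
    """
    N = x.shape[0]
    mu = x.mean(dim=0)                          # batch mean
    var = x.var(dim=0, unbiased=False)          # batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)

    grad_xhat = grad_y * gamma
    # Gradients w.r.t. the batch statistics -- not used to update them,
    # but needed as intermediate terms in the chain rule:
    grad_var = (grad_xhat * (x - mu)).sum(dim=0) * -0.5 * (var + eps) ** (-1.5)
    grad_mu = (grad_xhat * (-1.0 / torch.sqrt(var + eps))).sum(dim=0) \
              + grad_var * (-2.0 * (x - mu)).mean(dim=0)
    # They enter the gradient w.r.t. the layer input, which then propagates
    # to the gradients of all upstream parameters:
    grad_x = grad_xhat / torch.sqrt(var + eps) \
             + grad_var * 2.0 * (x - mu) / N \
             + grad_mu / N
    grad_gamma = (grad_y * x_hat).sum(dim=0)
    grad_beta = grad_y.sum(dim=0)
    return grad_x, grad_gamma, grad_beta
```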

The running averages of the batch mean and variance are computed for inference purposes, while the gradients of the batch mean and batch variance are used when computing the model gradients.
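To illustrate this distinction with the standard PyTorch API (a small check, not specific to FedTAN): the learnable parameters of a BN layer are `weight` (gamma) and `bias` (beta), while the running statistics are registered buffers that are updated during the forward pass in training mode and used only at inference time.

```python
import torch.nn as nn

bn = nn.BatchNorm1d(8)
# Learnable parameters updated by the optimizer:
print([name for name, _ in bn.named_parameters()])
# ['weight', 'bias']
# Running statistics are buffers, not parameters:
print([name for name, _ in bn.named_buffers()])
# ['running_mean', 'running_var', 'num_batches_tracked']
print(bn.running_mean.requires_grad)
# False
```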