vsitzmann / siren

Official implementation of "Implicit Neural Representations with Periodic Activation Functions"
MIT License

Question about first layer #49

Open Luciennnnnnn opened 2 years ago

Luciennnnnnn commented 2 years ago

Hi, I have encountered a problem in my reading. As you state in Appendix 1, for an input drawn uniformly at random from the interval [-1, 1], pushing this input through a sine nonlinearity yields an arcsine distribution, which is what the dot products in the later layers receive as input. According to this, a correct SIREN should be composed as sin->linear->sin->linear->...->linear. However, what you actually do in your code is linear->sin->linear->sin->...->linear, so my question is why you have chosen this implementation, since the first one follows the distribution assumption correctly and the second one does not.

Please tell me the right answer or point out my mistakes, thanks!

Best regards, Xin Luo.

Luciennnnnnn commented 2 years ago

Waiting for answers

Luciennnnnnn commented 2 years ago

And why does the last linear layer of SIREN also divide by omega in its initialization, without multiplying by omega in the forward calculation?

pielbia commented 2 years ago

Regarding your first question about the architecture: the linear module performs the operation Wx+b, which is the argument of the following sine activation sin(Wx+b); hence the sin module follows the linear module. According to the paper, with a uniformly distributed input x in the interval [-1, 1], you should get a normally distributed output from the linear module and an arcsine-distributed output from the sine activation layer. You should be able to find the details in the appendix of the paper.
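For concreteness, here is a minimal sketch of that layer ordering, assuming a PyTorch setup similar to (but not copied verbatim from) the repo; the class and parameter names are my own:

```python
import torch
import torch.nn as nn

# Sketch of the "linear -> sin" composition being described: the linear module
# computes Wx + b, and the following sine module applies sin(omega_0 * (Wx + b)),
# so the two modules together implement one SIREN layer.
class SineLayer(nn.Module):
    def __init__(self, in_features, out_features, omega_0=30.0):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))
```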

Luciennnnnnn commented 2 years ago

@pielbia Thank you for your reply. You said "with a uniformly distributed input x in the interval [-1, 1] you should get a normally distributed output from the linear module", but in Appendix 1.1, "Overview of the proof", the authors state: "The second layer computes a linear combination of such arcsine distributed outputs, ..., this linear combination will be normal distributed". That analysis is based on the input being arcsine distributed, not uniformly distributed.

pielbia commented 2 years ago

@LuoXin-s The input to the second layer is arcsine distributed because it comes out of the sine module of the first layer. I was talking about the input to the first layer, i.e. the input to the network. Consider that each layer is a combination of a linear module followed by a sine module. From the overview of the proof you cited: "we consider an input in the interval [−1, 1]. We assume it is drawn uniformly at random, since we interpret it as a “normalized coordinate" in our applications"; this is the input I was referring to, the input to the network/first layer. After a linear combination of this input with the weights and biases, it is pushed through the sine nonlinearity. The output of the sine is arcsine distributed and provides the input to the second layer you were referring to.

Luciennnnnnn commented 2 years ago

@pielbia My point is that what the authors prove in the paper is that a linear combination of arcsine-distributed inputs is normally distributed with a corresponding variance, that this normal distribution becomes arcsine again after the sine, and that this repeats recursively through the following layers. In the first layer the difference is that the input is uniformly distributed, not arcsine distributed, so after the linear combination it may still be normally distributed by the central limit theorem, but the variance need not be the same as for a linear combination of arcsine-distributed inputs. So if in the first layer we also initialize according to the analysis that assumes an arcsine-distributed input, this may be wrong.
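A quick numerical check of the variance gap being argued here (illustrative only; it assumes the arcsine distribution on [-1, 1] is generated as the sine of a wide uniform angle):

```python
import math
import torch

# Var(Uniform(-1, 1)) is 1/3, while the variance of an arcsine-distributed
# value on [-1, 1] (here: sin of a uniform angle on [-pi, pi]) is 1/2, so the
# two kinds of first-layer input do give linear combinations with different
# variances.
n = 1_000_000
uniform = torch.empty(n).uniform_(-1.0, 1.0)
arcsine = torch.sin(torch.empty(n).uniform_(-math.pi, math.pi))

print(uniform.var().item())  # ~0.333
print(arcsine.var().item())  # ~0.5
```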

pielbia commented 2 years ago

@LuoXin-s I see your point. The distributions after the linear combination are Gaussian in both cases, but you're right that they have different variances. Theorem 1.8 states that the input is uniform on [-1, 1], so it looks to me like it was taken into account that the input to the network follows a uniform distribution rather than an arcsine one. And the two distributions, if defined over the same interval, have the same mean and variances differing by a constant factor (Lemmas 1.3 and 1.7). The initialization of the first layer is different from the initialization of the other layers: the weights of the first layer also follow a uniform distribution, but over a larger interval. What I don't get is where the interval used for the first layer's weights comes from. The weights of the other layers seem to follow the initialization scheme from the paper. Also, to answer your other question, the weights are multiplied by omega in the forward method of the sin module in module.py.
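For reference, a rough sketch of the two initialization cases as I read them from the code (this is my own reading, labeled as an assumption, not an authoritative copy):

```python
import math
import torch
import torch.nn as nn

# Assumed scheme: the first layer uses a wider uniform interval, hidden layers
# use the paper's sqrt(6 / fan_in) / omega_0 bound, and omega_0 is multiplied
# back in inside the forward pass of the sine module rather than in the weights.
def init_siren_linear(linear: nn.Linear, is_first: bool, omega_0: float = 30.0) -> None:
    fan_in = linear.in_features
    bound = 1.0 / fan_in if is_first else math.sqrt(6.0 / fan_in) / omega_0
    with torch.no_grad():
        linear.weight.uniform_(-bound, bound)
```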

Luciennnnnnn commented 2 years ago

@pielbia I see; since the mean and variance of the arcsine and uniform distributions only differ by a constant factor, the outputs after the linear combination may too. This seems intuitive to me, but regarding omega the authors argue from a different viewpoint in Section 3.2. There are some subtle problems here, and it is strange. For the second question, the code I referenced is the Colab version; you can check there that the last linear layer of SIREN also divides by omega in its initialization without multiplying by omega in the calculation.
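A sketch of how I read that final-layer detail (again an assumption on my part, with made-up sizes, not a quote from the Colab):

```python
import math
import torch
import torch.nn as nn

# Assumed behaviour: the last linear layer's weights are drawn with the same
# 1/omega_0-scaled bound as the hidden layers, but since no sine follows it,
# nothing multiplies its output by omega_0 in the forward pass.
hidden_features, hidden_omega_0 = 256, 30.0
final_linear = nn.Linear(hidden_features, 1)
with torch.no_grad():
    bound = math.sqrt(6.0 / hidden_features) / hidden_omega_0
    final_linear.weight.uniform_(-bound, bound)
```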

ivanstepanovftw commented 6 months ago

@Luciennnnnnn I have tried to reproduce the authors' activation distributions after the dot product and the nonlinearity. This is what I got (see attached image).
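For anyone who wants to repeat this, one way to generate such plots (a sketch with assumed layer sizes and omega, not the exact script used above):

```python
import math
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Push a uniform input through stacked linear -> sin layers and histogram the
# activations after the nonlinearity at each depth.
torch.manual_seed(0)
omega_0, fan_in, depth = 30.0, 256, 4

x = torch.empty(10_000, fan_in).uniform_(-1.0, 1.0)
layers = [nn.Linear(fan_in, fan_in) for _ in range(depth)]
for i, layer in enumerate(layers):
    bound = 1.0 / fan_in if i == 0 else math.sqrt(6.0 / fan_in) / omega_0
    with torch.no_grad():
        layer.weight.uniform_(-bound, bound)

h = x
for i, layer in enumerate(layers):
    pre = omega_0 * layer(h)   # activation after the dot product
    h = torch.sin(pre)         # activation after the nonlinearity
    plt.hist(h.flatten().detach().numpy(), bins=100, histtype="step", label=f"layer {i}")
plt.legend()
plt.show()
```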

ivanstepanovftw commented 4 months ago

@vsitzmann, hello! Could you answer the questions about the inconsistencies between your paper and the implementation? You know your work better than anyone.