Closed by alanjeffares 3 years ago
I don't believe smoothing was applied (for example, there is a visible gap in the red LSTM line). My impression is that I simply evaluated the test set very frequently, likely more than once within each epoch.
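A minimal sketch of what "evaluating more than once per epoch" means for the x-axis: if the test set is scored every `eval_interval` training batches, each epoch contributes several test-accuracy points, so the x-axis counts iterations rather than epochs. The function name and parameters here are hypothetical, not from the repository.

```python
def eval_points_per_run(num_epochs, num_batches, eval_interval=100):
    """Return the iteration indices at which the test set would be scored,
    assuming one evaluation every `eval_interval` training batches."""
    points = []
    step = 0
    for _ in range(num_epochs):
        for _ in range(num_batches):
            # ... one training step would go here ...
            step += 1
            if step % eval_interval == 0:
                points.append(step)  # record (iteration, test accuracy)
    return points
```

With, say, 500 batches per epoch and `eval_interval=100`, a single epoch already yields five test-accuracy points, which would explain x-values denser than one per epoch.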
Thanks for the response.
That is surprising, as when I tried to reproduce this I obtain a comparable final result but the learning curve seems much steadier. The plot below contains 10 runs (seeds 1:10) with the parameters reported in the paper (`python pmnist_test.py --seed=1 --epochs=10 --dropout=0.0 --lr=0.001 --optim='RMSprop' --nhid=130 --clip=1`). The mean is plotted in dark blue.
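For clarity, the dark-blue line is just the per-evaluation-point average across the seeded runs. A sketch of that aggregation, with dummy data standing in for the real logged accuracies:

```python
import numpy as np

# Rows are runs (one per seed), columns are evaluation points.
# These values are dummy placeholders, not real results.
curves = np.array([[0.2, 0.6, 0.9],
                   [0.4, 0.8, 1.0]])
mean_curve = curves.mean(axis=0)  # average across seeds at each point
```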
And here is your plot again for reference:
I generated this by swapping out the TCN model with the following code:
```python
import torch.nn as nn
import torch.nn.functional as F


class LSTM(nn.Module):
    def __init__(self, input_size, output_size, num_channels, kernel_size, dropout):
        super(LSTM, self).__init__()
        self.nhid = num_channels[-1]
        self.lstm = nn.LSTM(input_size, num_channels[-1], batch_first=True, dropout=dropout)
        # initialise forget gate bias to 1
        # bias_ih contains (b_ii|b_if|b_ig|b_io)
        # bias_hh contains (b_hi|b_hf|b_hg|b_ho)
        self.lstm._parameters['bias_ih_l0'].data[self.nhid:self.nhid*2] = 1
        self.lstm._parameters['bias_hh_l0'].data[self.nhid:self.nhid*2] = 1
        self.linear = nn.Linear(num_channels[-1], output_size)

    def forward(self, inputs):
        """Inputs have to have dimension (N, C_in, L_in)"""
        # permute inputs to (N, L_in, C_in) for PyTorch's batch-first LSTM
        inputs = inputs.permute(0, 2, 1)
        y1, _ = self.lstm(inputs)  # input should have dimension (B, S, I)
        o = self.linear(y1[:, -1, :])
        return F.log_softmax(o, dim=1)
```
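To make the shape handling concrete: in the permuted-MNIST setting each image is fed as a length-784 sequence of single pixels, so the model receives `(batch, channels=1, seq_len=784)` and must permute to the `(batch, seq_len, input_size)` layout that a batch-first `nn.LSTM` expects. A standalone check of just that data flow (the hidden size of 130 matches the `--nhid=130` flag above; the `torch.randn` input is a dummy stand-in for real pixel data):

```python
import torch
import torch.nn as nn

batch, nhid = 4, 130
x = torch.randn(batch, 1, 784)                      # (N, C_in, L_in)
lstm = nn.LSTM(input_size=1, hidden_size=nhid, batch_first=True)
y, _ = lstm(x.permute(0, 2, 1))                     # -> (N, 784, nhid)
logits = nn.Linear(nhid, 10)(y[:, -1, :])           # last time step -> 10 classes
print(logits.shape)  # torch.Size([4, 10])
```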
I suspect I have missed some detail from the paper, as the learning curve you reported seems to be much steeper. One possibility is that we are plotting different x-axes. I ran for 10 epochs as suggested in the text, but it was not clear to me what "iterations" represented in your figure.
I don't have a good explanation for this since, as we noted in the paper, the LSTM's convergence does depend on a number of factors, such as the initial forget gate bias and the number of layers. As I didn't keep the LSTM code that I used to run Fig. 3(a), I cannot tell exactly where the problem is.
No problem. I suspect it is parameter initialisation, but I wanted to check in case I was missing an important detail from the paper. In case anyone stumbles across this issue in the future, I have attached a plot below of the training accuracy for those 10 runs (averaged over intervals of 1000 observations) for reference.
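A small sketch of the interval accuracy described above: average per-observation correctness over consecutive windows of 1000 training observations. The function name is hypothetical, not from the repository.

```python
import numpy as np

def interval_accuracy(correct, window=1000):
    """Mean accuracy over consecutive windows of `window` observations.
    `correct` is a sequence of 0/1 per-observation outcomes; any trailing
    partial window is dropped."""
    correct = np.asarray(correct, dtype=float)
    n = (len(correct) // window) * window
    return correct[:n].reshape(-1, window).mean(axis=1)
```

For example, 1000 correct predictions followed by 1000 incorrect ones yields the two interval accuracies `[1.0, 0.0]`.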
Hello and thanks for the paper and the helpful codebase!
I just wanted to clarify how the convergence plots in the paper were generated, particularly Fig. 3(a). The y-axis is labelled test accuracy, but the x-values seem to be more frequent than one per epoch. Could you confirm what data is being evaluated here and whether smoothing is taking place? Thanks