Closed by alanjeffares 3 years ago
I don't believe smoothing was applied (for example, there is a visible gap in the red LSTM line). My impression is that I simply evaluated the test set very frequently, likely more than once within each epoch.
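A minimal sketch of what "evaluating more than once per epoch" means for the x-axis: if the test set is scored every `eval_interval` training batches, each epoch contributes several test-accuracy points, so the x-axis counts iterations rather than epochs. The function name and parameters here are hypothetical, not from the repository.

```python
def eval_points_per_run(num_epochs, num_batches, eval_interval=100):
    """Return the iteration indices at which the test set would be scored,
    assuming one evaluation every `eval_interval` training batches."""
    points = []
    step = 0
    for _ in range(num_epochs):
        for _ in range(num_batches):
            # ... one training step would go here ...
            step += 1
            if step % eval_interval == 0:
                points.append(step)  # record (iteration, test accuracy)
    return points
```

With, say, 500 batches per epoch and `eval_interval=100`, a single epoch already yields five test-accuracy points, which would explain x-values denser than one per epoch.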
Thanks for the response.
That is surprising, as when I tried to reproduce this I obtain a comparable final result but the learning curve seems much steadier. The plot below contains 10 runs (seeds 1:10) with the parameters reported in the paper (`python pmnist_test.py --seed=1 --epochs=10 --dropout=0.0 --lr=0.001 --optim='RMSprop' --nhid=130 --clip=1`). The mean is plotted in dark blue.
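For clarity, the dark-blue line is just the per-evaluation-point average across the seeded runs. A sketch of that aggregation, with dummy data standing in for the real logged accuracies:

```python
import numpy as np

# Rows are runs (one per seed), columns are evaluation points.
# These values are dummy placeholders, not real results.
curves = np.array([[0.2, 0.6, 0.9],
                   [0.4, 0.8, 1.0]])
mean_curve = curves.mean(axis=0)  # average across seeds at each point
```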
And here is your plot again for reference:
I generated this by swapping out the TCN model with the following code:
```python
import torch.nn as nn
import torch.nn.functional as F


class LSTM(nn.Module):
    def __init__(self, input_size, output_size, num_channels, kernel_size, dropout):
        super(LSTM, self).__init__()
        self.nhid = num_channels[-1]
        self.lstm = nn.LSTM(input_size, num_channels[-1], batch_first=True, dropout=dropout)
        # initialise forget gate bias to 1
        # bias_ih contains (b_ii|b_if|b_ig|b_io)
        # bias_hh contains (b_hi|b_hf|b_hg|b_ho)
        self.lstm._parameters['bias_ih_l0'].data[self.nhid:self.nhid*2] = 1
        self.lstm._parameters['bias_hh_l0'].data[self.nhid:self.nhid*2] = 1
        self.linear = nn.Linear(num_channels[-1], output_size)

    def forward(self, inputs):
        """Inputs have to have dimension (N, C_in, L_in)"""
        # permute inputs to (N, L_in, C_in) for PyTorch's batch-first LSTM
        inputs = inputs.permute(0, 2, 1)
        y1, _ = self.lstm(inputs)  # input should have dimension (B, S, I)
        o = self.linear(y1[:, -1, :])
        return F.log_softmax(o, dim=1)
```
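To make the shape handling concrete: in the permuted-MNIST setting each image is fed as a length-784 sequence of single pixels, so the model receives `(batch, channels=1, seq_len=784)` and must permute to the `(batch, seq_len, input_size)` layout that a batch-first `nn.LSTM` expects. A standalone check of just that data flow (the hidden size of 130 matches the `--nhid=130` flag above; the `torch.randn` input is a dummy stand-in for real pixel data):

```python
import torch
import torch.nn as nn

batch, nhid = 4, 130
x = torch.randn(batch, 1, 784)                      # (N, C_in, L_in)
lstm = nn.LSTM(input_size=1, hidden_size=nhid, batch_first=True)
y, _ = lstm(x.permute(0, 2, 1))                     # -> (N, 784, nhid)
logits = nn.Linear(nhid, 10)(y[:, -1, :])           # last time step -> 10 classes
print(logits.shape)  # torch.Size([4, 10])
```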
I suspect I have missed some detail from the paper, as the learning curve you reported seems to be much steeper. One possibility is that we are plotting different x-axes. I ran for 10 epochs as suggested in the text, but it was not clear to me what "iterations" represented in your figure.
I don't have a good explanation for this since, as we noted in the paper, the LSTM's convergence does depend on a number of factors, such as the initial forget gate bias and the number of layers. As I didn't keep the LSTM code that I used to run Fig. 3(a), I cannot tell exactly where the problem is.
No problem. I suspect it is parameter initialisation, but I wanted to check in case I was missing an important detail from the paper. In case anyone stumbles across this issue in the future, I have attached a plot below of the training accuracy for those 10 runs (averaged over intervals of 1000 observations) for reference.
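A small sketch of the interval accuracy described above: average per-observation correctness over consecutive windows of 1000 training observations. The function name is hypothetical, not from the repository.

```python
import numpy as np

def interval_accuracy(correct, window=1000):
    """Mean accuracy over consecutive windows of `window` observations.
    `correct` is a sequence of 0/1 per-observation outcomes; any trailing
    partial window is dropped."""
    correct = np.asarray(correct, dtype=float)
    n = (len(correct) // window) * window
    return correct[:n].reshape(-1, window).mean(axis=1)
```

For example, 1000 correct predictions followed by 1000 incorrect ones yields the two interval accuracies `[1.0, 0.0]`.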
Hello and thanks for the paper and the helpful codebase!
I just wanted to clarify how the convergence plots in the paper were generated, particularly Fig. 3(a). The y-axis is labelled test accuracy, but the x-values seem to be more frequent than one per epoch. Could you confirm what data is being evaluated here and whether smoothing is taking place? Thanks