lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in PyTorch
MIT License

Missing softmax after Linear layer #263

Closed: biendltb closed this issue 8 months ago

biendltb commented 8 months ago

Hi,

Thanks for the amazing work.

When training with this code, I observed very large output values for coarse_logits in CoarseTransformer. According to the transformer architecture in the original paper, a softmax follows the final linear layer, but the softmax is missing here, which results in large-magnitude output logits: https://github.com/lucidrains/audiolm-pytorch/blob/1a888d2f462384baf5dc8b4782f39a40f59593b7/audiolm_pytorch/audiolm_pytorch.py#L924

These unnormalized logits effectively disable gumbel_sample(), since the unit-scale noise the function adds is tiny relative to the logits themselves: https://github.com/lucidrains/audiolm-pytorch/blob/1a888d2f462384baf5dc8b4782f39a40f59593b7/audiolm_pytorch/audiolm_pytorch.py#L1655
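
For context, gumbel_sample() in the repository follows the usual Gumbel-max pattern. Roughly (a simplified sketch, not the exact source; see the link above for the real implementation):

```python
import torch

def log(t, eps=1e-20):
    # numerically safe log
    return torch.log(t.clamp(min=eps))

def gumbel_noise(t):
    # standard Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1)
    noise = torch.zeros_like(t).uniform_(0, 1)
    return -log(-log(noise))

def gumbel_sample(t, temperature=1., dim=-1):
    # add unit-scale Gumbel noise to the temperature-scaled logits,
    # then take the argmax
    return ((t / max(temperature, 1e-10)) + gumbel_noise(t)).argmax(dim=dim)
```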

Is the softmax layer missing here?

lucidrains commented 8 months ago

@biendltb no, I don't think so.

Gumbel noise acts on the raw logits, afaik.
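
For what it's worth, this is the Gumbel-max trick: taking argmax(logits + Gumbel noise) with unit-scale noise draws an exact sample from Categorical(softmax(logits)), so no explicit softmax is needed before sampling. A quick empirical check (a minimal sketch; the logit values here are arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# arbitrary, unnormalized logits (values chosen only for illustration)
logits = torch.tensor([4.0, 1.0, -2.0, 6.0])
num_classes = logits.numel()
n_samples = 200_000

# unit-scale Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1)
u = torch.rand(n_samples, num_classes)
gumbel = -torch.log(-torch.log(u.clamp_min(1e-20)).clamp_min(1e-20))

# Gumbel-max trick: argmax over (logits + noise) samples exactly from
# Categorical(softmax(logits)), with no explicit softmax applied
samples = (logits + gumbel).argmax(dim=-1)
empirical = torch.bincount(samples, minlength=num_classes).float() / n_samples

print(empirical)                  # empirical sampling frequencies
print(F.softmax(logits, dim=-1))  # should closely match the line above
```

Note that large-magnitude logits simply mean the implied softmax distribution is sharply peaked, so near-deterministic sampling is the mathematically correct behavior rather than a bug.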