I see that you use 1-D convolutions throughout the code, which should technically perform the same as Dense layers. The main difference I found online is that Dense layers have a shorter computation time, so I am wondering why you use 1D convolutions here.
https://github.com/openai/finetune-transformer-lm/blob/a69b5c43b0452462890bca8ff92fb75dee9290cf/train.py#L106
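For context, here is a minimal sketch (plain NumPy, not the code from this repo) of what I mean by "perform the same", assuming a kernel width of 1: the convolution then reduces to the same matrix multiply a Dense layer applies at every position.

```python
import numpy as np

# Toy shapes: batch of 2 sequences, 5 positions, 8 input features, 4 output features.
batch, seq_len, n_in, n_out = 2, 5, 8, 4
x = np.random.randn(batch, seq_len, n_in)

# Shared weights and bias for both "layers".
w = np.random.randn(n_in, n_out)
b = np.random.randn(n_out)

# Dense layer applied independently at every position: y[i, t] = x[i, t] @ w + b
dense_out = x @ w + b

# 1-D convolution with kernel width 1: slide a (1, n_in, n_out) filter over the sequence.
# With width 1 there is no mixing across positions, so each step is the same matmul.
conv_w = w.reshape(1, n_in, n_out)
conv_out = np.stack(
    [x[:, t, :] @ conv_w[0] + b for t in range(seq_len)], axis=1
)

print(np.allclose(dense_out, conv_out))  # True: the outputs are numerically identical
```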
Cheers