Closed JLU-Neal closed 3 years ago
For the first question: since Menghao's code was directly borrowed and translated from the author's official Jittor implementation (https://github.com/MenghaoGuo/PCT) at the time I implemented this, I don't know whether the author intended to use max pooling, or whether there is simply little performance difference between MA-Pool and max pooling.
For the second question: yes, the training schedules for the three methods are different. In my view, an over-tuned schedule can hide the actual contribution of the network architecture. In any case, I would appreciate it if someone could retrain all of these models with a cosine annealing schedule, to compare them in a truly fair setting.
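For reference, cosine annealing is straightforward to compute; here is a minimal sketch of the schedule (the specific `lr_max=0.01` matches the paper's initial rate, while `lr_min=0.0` and the step count are my own assumptions for illustration):

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=0.01, lr_min=0.0):
    """Cosine-annealed learning rate (no warm restarts).

    Decays smoothly from lr_max at step 0 to lr_min at total_steps.
    """
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

# Example: a 200-epoch schedule starting at 0.01
schedule = [cosine_annealing_lr(e, 200) for e in range(201)]
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR`, which is the common way to train all three models under the same schedule.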
1) In the paper, the output of the encoder is processed by an MA-Pool layer, instead of a single max pooling. 2) The initial learning rate in the paper is 0.01, with a cosine annealing schedule.
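To make the distinction in 1) concrete, here is a minimal sketch under the assumption that MA-Pool means concatenating max-pooled and average-pooled features over the point dimension (my reading of the paper; the function names below are hypothetical, not from either codebase):

```python
def max_pool(features):
    # features: N per-point feature vectors, each of length D
    # returns a single D-dim vector: elementwise max over points
    return [max(col) for col in zip(*features)]

def ma_pool(features):
    # assumed MA-Pool: concatenate max- and average-pooled features
    # returns a 2*D-dim vector, richer than max pooling alone
    n = len(features)
    max_part = [max(col) for col in zip(*features)]
    avg_part = [sum(col) / n for col in zip(*features)]
    return max_part + avg_part

feats = [[1.0, 4.0], [3.0, 2.0], [2.0, 6.0]]
# max_pool(feats) -> [3.0, 6.0]
# ma_pool(feats)  -> [3.0, 6.0, 2.0, 4.0]
```

If the performance gap between the two is small, that would explain why the official implementation may have settled on plain max pooling.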