[Closed] sungonce · closed 1 year ago
Hi, the training parameters detailed in Sec 3.4 (page 6) of the paper should be the same as the defaults in the code.
I think the line you are referencing (batch size of 180 for 1 epoch) is from our other paper Fromage. If you're interested in that work, the code is available in this repo.
@kohjingyu I'm sorry, you're right: the training settings I mentioned (batch size of 180 for 1 epoch) came from a different paper, not GILL. But I'm still a little puzzled. In your code, the model trains for 90 epochs, and each epoch consists of 2,000 steps, so it runs 180,000 (180K) steps in total. Are the 180K steps in your code equivalent to the 20K iterations in your paper?
Ah yes, you are right. The default number of epochs should actually be 10 (10 * 2000 = 20K iterations), which is now fixed in a18045a089b5d6c75e6ad848dcb9f6ee8ac19c18. Thanks for catching this!
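As a sanity check, the corrected arithmetic can be verified directly (a minimal sketch; the variable names simply mirror the `--epochs` and `--steps_per_epoch` flags from main.py):

```python
# With the corrected default of 10 epochs at 2,000 steps each,
# the total number of training iterations matches the paper's 20K.
epochs = 10
steps_per_epoch = 2000

total_iterations = epochs * steps_per_epoch
print(total_iterations)  # 20000
```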
Thank you for your quick reply! I look forward to seeing the results of your research.
Hi. Thank you for sharing your great research! Your work is very inspiring. I have a question about the training settings in the code to reproduce your paper.
Here's what your paper says: "We trained the model with batch size 180 for 1 epoch (18000 iterations) on 1 A6000 GPU (24 clock hours)."
and the parameters in your code are defined as follows (main.py):

```python
parser.add_argument('--epochs', default=90, type=int, metavar='N',
                    help='number of total epochs to run')
parser.add_argument('--steps_per_epoch', default=2000, type=int, metavar='N',
                    help='number of training steps per epoch')
parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                    help='manual epoch number (useful on restarts)')
parser.add_argument('--val_steps_per_epoch', default=-1, type=int, metavar='N',
                    help='number of validation steps per epoch')
parser.add_argument('-b', '--batch-size', default=200, type=int, metavar='N',
                    help='mini-batch size (default: 200), this is the total '
                         'batch size of all GPUs on the current node when '
                         'using Data Parallel or Distributed Data Parallel')
```
I think there is a discrepancy in the training parameters (especially the batch size and the number of iterations) between the paper and the code. Could you clarify this?
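For anyone reproducing this, the default/override behavior can be checked without running training at all; a minimal sketch, assuming the same two argparse flags as in main.py (only those flags are reproduced here):

```python
import argparse

# Reproduce just the two relevant flags from main.py.
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', default=90, type=int, metavar='N',
                    help='number of total epochs to run')
parser.add_argument('--steps_per_epoch', default=2000, type=int, metavar='N',
                    help='number of training steps per epoch')

# Defaults as shipped: 90 * 2000 = 180,000 total steps.
default_args = parser.parse_args([])
print(default_args.epochs * default_args.steps_per_epoch)  # 180000

# Overriding --epochs to 10 gives 20K total iterations.
args = parser.parse_args(['--epochs', '10'])
print(args.epochs * args.steps_per_epoch)  # 20000
```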