jxmorris12 / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

Query about the training parameters #13

Closed. kyriemao closed this 9 months ago

kyriemao commented 9 months ago

Hi jxmorris,

I have a query regarding the training parameters, specifically the batch size, epochs, and learning rate.

Upon reviewing your provided scripts, I observed that you used per_device_train_batch_size=128 and num_train_epochs=100 for training the inversion model, whereas for the corrector model you used per_device_train_batch_size=32 and num_train_epochs=100. In your published paper, you stated a batch size of 128 and mentioned using 4 A6000 GPUs for training over a maximum duration of 2 days.

In my experimental setup, I am attempting to train the inversion model using 4 A100 40G GPUs. However, I have noticed that it takes approximately a week to train the inversion model alone. Consequently, I would like to ask whether per_device_train_batch_size should be 128 or 32 for inversion model training.

Furthermore, there seems to be a discrepancy between the learning rate in your scripts (1e-3) and the one reported in your paper (2e-4). Could you please advise which learning rate you recommend for optimal results in my experiments?
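
For concreteness, here is roughly how I am reading those settings, mapped onto HuggingFace TrainingArguments. This is just a sketch of my understanding, not your actual run script; the output path is made up, and your code may wire these values through its own config:

```python
from transformers import TrainingArguments

num_gpus = 4  # my setup: 4x A100 40G

args = TrainingArguments(
    output_dir="./saves/gtr-inversion",  # hypothetical path
    per_device_train_batch_size=128,     # 128 in the inversion script, 32 in the corrector script
    num_train_epochs=100,
    learning_rate=1e-3,                  # scripts use 1e-3; the paper reports 2e-4
    bf16=True,
)

# Effective (global) batch size seen by the optimizer:
effective_batch = (args.per_device_train_batch_size
                   * num_gpus
                   * args.gradient_accumulation_steps)
print(effective_batch)  # 128 * 4 * 1 = 512
```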

Thanks a lot!

jxmorris12 commented 9 months ago

Hi! I think I tried to keep the hyperparameters consistent with the batch size. Which of the models are you training? I think you should use the largest batch size that fits on your GPUs; for A6000s in bf16, if you're training the sl-32 inverter and precomputing embeddings, I think you can fit a batch size of 512 per GPU. I actually just did this and it took less than a week but perhaps more than two days.

The OpenAI sl128 correction models certainly take a longer time to converge and we should probably update the paper with that detail. If that's the one you're concerned with let me know and I can provide more details.
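
Roughly, precomputing the frozen embeddings just means encoding the training texts once with the frozen embedder and caching the vectors, instead of re-running the embedder every training step. An illustrative sketch only (not the repo's actual data pipeline; the model name, placeholder texts, and cache path are just examples):

```python
import torch
from sentence_transformers import SentenceTransformer

# Encode the training corpus once with the frozen GTR embedder and cache the result;
# the inversion model then trains against these cached vectors.
embedder = SentenceTransformer("sentence-transformers/gtr-t5-base", device="cuda")

texts = ["example passage one", "example passage two"]  # placeholder corpus
with torch.no_grad():
    frozen_embeddings = embedder.encode(
        texts,
        batch_size=512,          # large encode batches are cheap since no gradients are kept
        convert_to_tensor=True,
    )

torch.save(frozen_embeddings.cpu(), "gtr_train_embeddings.pt")  # reuse across all epochs
```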

kyriemao commented 9 months ago

Thanks for your response! I am training a GTR-based inverter with a 128-token input length. Actually, the batch size settings you suggest in README.md fit on my GPUs, so I will first try the default README settings and see how the performance looks. Thanks!
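
Before kicking off the long run, I will probably sanity-check that the batch size fits in memory with a single forward/backward pass, something like the sketch below. It uses a plain t5-base as a stand-in rather than the actual vec2text inversion model, so it only gives a rough lower bound on the real memory footprint:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Stand-in model: the real inverter adds components on top of the T5 backbone,
# so actual memory use will be somewhat higher than this estimate.
model = T5ForConditionalGeneration.from_pretrained("t5-base").to("cuda", torch.bfloat16)
tokenizer = T5Tokenizer.from_pretrained("t5-base")

batch_size, seq_len = 128, 128  # the settings I intend to use
batch = tokenizer(
    ["placeholder text"] * batch_size,
    padding="max_length", max_length=seq_len, truncation=True,
    return_tensors="pt",
).to("cuda")

# One forward/backward pass is enough to observe the peak activation memory.
out = model(input_ids=batch.input_ids,
            attention_mask=batch.attention_mask,
            labels=batch.input_ids)  # dummy labels, just to get a loss to backprop
out.loss.backward()
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```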