jasonwu0731 / ToD-BERT

Pre-Trained Models for ToD-BERT

Distributed training for the RCL task #24

Open JadinTredupLP opened 2 years ago

JadinTredupLP commented 2 years ago

Hello, I am trying to pretrain ToD-BERT on my own dataset, but because of the dataset's size I need to distribute training to speed up computation. It seems like distributed training is built into the MLM task, but distributing the RCL task throws an error. We have written some code to distribute the RCL task, but our training results show little-to-no improvement on the RS loss compared to the single-GPU case. I am wondering whether there is a specific reason you decided not to distribute the RCL task over multiple GPUs, a problem you encountered, or whether there is likely just a bug in our code.
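
For context, a minimal sketch of the kind of cross-GPU gather we have in mind is below. It is not the repo's code; it assumes the RCL loss is an in-batch response-selection objective (cross-entropy over a context-response similarity matrix), and that each rank should see the responses from every other rank as additional negatives. The names `gather_with_grad` and `rcl_loss` are ours, not from ToD-BERT.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F


def gather_with_grad(local_emb):
    """All-gather embeddings from every rank.

    torch.distributed.all_gather does not propagate gradients, so the
    local slice is put back in place to keep its autograd graph.
    """
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_emb) for _ in range(world_size)]
    dist.all_gather(gathered, local_emb)
    gathered[dist.get_rank()] = local_emb  # keep gradient flow for the local chunk
    return torch.cat(gathered, dim=0)


def rcl_loss(context_emb, response_emb):
    """Contrastive response-selection loss with negatives pooled across GPUs."""
    all_responses = gather_with_grad(response_emb)   # (world_size * B, H)
    logits = context_emb @ all_responses.t()          # (B, world_size * B)
    # After the gather, the positive for local example i sits at rank * B + i.
    offset = dist.get_rank() * context_emb.size(0)
    labels = torch.arange(context_emb.size(0), device=logits.device) + offset
    return F.cross_entropy(logits, labels)
```

Without a gather like this, each GPU only contrasts against its local mini-batch, so the effective number of negatives per example shrinks relative to the single-GPU run; that may or may not be the cause of what we're seeing.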

jasonwu0731 commented 2 years ago

Hi,

Can you share the error you get when running the RCL training? We did not focus much on parallel training at the time and relied on the Hugging Face implementation for that.

JadinTredupLP commented 2 years ago

I am not actually getting an error; the RS loss just stops decreasing once training is distributed. On a single GPU it converges fine (for a small amount of data), but for the same amount of data the distributed run did not converge at all.
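
One sanity check we've been considering (a hedged sketch, not something from the repo): if gradients are not actually synchronized across ranks, each GPU drifts on its own shard and the loss can plateau exactly like this. Broadcasting rank 0's parameters and comparing them against each rank's local copy after a few optimizer steps can confirm or rule that out. The helper name `assert_params_in_sync` is ours.

```python
import torch
import torch.distributed as dist


def assert_params_in_sync(model, atol=1e-6):
    """Compare every parameter against rank 0's copy; raise if any rank has diverged.

    Must be called on all ranks, since dist.broadcast is a collective operation.
    """
    for name, param in model.named_parameters():
        reference = param.detach().clone()
        dist.broadcast(reference, src=0)  # rank 0's values overwrite `reference` everywhere
        if not torch.allclose(param.detach(), reference, atol=atol):
            raise RuntimeError(
                f"Rank {dist.get_rank()}: parameter {name} is out of sync with rank 0"
            )
```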