Closed hexastrayer closed 7 months ago
After running for a period of time (about 200 steps), the speed improves to about 3 seconds per step, but I think this is still relatively slow. Specifically, the model's forward pass and the `self.accelerator.backward(total_loss)` call within the `_train_step` function each take approximately 1.5 seconds.
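One way to confirm where those 1.5 seconds go is to time the forward and backward passes explicitly, synchronizing CUDA so asynchronous kernel launches are counted. A minimal sketch (the `timed` helper and the toy model are illustrative, not part of the repo):

```python
import time
import torch

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds), synchronizing CUDA
    so asynchronous kernel launches are included in the measurement."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

# Illustrative usage with a toy model in place of the NS2 forward pass:
model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8)
out, fwd_s = timed(model, x)
loss = out.pow(2).mean()
_, bwd_s = timed(loss.backward)
print(f"forward: {fwd_s:.4f}s, backward: {bwd_s:.4f}s")
```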
Hi, I think 1.5 s per step is a normal speed for a V100. The main factor affecting training speed is likely IO, especially if your data is stored in the cloud rather than on a fast local disk. One feasible solution is to preload all data into memory beforehand.
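The preloading idea can be sketched as a thin `Dataset` wrapper that reads every item once, up front, so training steps never touch slow storage (the `PreloadedDataset` name and structure are my own, not code from Amphion; it assumes the whole dataset fits in RAM):

```python
import torch
from torch.utils.data import Dataset

class PreloadedDataset(Dataset):
    """Wrap an existing dataset and eagerly load every item into RAM.

    `base` stands in for your own dataset (e.g. the NS2 dataset class);
    after construction, __getitem__ is a pure in-memory lookup.
    """
    def __init__(self, base):
        self.items = [base[i] for i in range(len(base))]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]
```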
Thank you for your careful answer. I notice that the weight of the diff_ce loss is set to 0.5 (0.1 in the original paper) and that the diff_loss uses L1 loss (L2 in the original paper). Are these the optimal hyperparameters from your experiments, or would the ones in the original paper work better?
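For reference, the two configurations being compared could be written as a small hedged sketch (the `diffusion_loss` helper and its argument names are illustrative, not the repo's actual trainer code; `ce_loss` is assumed to be the precomputed cross-entropy term):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(pred, target, ce_loss, diff_loss_type="l1", lambda_ce=0.5):
    """Combined objective as discussed above.

    diff_loss_type="l1", lambda_ce=0.5 matches this repo's defaults;
    diff_loss_type="l2", lambda_ce=0.1 matches the original paper.
    """
    if diff_loss_type == "l1":
        diff = F.l1_loss(pred, target)
    else:
        diff = F.mse_loss(pred, target)
    return diff + lambda_ce * ce_loss
```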
@hexastrayer How did you manage to train NS2? I've seen that there is still a mismatch in the data preprocessing part, mentioned in https://github.com/open-mmlab/Amphion/issues/43. Could you please create a PR for this? Thanks a lot.
@dongngm I did not use the code in Amphion for data preprocessing or the dataset/loader. I used my own logic to provide the data needed in the `_train_step` function. It might be easier for you to rewrite `ns2_dataset.py` based on this. I made a lot of changes locally, so it's not easy to create a PR.
Hi @hexastrayer, if you have any further questions about NaturalSpeech 2 training speed, feel free to re-open this issue. We are glad to follow up!
Hi @hexastrayer, can you share the pre-trained model? Training really takes too long.
@a897456 Hi, we have provided the pre-trained checkpoint at https://huggingface.co/amphion/naturalspeech2_libritts
I'm interested in the training time for NS2. I'm currently using `accelerate launch` with a batch size of 16 across 8 Tesla V100 GPUs. However, each step takes approximately 5 seconds. I noticed that the checkpoint you supplied corresponds to 500k steps, which would extend the training time to over 20 days. Is this training time normal, or is something wrong?
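The estimate can be checked with simple back-of-envelope arithmetic (a throwaway helper, not code from the repo):

```python
def training_days(steps, sec_per_step):
    """Wall-clock estimate for a fixed per-step cost (86400 s per day)."""
    return steps * sec_per_step / 86400

# 500k steps at the speeds mentioned in this thread:
print(f"{training_days(500_000, 5):.1f} days at 5 s/step")    # ~28.9 days
print(f"{training_days(500_000, 3):.1f} days at 3 s/step")    # ~17.4 days
print(f"{training_days(500_000, 1.5):.1f} days at 1.5 s/step")  # ~8.7 days
```

So at 5 s/step the full 500k-step run is closer to 29 days, which is why shrinking the per-step time matters so much.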