lifeiteng / vall-e

PyTorch implementation of VALL-E(Zero-Shot Text-To-Speech), Reproduced Demo https://lifeiteng.github.io/valle/index.html
Apache License 2.0

Why is DynamicBucketingSampler used in the default setting? #177

Closed: craggy-otake closed this issue 9 months ago

craggy-otake commented 9 months ago

Thanks for the nice repository. In VALL-E, it is important that the same speaker's voice appears within a batch when training the NAR model. However, the default configuration uses DynamicBucketingSampler, which sorts the data by duration, so a batch ends up being constructed from different speakers' voices (see the sketch after the snippet below).

valle/data/datamodule.py

if self.args.bucketing_sampler:
    logging.info("Using DynamicBucketingSampler")
    train_sampler = DynamicBucketingSampler(
        cuts_train,
        max_duration=self.args.max_duration,
        shuffle=self.args.shuffle,
        num_buckets=self.args.num_buckets,
        drop_last=True,
    )
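To make the concern concrete, here is a minimal sketch using lhotse's CutSet and DynamicBucketingSampler (the manifest path and max_duration value are hypothetical, not taken from this repo). It shows that batches are grouped by similar duration, while speaker identity is never considered.

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("cuts_train.jsonl.gz")  # hypothetical manifest path

# Buckets cuts by duration and draws batches up to max_duration seconds of audio.
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=80.0,
    shuffle=True,
    num_buckets=10,
    drop_last=True,
)

for batch_cuts in sampler:
    # Each batch holds cuts of similar length; the speakers in it are arbitrary.
    print({c.supervisions[0].speaker for c in batch_cuts})
    break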
craggy-otake commented 9 months ago

Sorry, my mistake; your implementation already handles this. After reading the code more carefully, I understand what you did in valle.py: during training, a single audio sample is split into a reference (prompt) segment and a target segment inside the forward function. Thank you.
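For other readers, a minimal sketch of that idea (not the repository's exact code; the function name and prompt-length range below are illustrative): each utterance's codec tokens are split into a reference prefix used as the acoustic prompt and a target suffix, so prompt and target always come from the same speaker even when the batch mixes speakers.

import torch

def split_prompt_and_target(codes: torch.Tensor, min_prompt: int = 75, max_prompt: int = 225):
    # codes: (T, num_quantizers) codec tokens for one utterance.
    # Pick a random-length prefix as the same-speaker acoustic prompt; the rest is the target.
    T = codes.shape[0]
    high = min(max_prompt, T // 2)
    prompt_len = torch.randint(min_prompt, high + 1, (1,)).item()
    return codes[:prompt_len], codes[prompt_len:]

# Usage: applied per utterance inside forward() during training, so the prompt
# and the prediction target share a speaker regardless of how the batch was sampled.
codes = torch.randint(0, 1024, (600, 8))  # dummy EnCodec-style token matrix
prompt, target = split_prompt_and_target(codes)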