lifeiteng / vall-e

PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech). Reproduced demo: https://lifeiteng.github.io/valle/index.html
Apache License 2.0
1.99k stars, 320 forks

Inconsistency about dimensions #174

Closed yuto-nozaki closed 10 months ago

yuto-nozaki commented 10 months ago

I am trying to run this project using a dataset that we have prepared.

After adapting prepare.sh (https://github.com/lifeiteng/vall-e/blob/main/egs/libritts/prepare.sh) to our dataset and running it, the following error occurred:

Computing features in batches:   7%|█████▉ | 1001/89992
Traceback (most recent call last):
  File "/home/user/VALL-E-X/egs/my_data/bin/tokenizer.py", line 276, in <module>
    main()
  File "/home/user/VALL-E-X/egs/my_data/bin/tokenizer.py", line 199, in main
    cut_set = cut_set.compute_and_store_features_batch(
  File "/home/user/python3.10/site-packages/lhotse/cut/set.py", line 2308, in compute_and_store_features_batch
    features = extractor.extract_batch(
  File "/home/user/valle/data/tokenizer.py", line 332, in extract_batch
    samples, lengths = self.pad_tensor_list(samples, device)
  File "/home/user/valle/data/tokenizer.py", line 324, in pad_tensor_list
    padded_tensor = torch.nn.utils.rnn.pad_sequence(
  File "/home/user/python3.10/site-packages/torch/nn/utils/rnn.py", line 398, in pad_sequence
    return torch._C._nn.pad_sequence(sequences, batch_first, padding_value)
RuntimeError: The size of tensor a (2) must match the size of tensor b (706) at non-singleton dimension 1

On inspection, the error seems to originate in the function compute_and_store_features_batch, defined at https://github.com/lhotse-speech/lhotse/blob/db40bc4e8595c0c3c1a418da200848e58df5b1c8/lhotse/cut/set.py#L1968.

The root cause appears to be that the tensors in the variable 'waves', built at https://github.com/lhotse-speech/lhotse/blob/db40bc4e8595c0c3c1a418da200848e58df5b1c8/lhotse/cut/set.py#L2127-L2150, do not all have the same number of dimensions.

Specifically, when I run:

for i in waves:
    print(i.shape)

Most of the tensors are 1-D, but a few are 2-D with a first dimension of size 2, as shown below:

torch.Size([8561])
torch.Size([4882])
torch.Size([32987])
torch.Size([881])
torch.Size([6682])
torch.Size([9713])
torch.Size([2, 706])
torch.Size([806])
....
torch.Size([6683])

Does anyone know the reason for this inconsistency in dimensions?

lhotse version: v1.17

FrancescoVV commented 10 months ago

Is it possible that some of your audio files have more than one channel?
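
A quick way to check could look like the sketch below. It assumes the dataset consists of PCM WAV files under a single directory and uses only Python's standard `wave` module. The function name `find_multichannel_wavs` and the directory layout are hypothetical and not part of lhotse or this repo. Other formats (FLAC, MP3, and so on) would need a different reader.

```python
import wave
from pathlib import Path


def find_multichannel_wavs(root):
    """Return (path, channel_count) for every WAV file under `root`
    that has more than one channel."""
    offenders = []
    for path in sorted(Path(root).rglob("*.wav")):
        # wave.open expects a str path or a file object, so convert explicitly.
        with wave.open(str(path), "rb") as wav:
            channels = wav.getnchannels()
        if channels > 1:
            offenders.append((path, channels))
    return offenders


if __name__ == "__main__":
    import sys

    # Usage: python check_channels.py /path/to/dataset
    for path, channels in find_multichannel_wavs(sys.argv[1]):
        print(f"{path}: {channels} channels")
```

Any file it lists would load as a tensor of shape [channels, samples] rather than [samples], which would match the torch.Size([2, 706]) entry above.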

yuto-nozaki commented 10 months ago

Thank you very much for answering. I'll look into it.

yuto-nozaki commented 10 months ago

@FrancescoVV Thank you, I solved the problem. As you said, some of our audio files had multiple channels.
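
For anyone who hits the same error: the thread does not say how the files were fixed, but one possible approach is to downmix the offending files to mono before running the preparation script. The sketch below assumes 16-bit PCM WAV input and averages the channels using only the standard library. The function name `downmix_to_mono` is hypothetical, and this is not necessarily what was done here.

```python
import array
import sys
import wave


def downmix_to_mono(src, dst):
    """Average all channels of a 16-bit PCM WAV file and write a mono copy."""
    with wave.open(src, "rb") as reader:
        channels = reader.getnchannels()
        width = reader.getsampwidth()
        rate = reader.getframerate()
        frames = reader.readframes(reader.getnframes())
    if width != 2:
        raise ValueError("this sketch only handles 16-bit PCM")

    samples = array.array("h")
    samples.frombytes(frames)
    # WAV data is little-endian; array.array uses the native byte order.
    if sys.byteorder == "big":
        samples.byteswap()

    # Samples are interleaved (L, R, L, R, ...), so average each frame.
    mono = array.array(
        "h",
        (
            sum(samples[i:i + channels]) // channels
            for i in range(0, len(samples), channels)
        ),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(rate)
        writer.writeframes(mono.tobytes())
```

Averaging keeps both channels' content. If only one channel carries speech, selecting that channel instead may be the better choice.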