jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License

Encountered shape inconsistency when training at 16 kHz #19

Closed dyyoungg closed 4 days ago

dyyoungg commented 1 week ago

Thanks for your great work! I want to train WavTokenizer on my own datasets at 16 kHz, but I encounter a tensor shape inconsistency in the following code:

```
periodicity_loss, pitch_loss, f1_score = calculate_periodicity_metrics(audio_16_khz, audio_hat_16khz)
periodicity_loss = np.sqrt(((pred_periodicity - true_periodicity) ** 2).mean(axis=1)).mean()
ValueError: operands could not be broadcast together with shapes (2,395) (2,394)
```
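Doing the arithmetic by hand (a rough check of my own, not code from the repo, assuming the decoder output length is the number of encoder frames times the hop, where the hop is the product of the downsampling factors and also equals the ISTFT hop_length of 600), the reconstruction comes out longer than the input:

```python
# Rough check (my own arithmetic, not repo code): with a hop of 600 the
# reconstruction is padded to a whole number of frames and overshoots the input.
import math

num_samples = 64000                    # num_samples in my config
hop = 6 * 5 * 5 * 4                    # dowmsamples [6, 5, 5, 4] -> 600, same as hop_length
frames = math.ceil(num_samples / hop)  # 64000 / 600 = 106.67 -> padded to 107 frames
print(frames * hop)                    # 64200 samples, not the original 64000
```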

I checked the model output and the original audio shapes, and indeed they are 64200 and 64000 samples respectively. The following is my config:

```yaml
seed_everything: 3407

data:
  class_path: decoder.dataset.VocosDataModule
  init_args:
    train_params:
      filelist_path: /data/giga.txt
      sampling_rate: 16000 # modified
      num_samples: 64000 # modified
      batch_size: 20  # 20
      num_workers: 8

    val_params:
      filelist_path: /data/test.txt
      sampling_rate: 16000  # modified
      num_samples: 64000 # modified
      batch_size: 2   # 10
      num_workers: 8

model:
  class_path: decoder.experiment.WavTokenizer
  init_args:
    sample_rate: 16000 # modified
    initial_learning_rate: 2e-4
    mel_loss_coeff: 45
    mrd_loss_coeff: 1.0
    num_warmup_steps: 0 # Optimizers warmup steps
    pretrain_mel_steps: 0  # 0 means GAN objective from the first iteration

    # automatic evaluation
    evaluate_utmos: true
    evaluate_pesq: true
    evaluate_periodicty: true

    resume: false
    resume_config: ./configs/wavtokenizer_smalldata_frame40_4s_nq1_code4096_dim512_kmeans200_attn_test.yaml
    resume_model: ./version_3/checkpoints/vocos_checkpoint_epoch=31_step=157696_val_loss=5.9855.ckpt

    feature_extractor:
      class_path: decoder.feature_extractors.EncodecFeatures
      init_args:
        encodec_model: encodec_16khz # modified,
        bandwidths: [6.6, 6.6, 6.6, 6.6]
        train_codebooks: true
        num_quantizers: 1  
        dowmsamples: [6, 5, 5, 4]
        vq_bins: 4096
        vq_kmeans: 200

    backbone:
      class_path: decoder.models.VocosBackbone
      init_args:
        input_channels: 512
        dim: 768
        intermediate_dim: 2304
        num_layers: 12
        adanorm_num_embeddings: 4  

    head:
      class_path: decoder.heads.ISTFTHead
      init_args:
        dim: 768
        n_fft: 2400 
        hop_length: 600
        padding: same
```

Is there any mistake or misunderstanding in my settings above that causes the model's output shape to be inconsistent with the input shape?

dyyoungg commented 1 week ago

I changed the downsamples to [8, 5, 4, 2], so the token rate is 16000 / (8 × 5 × 4 × 2) = 50 tokens per second, and I changed the hop_length to 320 and n_fft to 1280; now everything works. Is this config reasonable?
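For completeness, the derived quantities under the new settings (my own quick arithmetic, not repo code):

```python
# Quick arithmetic for the revised 16 kHz config (my own check, not repo code).
downsamples = [8, 5, 4, 2]
hop = 8 * 5 * 4 * 2             # 320, which should match hop_length: 320
token_rate = 16000 // hop       # 50 tokens per second
frames_per_clip = 64000 // hop  # 200 frames per 4-second clip, an exact integer
# n_fft: 1280 keeps the same 4x hop ratio as the original 2400 / 600.
print(hop, token_rate, frames_per_clip)  # 320 50 200
```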

jishengpeng commented 1 week ago

> I changed the downsamples to [8, 5, 4, 2], so the token rate is 16000 / (8 × 5 × 4 × 2) = 50 tokens per second, and I changed the hop_length to 320 and n_fft to 1280; now everything works. Is this config reasonable?

You may observe that the product of the downsampling factors (the total hop) must divide the sampling rate evenly. This explains why one configuration is correct while the other is not.
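A quick way to check this before launching training (a rough sketch, not code from the repository):

```python
# Rough sanity check (sketch, not repo code): the total hop, i.e. the product of
# the downsampling factors, should divide the sampling rate and num_samples evenly.
from math import prod

def hop_is_consistent(sampling_rate, num_samples, downsamples):
    hop = prod(downsamples)
    return hop, sampling_rate % hop == 0 and num_samples % hop == 0

print(hop_is_consistent(16000, 64000, [6, 5, 5, 4]))  # (600, False) -> shape mismatch
print(hop_is_consistent(16000, 64000, [8, 5, 4, 2]))  # (320, True)  -> shapes line up
```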

dyyoungg commented 1 week ago

> I changed the downsamples to [8, 5, 4, 2], so the token rate is 16000 / (8 × 5 × 4 × 2) = 50 tokens per second, and I changed the hop_length to 320 and n_fft to 1280; now everything works. Is this config reasonable?

> You may observe that the product of the downsampling factors (the total hop) must divide the sampling rate evenly. This explains why one configuration is correct while the other is not.

Yeah, I noticed that. Thanks for your reply!