unilight / seq2seq-vc

A sequence-to-sequence voice conversion toolkit.
MIT License

l2-arctic/cascade vocoder issue after stage 6 #12

Closed KevinGengGavo closed 3 months ago

KevinGengGavo commented 5 months ago

Hi @unilight, long time no see. Congratulations on your graduation! I should call you sensei now!

Issue

I finished --stage -1 through stage 5 and generated promising, non-accented bdl-like voices. However, for the non-parallel conversion, I don't think we have access to the vocoder checkpoint referenced in the config:

vocoder:
  checkpoint: /data/group1/z44476r/Experiments/ParallelWaveGAN/egs/l2-arctic/voc1/exp/train_nodev_TXHC_parallel_wavegan.v1/checkpoint-105000steps.pkl

Shall we change this to our local pwg_TXHC? As you mentioned in lsc/README.md.

Looking forward to your reply!

KevinGengGavo commented 5 months ago

Also I wonder if the --norm_name self in stage 4 is necessary.

Though you mentioned that in README.md, the default norm_name before stage 3 was ljspeech, so there will only be dump/*/norm_ljspeech rather than dump/*/norm_self for stage 4 decoding.

Should I ignore this?

unilight commented 5 months ago

Hi @KevinGengGavo,

long time no see. Congratulations on your graduation! I should call you sensei now!

Sorry I am not quite sure who you are... but thank you :)

Shall we change this to our local pwg_TXHC? As you mentioned in lsc/README.md.

Yes! Please do so.
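For anyone else hitting this, the change is just to point the vocoder entry of the conversion config at your locally trained ParallelWaveGAN checkpoint. A sketch of what that might look like (the path below is hypothetical and depends on where your pwg_TXHC run, set up as in lsc/README.md, saved its checkpoints):

```yaml
# Hypothetical local path; adjust to wherever your own pwg_TXHC
# ParallelWaveGAN training wrote its checkpoints.
vocoder:
  checkpoint: downloads/pwg_TXHC/checkpoint-400000steps.pkl
```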

Also I wonder if the --norm_name self in stage 4 is necessary.

You are right, it is not needed. Sorry I didn't check carefully (and you are probably the only person so far who has tried this implementation...).

KevinGengGavo commented 5 months ago

Hi, I’ve tried the pwg_TXHC vocoder after stage 5, and it's working now. However, the output after stage 6 has more artifacts than I expected.

Here, I have attached several samples in egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/TXHC_eval and egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/stage2_ppg_sxliu_checkpoint-50000steps_TXHC_eval.

The spectrograms of the stage 5 output and the stage 6 input don't look the same; I wonder if it's due to a normalization error.

I would appreciate it if you could help.

seq2seq-vc_isuues.zip

KevinGengGavo commented 5 months ago

Besides, a version conflict occurred during stage 6 at tools/venv/lib/python3.10/site-packages/s3prl_vc/upstream/ppg_sxliu/stft.py, line 95.

Here is my PyTorch environment:

torch                    2.0.1
torch-complex            0.4.3
torchaudio               2.0.2

According to the torch.stft documentation, the return_complex parameter is now required, but the original implementation ignores it.

I set return_complex=False, and you can see the ComplexTensor output in nvpc_decode.log. I'm not sure if this is correct, but it is the only change that makes the code runnable.
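As a minimal standalone illustration of the API change (not the toolkit's code): since PyTorch 2.x, torch.stft requires return_complex to be passed explicitly, and the recommended way to recover the old (..., 2) real/imag layout is torch.view_as_real:

```python
import torch

x = torch.randn(1, 16000)  # dummy 1-second waveform batch
window = torch.hann_window(512)

# Since PyTorch 2.x, return_complex must be passed explicitly;
# return_complex=True yields a complex tensor of shape (batch, freq, frames).
spec = torch.stft(x, n_fft=512, hop_length=128, window=window,
                  return_complex=True)

# Recover the old real-valued layout (batch, freq, frames, 2=real_imag)
# that downstream code such as stft.py expects.
spec_ri = torch.view_as_real(spec)
print(spec.is_complex(), spec_ri.shape[-1])  # True 2
```

Setting return_complex=False instead is deprecated, which is why patching the call site to use return_complex=True plus view_as_real is the safer route.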

KevinGengGavo commented 5 months ago

Hi @unilight, I would appreciate it if you could take some time to look at this. Thank you.

Jasmijn888 commented 5 months ago

Besides, a version conflict occurred during stage 6 at tools/venv/lib/python3.10/site-packages/s3prl_vc/upstream/ppg_sxliu/stft.py, line 95.

Here is my PyTorch environment:

torch                    2.0.1
torch-complex            0.4.3
torchaudio               2.0.2

According to the torch.stft documentation, the return_complex parameter is now required, but the original implementation ignores it.

I set return_complex=False, and you can see the ComplexTensor output in nvpc_decode.log. I'm not sure if this is correct, but it is the only change that makes the code runnable.

I got the same issue. You can try adding this block at line 641 (in the PyTorch source):

    if not return_complex:
        return torch.view_as_real(
            _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
                     normalized, onesided, return_complex=True)
        )

This works for me. Good luck!

KevinGengGavo commented 5 months ago

Hi @Jasmijn888, thanks for your reply. I fixed this issue by adding torch.view_as_real at tools/venv/lib/python3.10/site-packages/s3prl_vc/upstream/ppg_sxliu/stft.py.

Here are my lines 69 to 96:

        # or (Batch, Channel, Freq, Frames, 2=real_imag)
        if not self.kaldi_padding_mode:
            output = torch.stft(
                input,
                n_fft=self.n_fft,
                win_length=self.win_length,
                hop_length=self.hop_length,
                center=self.center,
                pad_mode=self.pad_mode,
                normalized=self.normalized,
                onesided=self.onesided,
                return_complex=True,
            )
        else:
            # NOTE(sx): Use Kaldi-fasion padding, maybe wrong
            num_pads = self.n_fft - self.win_length
            input = torch.nn.functional.pad(input, (num_pads, 0))
            output = torch.stft(
                input,
                n_fft=self.n_fft,
                win_length=self.win_length,
                hop_length=self.hop_length,
                center=False,
                pad_mode=self.pad_mode,
                normalized=self.normalized,
                onesided=self.onesided,
                return_complex=True,
            )
        # Change complex output to real and imag parts
        output = torch.view_as_real(output)

That said, I don't recommend modifying the PyTorch source code. Thanks for your feedback anyway!

KevinGengGavo commented 5 months ago

@Jasmijn888 I'm more curious about your output after stage 6. What does it sound like?

Hi, I’ve tried the pwg_TXHC vocoder after stage 5, and it's working now. However, the output after stage 6 has more artifacts than I expected.

Here, I have attached several samples in egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/TXHC_eval and egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/stage2_ppg_sxliu_checkpoint-50000steps_TXHC_eval.

The spectrograms of the stage 5 output and the stage 6 input don't look the same; I wonder if it's due to a normalization error.

I would appreciate it if you could help.

seq2seq-vc_isuues.zip

After stages 5 and 6, my mel output looks fine, but the wav output seems to be overflowing. Would you mind checking your output as well?

unilight commented 5 months ago

Hi @KevinGengGavo, I've tried running the code on my local server again, but I did not encounter the overflow issue. I can only suspect it's because of the new stft argument setting. Can you follow the official recommendation and use torch.view_as_real() to recover the real tensor?

Jasmijn888 commented 5 months ago

@unilight Hi, Dr. Huang! In the cascade method, since mel spectrograms are used as features, I assume the feature extraction is language-independent. If I want to train an accent conversion model on a different language, can I start by using the provided model for feature extraction? Thanks!

unilight commented 5 months ago

Hi @Jasmijn888,

The mel spectrogram is indeed language-independent, so you can use it for any language. Though I don’t quite understand what you mean by “start here by using the provided model for feature extraction”. If you want to use your own dataset, you need to use it to train (1) a neural vocoder (e.g. ParallelWaveGAN) and (2) a non-parallel frame-based model as provided by s3prl-vc.

KevinGengGavo commented 5 months ago

Hi @unilight,

Hi @KevinGengGavo, I've tried running the code on my local server again, but I did not encounter the overflow issue. I can only suspect it's because of the new stft argument setting. Can you follow the official recommendation and use torch.view_as_real() to recover the real tensor?

Thanks, I've resolved the STFT problem with the modification mentioned earlier.

I also think I've pinpointed the cause of the audio overflow: the output features were not denormalized. I added a denormalization step to s3prl-vc-decode by modifying tools/venv/lib/python3.10/site-packages/s3prl_vc/bin/decode.py at line 257:

            # model forward
            out, _, _olens = model(hs, hlens, spk_embs=spemb, f0s=f0s)
            if out.dim() != 2:
                out = out.squeeze(0)

            # try denormalize
            if "s3prl-vc-ppg_sxliu" in args.trg_stats:
                out = out * config["trg_stats"]["scale"] + config["trg_stats"]["mean"]

This adjustment produced reasonable results for me. The mean CER and WER are now 30.2 and 52.5, respectively, similar to those reported in your paper. fac_cascade_denormalized.zip
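To illustrate why the missing step causes overflow (a toy NumPy sketch, not the toolkit's code, with made-up statistics standing in for config["trg_stats"]): the frame-based model predicts features in the normalized space, so passing them to the vocoder without applying out * scale + mean leaves them on the wrong scale, while denormalizing restores it exactly:

```python
import numpy as np

# Made-up stand-ins for the target-speaker statistics in config["trg_stats"];
# in the real pipeline these come from the training data.
rng = np.random.default_rng(0)
mel = rng.normal(loc=3.0, scale=2.0, size=(100, 80))  # fake mel frames
mean, scale = mel.mean(axis=0), mel.std(axis=0)

normalized = (mel - mean) / scale         # the space the model predicts in
denormalized = normalized * scale + mean  # what the vocoder actually expects

print(np.allclose(denormalized, mel))  # True: denormalizing restores the scale
```

Feeding `normalized` straight to a vocoder trained on the original scale is what produces the clipped, overflowing waveforms described above.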

I'm not sure if there was an issue in my data processing. I'll check whether stg has the same problem.

unilight commented 3 months ago

@KevinGengGavo This is indeed a bug in the s3prl_vc package, and the solution is indeed to add the line to denormalize the feature. I have fixed it and published the latest s3prl_vc package. If anyone is still having this issue, make sure to update the s3prl_vc package to 0.3.1. Thanks!