Closed KevinGengGavo closed 3 months ago
Also I wonder if the `--norm_name self` in stage 4 is necessary. Though you mentioned it in README.md, the default `norm_name` before stage 3 was `ljspeech`, so there will only be `dump/*/norm_ljspeech` rather than `dump/*/norm_self` for stage 4 decoding. Should I ignore this?
Hi @KevinGengGavo,

> long time no see. Congratulations on your graduation! I should call you sensei now!

Sorry, I am not quite sure who you are... but thank you :)

> Shall we change this to our local `pwg_TXHC`? As you mentioned in `lsc/README.md`.

Yes! Please do so.

> Also I wonder if the `--norm_name self` in stage 4 is necessary.

You are right, it is not needed... Sorry I didn't check carefully (and obviously you are probably the only person so far who has tried the implementation...)
Hi, I've tried the `pwg_TXHC` vocoder after stage 5, and it's somehow working now. However, the artifacts after stage 6 are more than I expected.

I have attached several samples in `egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/TXHC_eval` and `egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/stage2_ppg_sxliu_checkpoint-50000steps_TXHC_eval`.

The spectrogram doesn't look the same between the stage 5 output and the stage 6 input; I wonder if it's due to a normalization error. I would appreciate it if you could help.
Besides, a version conflict occurred during stage 6 at `tools/venv/lib/python3.10/site-packages/s3prl_vc/upstream/ppg_sxliu/stft.py`, line 95.

Here is my PyTorch env:

```
torch          2.0.1
torch-complex  0.4.3
torchaudio     2.0.2
```

In `torch.stft`, the `return_complex` parameter is now required, while the original implementation ignored it. I set `return_complex=False`, and you can see the output of `ComplexTensor` in `nvpc_decode.log`. I'm not sure if this is correct, but it is the only change that makes the code runnable.
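For context, `torch.view_as_real` turns a complex tensor into a real one with a trailing (real, imag) axis of size 2. A minimal numpy sketch of that layout, using a toy single-frame signal (not anything from this repo):

```python
import numpy as np

# One frame of a windowed signal (stand-in for what torch.stft processes).
frame = np.sin(2 * np.pi * 5 * np.arange(512) / 512)

# Complex one-sided spectrum, analogous to torch.stft(..., return_complex=True).
spec = np.fft.rfft(frame)

# torch.view_as_real(spec) appends a trailing axis of size 2 holding
# (real, imag) pairs; np.stack reproduces that layout here.
spec_ri = np.stack([spec.real, spec.imag], axis=-1)

print(spec_ri.shape)  # (257, 2): freq bins x (real, imag)
```

The pair layout is lossless: recombining the two slices as `spec_ri[..., 0] + 1j * spec_ri[..., 1]` recovers the complex spectrum exactly, which is why downstream code expecting the old `(…, 2)` shape keeps working after the conversion.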
Hi @unilight, I would appreciate it if you could take some time to look at this. Thank you.
> Besides, a version conflict occurred during stage 6 at `tools/venv/lib/python3.10/site-packages/s3prl_vc/upstream/ppg_sxliu/stft.py`, line 95. Here is my PyTorch env: torch 2.0.1, torch-complex 0.4.3, torchaudio 2.0.2. In `torch.stft`, the `return_complex` parameter is now required, while the original implementation ignored it. I set `return_complex=False`, and you can see the output of `ComplexTensor` in `nvpc_decode.log`. I'm not sure if this is correct, but it is the only change that makes the code runnable.
I got the same issue. You can try adding this block at line 641:

```python
if not return_complex:
    return torch.view_as_real(
        _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
                 normalized, onesided, return_complex=True)
    )
```

This works for me. Good luck!
Hi @Jasmijn888,

Thanks for your reply. I fixed this issue by adding `torch.view_as_real` in `tools/venv/lib/python3.10/site-packages/s3prl_vc/upstream/ppg_sxliu/stft.py`.

Here are my lines 69 to 96:
```python
# or (Batch, Channel, Freq, Frames, 2=real_imag)
if not self.kaldi_padding_mode:
    output = torch.stft(
        input,
        n_fft=self.n_fft,
        win_length=self.win_length,
        hop_length=self.hop_length,
        center=self.center,
        pad_mode=self.pad_mode,
        normalized=self.normalized,
        onesided=self.onesided,
        return_complex=True,
    )
else:
    # NOTE(sx): Use Kaldi-fashion padding, maybe wrong
    num_pads = self.n_fft - self.win_length
    input = torch.nn.functional.pad(input, (num_pads, 0))
    output = torch.stft(
        input,
        n_fft=self.n_fft,
        win_length=self.win_length,
        hop_length=self.hop_length,
        center=False,
        pad_mode=self.pad_mode,
        normalized=self.normalized,
        onesided=self.onesided,
        return_complex=True,
    )
# Change complex output to real and imag parts
output = torch.view_as_real(output)
```
I don't recommend modifying the PyTorch source code anyway. However, thanks for your feedback!

@Jasmijn888 I'm more curious about your output after stage 6. What does it sound like?
> Hi, I've tried the `pwg_TXHC` vocoder after stage 5, and it's somehow working now. However, the artifacts after stage 6 are more than I expected. I have attached several samples in `egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/TXHC_eval` and `egs/l2-arctic/cascade/exp/TXHC_bdl_1032_vtn.tts_pt.v1/results/checkpoint-50000steps/stage2_ppg_sxliu_checkpoint-50000steps_TXHC_eval`. The spectrogram doesn't look the same between the stage 5 output and the stage 6 input; I wonder if it's due to a normalization error. I would appreciate it if you could help.
After stages 5 and 6, my `mel` output appears to be fine, but the `wav` output seems to be overflowing. Would you mind reviewing your output as well?
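Overflow like this usually shows up as float samples outside the valid [-1.0, 1.0] range before the waveform is written to 16-bit PCM. A quick hypothetical check on synthetic signals (the helper name and signals are illustrative, not from this repo):

```python
import numpy as np

def clipping_ratio(wav: np.ndarray) -> float:
    """Fraction of samples outside the valid [-1.0, 1.0] float range."""
    return float(np.mean(np.abs(wav) > 1.0))

# A healthy waveform stays within [-1, 1]; an un-denormalized decoder
# output can be orders of magnitude larger and will clip on PCM export.
ok = 0.5 * np.sin(np.linspace(0, 20 * np.pi, 8000))
bad = 40.0 * ok  # stand-in for an overflowing decoder output

print(clipping_ratio(ok), clipping_ratio(bad))
```

Running something like this on the generated wav arrays (before they are saved) distinguishes a vocoder artifact from a plain scaling bug: a scaling bug gives a large clipping ratio, while a vocoder artifact typically stays in range.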
Hi @KevinGengGavo, I've tried to run the code on my local server again but I did not encounter the overflow issue. I can only suspect it's because of the new stft argument setting. Can you follow the official recommendation and try to use `torch.view_as_real()` to recover the real tensor?
@unilight Hi, Dr. Huang! In the cascade method, since mel spectrograms are used for feature extraction, I assume the feature extraction model is language-independent. If I want to train an accent conversion model on a different language, can I start here by using the provided model for feature extraction? Thanks!
Hi @Jasmijn888,

The mel spectrogram is indeed language-independent, so you can use it for any language. Though I don't quite understand what you mean by "start here by using the provided model for feature extraction". If you want to use your own dataset, you need to use your desired dataset to train (1) a neural vocoder (e.g., ParallelWaveGAN) and (2) a non-parallel frame-based model provided by s3prl-vc.
Hi @unilight,

> I've tried to run the code on my local server again but I did not encounter the overflow issue. I can only suspect it's because of the new stft argument setting. Can you follow the official recommendation and try to use `torch.view_as_real()` to recover the real tensor?

Thanks, I've resolved the STFT problem with the modification mentioned earlier.
I also think I've pinpointed the cause of the audio overflow. I denormalized the output feature in `s3prl-vc-decode` by modifying `tools/venv/lib/python3.10/site-packages/s3prl_vc/bin/decode.py` at line 257:
```python
# model forward
out, _, _olens = model(hs, hlens, spk_embs=spemb, f0s=f0s)
if out.dim() != 2:
    out = out.squeeze(0)
# try denormalize
if "s3prl-vc-ppg_sxliu" in args.trg_stats:
    out = out * config["trg_stats"]["scale"] + config["trg_stats"]["mean"]
```
This adjustment delivered reasonable results for me. The mean CER and WER are now 30.2 and 52.5, respectively, similar to those reported in your paper. fac_cascade_denormalized.zip
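The fix above is the standard inverse of per-bin z-score normalization. A small numpy sketch of the transform pair, with made-up statistics (the `mean`/`scale` values and shapes here are illustrative, not the repo's actual `trg_stats`):

```python
import numpy as np

# Hypothetical target-domain statistics: per-mel-bin mean and scale (std).
mean = np.array([-4.0, -3.5, -3.0])
scale = np.array([1.2, 1.1, 0.9])

# A model output in the normalized domain (roughly zero mean, unit variance).
out_norm = np.array([[0.1, -0.3, 0.8],
                     [1.0, 0.0, -1.2]])

# Denormalize exactly as in the decode.py fix: x * scale + mean.
out = out_norm * scale + mean

# Normalizing again recovers the original, confirming the two transforms
# are inverses; skipping the denormalize step feeds the vocoder features
# on the wrong scale, which is what produced the overflowing wav output.
back = (out - mean) / scale
print(np.allclose(back, out_norm))  # True
```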
I'm not sure if there was an issue during my data processing. I'll check whether `stg` also has this same problem.
@KevinGengGavo This is indeed a bug in the s3prl_vc package, and the solution is indeed to add the line to denormalize the feature. I have fixed it and published the latest s3prl_vc package. If anyone is still having this issue, make sure to update the s3prl_vc package to 0.3.1. Thanks!
Hi @unilight, long time no see. Congratulations on your graduation! I should call you sensei now!

Issue

I finished `--stage -1` to stage 5 and generated promising, non-accented `bdl`-like voices. However, during the non-parallel conversion, I don't think we can get access to the original vocoder config. Shall we change this to our local `pwg_TXHC`? As you mentioned in `lsc/README.md`.

Looking forward to your reply!