mct10 / RepCodec

Models and code for RepCodec: A Speech Representation Codec for Speech Tokenization

The vocoder for the released repCodec model #5

Closed draplater closed 2 months ago

draplater commented 4 months ago

I see that only the encoder part of the RepCodec model is released. Without the vocoder, I'm unable to recover the sound from the tokens. Do you have a plan to release the vocoder?

mct10 commented 3 months ago

Hi, we released a vocoder trained on RepCodec units of HuBERT large layer 18 features. You can find it here. For other types of RepCodec units and datasets, you can train your own vocoders following the instructions here or the official speech-resynthesis project.
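As a toy sketch of what such a unit-based vocoder does (NOT the released model — the real one is a unit HiFi-GAN from the speech-resynthesis project; all class names and sizes below are illustrative): embed the discrete token ids, then upsample the embedding sequence back to waveform length.

```python
import torch
import torch.nn as nn

# Toy sketch (not the released vocoder): a unit-based vocoder embeds discrete
# RepCodec token ids, then an upsampling stack maps them back to samples.
class ToyUnitVocoder(nn.Module):
    def __init__(self, num_units: int = 1024, dim: int = 128, hop: int = 320):
        super().__init__()
        self.embed = nn.Embedding(num_units, dim)  # token id -> vector
        self.proj = nn.Linear(dim, hop)            # stand-in for the HiFi-GAN upsampler

    def forward(self, units: torch.Tensor) -> torch.Tensor:
        x = self.embed(units)             # (T, dim)
        return self.proj(x).reshape(-1)   # (T * hop,) waveform-length output

voc = ToyUnitVocoder()
units = torch.randint(0, 1024, (151,))    # 151 tokens ~ 3 s of 16 kHz audio
wav = voc(units)
print(wav.shape)                          # torch.Size([48320])
```

The hop of 320 matches HuBERT's 20 ms frame rate at 16 kHz; a real vocoder replaces the linear projection with transposed-convolution upsampling blocks.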

Irving-ren commented 2 months ago

@mct10 Hi, great work! I was trying to reconstruct the wav from HuBERT large layer 18 tokens using the vocoder released here for RepCodec. However, it failed with a missing-file error:

[Errno 2] No such file or directory: 'checkpoints/vctk_f0_vq/g_00400000

Is the f0 quantizer necessary for generating the wav? Could you give some clues about it? Thanks a lot.

mct10 commented 2 months ago

Hi, thanks for reporting it. Yes, the f0 quantizer is needed. I just uploaded it (the link is here). Can you try again? You might need to modify f0_quantizer_path in the config, by the way.
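For reference, the config entry would look roughly like this (a sketch: the checkpoint path comes from the error above, and the surrounding fields of the speech-resynthesis config are omitted):

```json
{
  "f0_quantizer_path": "checkpoints/vctk_f0_vq/g_00400000"
}
```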

Irving-ren commented 2 months ago

Thanks for your response. After pointing to the correct f0 quantizer and updating its config, I still get an error when extracting tokens from the HuBERT latents. The specific traceback is as follows:

    File "/opt/conda/lib/python3.8/site-packages/numpy/lib/arraypad.py", line 743, in pad
        pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
    File "/opt/conda/lib/python3.8/site-packages/numpy/lib/arraypad.py", line 518, in _as_pairs
        return np.broadcast_to(x, (ndim, 2)).tolist()
    File "<__array_function__ internals>", line 180, in broadcast_to
    File "/opt/conda/lib/python3.8/site-packages/numpy/lib/stride_tricks.py", line 413, in broadcast_to
        return _broadcast_to(array, shape, subok=subok, readonly=True)
    File "/opt/conda/lib/python3.8/site-packages/numpy/lib/stride_tricks.py", line 349, in _broadcast_to
        it = np.nditer(
    ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (2,2) and requested shape (3,2)
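For illustration, this failure can be reproduced in isolation (an assumption about its cause, based on the shapes in the traceback): np.pad received two (before, after) pairs, but the array has three dimensions, so a pad_width of shape (2, 2) cannot be broadcast to the requested (ndim, 2) = (3, 2).

```python
import numpy as np

# Reproducing the broadcast failure: two (before, after) pairs for a 3-D array.
arr = np.zeros((4, 5, 6))            # ndim = 3, e.g. an unexpectedly batched input
try:
    np.pad(arr, ((1, 1), (2, 2)))    # only 2 pairs -> ValueError
except ValueError as e:
    print(e)

# With a 2-D array the same pad_width works fine:
print(np.pad(np.zeros((4, 5)), ((1, 1), (2, 2))).shape)  # (6, 9)
```

So the error usually means the input array has an extra (e.g. batch or channel) dimension the padding code did not expect.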

It seems like something is wrong with the padding. Have you encountered this error? Thanks for your kindness.

mct10 commented 2 months ago

That's a bit weird..

I still get an error when extracting tokens from the HuBERT latents.

To clarify, did the error occur while you were extracting HuBERT features, or while you were applying RepCodec to the extracted HuBERT features? So it is not related to the vocoder and speech resynthesis? Did you use the script we provide? If so, do you know which line exactly caused the error? Also, which dataset are you using?

Irving-ren commented 2 months ago

Sorry for the late reply. BTW, I fixed the above padding error in step 1. However, I'm still stuck at a later step. Let me show the specific steps I was running:

step 0: prepare the tsv file
step 1: use dump_feature.py to extract the 0_1.len and 0_1.npy files, which store the latent representation (using HuBERT in this case). Is it correct that the 0_1.npy file stores the latent representation?
step 2: use repcodec_tokenizer.py to get the tokens, right? (using hubert_large_l18.pkl)

For now, I'm facing the following error while running step 2:

    File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 310, in forward
        return self._conv_forward(input, self.weight, self.bias)
    File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
        return F.conv1d(input, weight, bias, self.stride,
    RuntimeError: Given groups=1, weight of size [1024, 1024, 3], expected input[3, 32, 583] to have 1024 channels, but got 32 channels instead
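The channel mismatch can be reproduced in isolation (a sketch; the shapes come from the error message, not from RepCodec's actual code). A Conv1d expects input of shape (batch, in_channels, time), so features with dimension 32 instead of 1024 fail immediately:

```python
import torch
import torch.nn as nn

# A Conv1d built for 1024-dim features, as in the error message above.
conv = nn.Conv1d(in_channels=1024, out_channels=1024, kernel_size=3)

bad = torch.randn(3, 32, 583)       # 32 channels where 1024 are expected
try:
    conv(bad)
except RuntimeError as e:
    print(e)                        # "... expected input[3, 32, 583] to have 1024 channels ..."

good = torch.randn(3, 1024, 583)    # correct feature dimension
print(conv(good).shape)             # torch.Size([3, 1024, 581])
```

In other words, the convolution itself is fine; the features fed to it have the wrong dimension (32 instead of 1024).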

Is my understanding of each step above correct? If so, do you have any comments on this channel error? Please correct me if I'm wrong. Thanks for your patience.

mct10 commented 2 months ago

It looks like you are doing everything correctly...

Is it correct that the 0_1.npy file stores the latent representation?

Yes.

using repcodec_tokenizer.py to get the token, right? (using hubert_large_l18.pkl)

I think you mean repcodec/tokenize.py. It is correct to use hubert_large_l18.pkl.

But it seems like the input representation has a wrong dimension (3, 32, 583). I am not sure why. Do you mind sharing the corresponding speech file that caused this error? Also, you can double check the dimension of 0_1.npy, which should be (N, 1024) where N is the number of speech files you have in the tsv file.

Irving-ren commented 2 months ago

But it seems like the input representation has a wrong dimension (3, 32, 583). I am not sure why. Do you mind sharing the corresponding speech file that caused this error?

Yes, I noticed the possible reason for this one. Actually, the error occurred when extracting the HuBERT latents (0_1.npy).

After installing the sox-devel package, it errors out when running this line:

wav = get_features_or_waveform(path, need_waveform=True, use_sample_rate=self.task.cfg.sample_rate)


The specific ipdb log inside the call is as follows:

    ipdb> n
    > /opt/conda/lib/python3.8/site-packages/fairseq/data/audio/audio_utils.py(168)get_features_or_waveform()
        167         if need_waveform:
    --> 168             return get_waveform(
        169                 _path, always_2d=False, output_sample_rate=use_sample_rate

    ipdb> n
    Segmentation fault (core dumped)

This error is strange; it might be a memory OOM in the sox effects function, I guess.

To work around this, I replaced get_features_or_waveform with soundfile.read; that function is just used for loading the wav file, correct? That solved the segmentation fault. However, the saved numpy array's shape (Sequence, Batch, Dim) is incorrect. Obviously, the Dim should be 1024:

    print(test.shape)
    (1146, 1, 32)

Do you have any suggestions about this one? Thanks for your support.

mct10 commented 2 months ago

I think it's fine to use soundfile.read(). Can you check the shape of the output of the read_audio() function (https://github.com/mct10/RepCodec/blob/main/examples/hubert_feature_reader.py#L37)? It should be (N, ) where N is the length of the raw audio. Can you also check the shape of the output of the get_feats() function (https://github.com/mct10/RepCodec/blob/main/examples/hubert_feature_reader.py#L46)? It should be (M, 1024), where M is the sequence length (~N/320).
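The suggested checks can be sketched as a small helper (16 kHz audio, HuBERT frame hop of 320 samples; the function name is illustrative, not RepCodec's API):

```python
import numpy as np

def check_shapes(wav: np.ndarray, feats: np.ndarray, dim: int = 1024) -> None:
    # read_audio() output: a 1-D waveform of shape (N,)
    assert wav.ndim == 1, f"expected (N,), got {wav.shape}"
    # get_feats() output: (M, dim) where M ~ N / 320
    assert feats.ndim == 2 and feats.shape[1] == dim, \
        f"expected (M, {dim}), got {feats.shape}"
    assert abs(feats.shape[0] - wav.shape[0] // 320) <= 2, \
        "frame count should be roughly N / 320"

# e.g. a 48427-sample wav should yield about 151 frames of 1024-dim features
check_shapes(np.zeros(48427), np.zeros((151, 1024)))
```

A shape like (151, 1, 32) would fail the second assertion, which is exactly the symptom discussed below.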

Irving-ren commented 2 months ago

Can you check the shape of the output of the read_audio() function (https://github.com/mct10/RepCodec/blob/main/examples/hubert_feature_reader.py#L37)? It should be (N, ) where N is the length of the raw audio.

After resampling to 16k, the length of the audio is:

    ipdb> pp wav.shape
    torch.Size([48427])

And the related feature's shape is:

    ipdb> pp feat_dict.keys()
    dict_keys(['encoder_out', 'encoder_padding_mask', 'padding_mask'])
    ipdb> pp feat_dict["encoder_out"].shape
    torch.Size([151, 1, 32])

So the Seq dimension is correct, but the Dim is not. Is this related to the default config? I guess the config is saved in the HuBERT model, right?

mct10 commented 2 months ago

Is this related to the default config? I guess the config is saved in the HuBERT model, right?

It shouldn't be related to config.

But are you using Fairseq to generate the representations, like the script I provided? Because I'm not sure how other methods work...

Irving-ren commented 2 months ago

Sure, here is the get_feats function:

    def get_feats(self, path, ref_len=None):
        x = self.read_audio(path, ref_len=ref_len)
        import ipdb; ipdb.set_trace()
        with torch.no_grad():
            x = torch.from_numpy(x).float().to(self.device)
            x = x.float().to(self.device)
            if self.task.cfg.normalize:
                x = F.layer_norm(x, x.shape)
            x = x.view(1, -1)

            feat = []
            for start in range(0, x.size(1), self.max_chunk):
                x_chunk = x[:, start: start + self.max_chunk]
                feat_dict = self.model.extract_features(
                    source=x_chunk,
                    padding_mask=None,
                    mask=False,
                    output_layer=self.layer,
                )
                feat_list = list(feat_dict.values())
                feat_chunk = feat_list[0]
                feat.append(feat_chunk)
            return torch.cat(feat, 1).squeeze(0)

The main extraction script is the original one from RepCodec, adjusted with minor output changes. BTW, the rest of the script is completely the same besides the resampling part.

mct10 commented 2 months ago

What confuses me is the output of self.model.extract_features. It shouldn't be a dict; it should be a tuple according to https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/hubert/hubert.py#L533. Why did you change it to a dict? Was there an error when parsing the result as a tuple? Did you download the HuBERT model from here?

Irving-ren commented 2 months ago

Yes, I am using the HuBERT model from here. However, it errored when using the original script, which expects the tuple output type. So I changed it to get it running, and it still uses fairseq for inference, for sure.

mct10 commented 2 months ago

Ok, then I suspect there is something wrong when running self.model.extract_features. You may need to double check the modified script.

Irving-ren commented 2 months ago

If I use the original script, it shows the following error when running self.model.extract_features. Could you tell me how to solve this error without changing the output type to dict? Thanks a lot.

ValueError: too many values to unpack (expected 2)
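For illustration, a minimal Python sketch (not fairseq itself) of why this unpack fails, assuming the dict output seen earlier in the thread: the pretrained HuBERT's extract_features returns a (features, padding_mask) tuple, but the other checkpoint returns a dict with three keys, and unpacking a dict iterates over its keys.

```python
# Tuple return, as the original script expects:
pretrained_out = ("features", None)
feat, _ = pretrained_out             # unpacks fine

# Dict return with three keys, as shown by ipdb above:
finetuned_out = {
    "encoder_out": "tensor",
    "encoder_padding_mask": None,
    "padding_mask": None,
}
try:
    feat, _ = finetuned_out          # 3 keys into 2 names
except ValueError as e:
    print(e)                         # too many values to unpack (expected 2)
```

So the unpack error itself is a symptom of the model returning the wrong output type, not a bug in the unpacking line.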

mct10 commented 2 months ago

This is also weird because I didn't get this error... Can you tell me what self.model.extract_features returns exactly? Also, what version of fairseq are you using?

Irving-ren commented 2 months ago
         68                 x_chunk = x[:, start: start + self.max_chunk]
    ---> 69                 feat_chunk, _ = self.model.extract_features(
         70                     source=x_chunk,

  ipdb> s
  --Call--
  > /opt/conda/lib/python3.8/site-packages/fairseq/models/fairseq_model.py(95)extract_features()
       94 
  ---> 95     def extract_features(self, *args, **kwargs):
       96         """Similar to *forward* but only return features."""

  ipdb> l
       90                 return F.log_softmax(logits, dim=-1)
       91             else:
       92                 return F.softmax(logits, dim=-1)
       93         raise NotImplementedError
       94 
  ---> 95     def extract_features(self, *args, **kwargs):
       96         """Similar to *forward* but only return features."""
       97         return self(*args, **kwargs)
       98 
       99     def max_positions(self):
      100         """Maximum length supported by the model."""

  ipdb> n
  > /opt/conda/lib/python3.8/site-packages/fairseq/models/fairseq_model.py(97)extract_features()
       96         """Similar to *forward* but only return features."""
  ---> 97         return self(*args, **kwargs)
       98 

  ipdb> n
  --Return--
  {'encoder_out': tensor([[[  9...vice='cuda:0'), 'encoder_padding_mask': None, 'padding_mask': None}
  > /opt/conda/lib/python3.8/site-packages/fairseq/models/fairseq_model.py(97)extract_features()
       96         """Similar to *forward* but only return features."""
  ---> 97         return self(*args, **kwargs)

  ipdb> test = self(*args, **kwargs)
  ipdb> pp test["encoder_out"].shape
  torch.Size([151, 1, 32])

  ipdb> n
  ValueError: too many values to unpack (expected 2)

I still get dim 32 when using the original code, with fairseq==0.12.2. Which version should be used with the original script?

mct10 commented 2 months ago

I am now suspecting you may have used the wrong HuBERT model. Can you double check the link you used for downloading HuBERT? Note that you should download the one without finetuning, i.e., hubert_large_ll60k.pt.

Irving-ren commented 2 months ago

It turns out you are right: I was using the wrong HuBERT model. Sorry about that. Now I got the copy-synthesis results. Here are some remaining doubts:

  1. What's the difference between _gen.wav and _0_gen.wav? They seem exactly the same judging by the Mel spectrogram.
  2. The gen.wav is missing the timbre info when compared with the GT wav; it only contains the content info and global prosody. Is there no timbre info (e.g. an x-vector) as extra input when reconstructing the wav from tokens?

From here, the output does contain the timbre info for sure. Is there any change to the model structure relative to the original speech-resynthesis repo?

mct10 commented 2 months ago

_gen.wav uses the ground-truth speaker, while _{k}_gen.wav uses a randomly sampled speaker. We use the default setups in speech-resynthesis without any modifications, so you can refer to the speech-resynthesis project for more details.