convert multi-channel input into single-channel

Hi, my Japanese voice AI enthusiast community and I like this product.

This is a trivial PR. If you don't need it, please feel free to close it.

Abstract

If the input audio is stereo, I get the following error:

(venv) D:\NeuCoSVC>python infer.py --src_wav_path input.wav --ref_wav_path ref.wav --out_path out --speech_enroll
using cuda for inference.
Loading svc model configurations.
wavlm loaded.
loading models cost 6.86s.
Processing feats.
The wav file input.wav has 2 channels, select the first one to proceed.
D:\NeuCoSVC\venv\lib\site-packages\librosa\core\convert.py:1332: RuntimeWarning: divide by zero encountered in log10
  + 2 * np.log10(f_sq)
The wav file input.wav has 2 channels, select the first one to proceed.
pitch shift factor: 1.10
Original audio sr is 24000, change it to 16000.
Traceback (most recent call last):
  File "D:\NeuCoSVC\infer.py", line 153, in <module>
    VoiceConverter(test_utt=args.src_wav_path, ref_utt=args.ref_wav_path, out_path=args.out_path,
  File "D:\NeuCoSVC\infer.py", line 44, in VoiceConverter
    query_feats = wavlm_encoder.get_features(test_utt, weights=applied_weights)
  File "D:\NeuCoSVC\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\NeuCoSVC\modules\wavlm_encoder.py", line 97, in get_features
    features = (features*weights[:, None] ).sum(dim=0) # (1, seq_len, dim)
RuntimeError: The size of tensor a (50) must match the size of tensor b (25) at non-singleton dimension 0

The log message "The wav file input.wav has 2 channels, select the first one to proceed." and the docstring "test_utt (str): Path to the source singing waveform (24kHz, single-channel)." both indicate that the input audio should be single-channel, but the final error message (a tensor size mismatch of 50 vs. 25) was hard for me to connect to that cause.

The other processing steps select the first channel when the input is multi-channel, so this PR makes infer do the same. (Or would it be better to raise an error saying that the input audio should be single-channel?)
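To make the two options concrete, here is a minimal sketch of both behaviors, not the actual change in this PR. It assumes the waveform is a NumPy array laid out as (channels, samples); the helper name `ensure_single_channel` and its `strict` flag are hypothetical and do not exist in NeuCoSVC.

```python
import numpy as np


def ensure_single_channel(wav: np.ndarray, strict: bool = False) -> np.ndarray:
    """Return a 1-D waveform from a mono or multi-channel array.

    Hypothetical helper: assumes multi-channel input is (channels, samples).
    """
    if wav.ndim == 1:
        return wav  # already single-channel
    if wav.ndim != 2:
        raise ValueError(f"expected a 1-D or 2-D array, got {wav.ndim} dimensions")
    n_channels = wav.shape[0]
    if n_channels > 1:
        if strict:
            # Alternative behavior: reject multi-channel input outright.
            raise ValueError(
                f"input audio should be single-channel, got {n_channels} channels"
            )
        # Default behavior: warn and keep only the first channel.
        print(f"The input has {n_channels} channels, select the first one to proceed.")
    return wav[0]


# Small stereo example: two channels of four samples each.
stereo = np.stack([np.arange(4.0), np.arange(4.0) + 10.0])  # shape (2, 4)
mono = ensure_single_channel(stereo)  # keeps channel 0, shape (4,)
```

With `strict=False` the helper matches the "select the first one to proceed" behavior seen in the log; with `strict=True` it fails early with a message that names the real problem instead of a later shape mismatch.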

