jerryuhoo / VTuberTalk

Apache License 2.0
366 stars 54 forks

How do I install MFA? #19

Open yrsn509 opened 2 years ago

yrsn509 commented 2 years ago

This command is extremely slow and keeps erroring out: conda install montreal-forced-aligner. So I installed it directly with pip inside the virtual environment, but pip can't install MFA's third-party dependencies. Is there a better way?

jerryuhoo commented 2 years ago

You could search for how to configure a conda mirror to speed it up.
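For reference, a minimal sketch of pointing conda at the Tsinghua (TUNA) mirror of conda-forge; the mirror URL is the publicly documented one and is an assumption, not something taken from this thread:

```shell
# Assumption: use the TUNA mirror of conda-forge to speed up downloads.
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
conda config --set show_channel_urls yes
conda install montreal-forced-aligner
```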

yrsn509 commented 2 years ago

No luck. I've tried both the Tsinghua and Alibaba mirrors, and it still doesn't work.

jerryuhoo commented 2 years ago

Did you add this line? conda config --add channels conda-forge

yrsn509 commented 2 years ago

> Did you add this line? conda config --add channels conda-forge

Of course I did. But isn't that channel hosted overseas? It's very slow.

yrsn509 commented 2 years ago

> Did you add this line? conda config --add channels conda-forge

Now it's stuck here:

```
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: /
```

jerryuhoo commented 2 years ago

I haven't run into that myself. See whether any of the other installation methods here solve it: https://montreal-forced-aligner.readthedocs.io/en/latest/installation.html
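One option those docs have recommended is the faster mamba dependency solver; a hedged sketch (mamba itself is not mentioned anywhere in this thread):

```shell
# Assumption: mamba, a drop-in conda replacement with a faster solver,
# can get past the stuck "Solving environment" step.
conda install -c conda-forge mamba
mamba install -c conda-forge montreal-forced-aligner
```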

yrsn509 commented 2 years ago

> I haven't run into that myself. See whether any of the other installation methods here solve it: https://montreal-forced-aligner.readthedocs.io/en/latest/installation.html

I read the official docs; it seems G2P isn't available on the Windows platform? Does G2P affect the speech synthesis functionality?

jerryuhoo commented 2 years ago

It doesn't. That part is used for aligning phonemes; as long as you have a dictionary, it's fine.

yrsn509 commented 2 years ago

> It doesn't. That part is used for aligning phonemes; as long as you have a dictionary, it's fine.

Do I need to install these two things?

```
mfa models download acoustic mandarin_mfa
mfa models download dictionary mandarin_china_mfa
```

I can't install them from the command line; it keeps reporting network problems (even behind a proxy), so I had to download them manually from GitHub, but I don't know where to put them.
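(This question is never answered later in the thread. One hedged workaround, assuming MFA 2.x's align command accepts explicit file paths in place of saved model names; the file names below are hypothetical:)

```shell
# Assumption: mfa align can take a path to the manually downloaded
# dictionary and acoustic-model archive directly, so no "install"
# step is needed for them.
mfa align ./corpus ./mandarin_china_mfa.dict ./mandarin_mfa.zip ./TextGrid
```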

yrsn509 commented 2 years ago

> It doesn't. That part is used for aligning phonemes; as long as you have a dictionary, it's fine.

Also, following the MFA installation guide I ended up with Python 3.9. Will that stop VTuberTalk from working?

jerryuhoo commented 2 years ago

It shouldn't matter, I think...

yrsn509 commented 2 years ago

> It shouldn't matter, I think...

Also, at which step is the TextGrid folder generated? How can I create it myself? (I plan to migrate a MockingBird dataset over for training, so I'd rather not repeat some of the steps.)

jerryuhoo commented 2 years ago

The TextGrid files are generated by MFA, which turns the pinyin into phonemes; you need .wav and .lab files in one-to-one correspondence.
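As a hedged illustration (the file names are hypothetical; the model and dictionary names are the ones mentioned earlier in this thread), the corpus layout and the MFA call that produces the TextGrid folder would look roughly like:

```shell
# Hypothetical corpus layout: one .lab transcript per .wav clip.
#   corpus/
#     0001.wav
#     0001.lab
#     0002.wav
#     0002.lab
# MFA 2.x alignment; writes one .TextGrid per .wav into ./TextGrid.
mfa align ./corpus mandarin_china_mfa mandarin_mfa ./TextGrid
```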

yrsn509 commented 2 years ago

> The TextGrid files are generated by MFA, which turns the pinyin into phonemes; you need .wav and .lab files in one-to-one correspondence.

So TextGrid is only produced at the MFA step? Then why does TextGrid already show up at step 2.9 of the README? And where do the .lab files come from...

jerryuhoo commented 2 years ago

Step 2.9 is probably written wrong; TextGrid_temp isn't needed. A .lab file is just the pinyin for one audio clip, and it's used to generate the TextGrid.

yrsn509 commented 2 years ago

> Step 2.9 is probably written wrong; TextGrid_temp isn't needed. A .lab file is just the pinyin for one audio clip, and it's used to generate the TextGrid.

What formats do the subtitle file and the .lab file use? Is it like this (subtitle on top, lab below)?

Subtitle format: filename + space + text
aa bb

jerryuhoo commented 2 years ago

No. I'd suggest first running run_preprocess.sh on a small sample. The .lab, .txt, and .wav files correspond one-to-one, and the .lab contains only pinyin. For example, if the audio says 你好, the .lab is ni3 hao3.
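Spelled out with hypothetical file names (the 你好 example is from the reply above):

```shell
# One triplet per utterance; "0001" is a hypothetical base name.
cat 0001.txt   # 你好        (raw text)
cat 0001.lab   # ni3 hao3    (pinyin only)
# 0001.wav     -> the audio clip that says 你好
```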

yrsn509 commented 2 years ago

> No. I'd suggest first running run_preprocess.sh on a small sample. The .lab, .txt, and .wav files correspond one-to-one, and the .lab contains only pinyin. For example, if the audio says 你好, the .lab is ni3 hao3.

How do I run the .sh file? I can't seem to run it from VS Code.

jerryuhoo commented 2 years ago

It probably won't work on Windows; you can use Docker.

jerryuhoo commented 2 years ago

But you can run it by splitting it into individual steps; the Linux-specific commands definitely aren't supported on Windows.
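A hedged workaround sketch (WSL and Git Bash are alternatives not mentioned in this thread):

```shell
# Option 1: run the script under a Unix-like shell on Windows,
# e.g. WSL or Git Bash, instead of cmd/PowerShell.
bash run_preprocess.sh
# Option 2: open run_preprocess.sh in an editor and run each
# python command it contains one at a time in your own shell.
```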

yrsn509 commented 2 years ago

> But you can run it by splitting it into individual steps; the Linux-specific commands definitely aren't supported on Windows.

I finally made it to the MFA step, thank goodness. How many clips does the dataset need before results are decent? Is more always better?

yrsn509 commented 2 years ago

> But you can run it by splitting it into individual steps; the Linux-specific commands definitely aren't supported on Windows.

This MFA run feels endless... monophone, triphone, lda, SAT, SAT_2... At what stage does it actually finish?

yrsn509 commented 2 years ago

I got an error when exporting the model. Trained model: fastspeech2_aishell3_english (that's its name, but it's actually Chinese). Vocoder: pwg_aishell3_ckpt_0.5

```
voc_config is default!
model: fastspeech2, multiple
C:\Users\yrsn509\AppData\Local\Programs\Python\Python39\lib\site-packages\librosa\core\constantq.py:1059: DeprecationWarning: `np.complex` is a deprecated alias for the builtin `complex`. To silence this warning, use `complex` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.complex128` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.complex,
========Args========
am: fastspeech2_aishell3
am_ckpt: exp/fastspeech2_aishell3_english/checkpoints/snapshot_iter_6300.pdz
am_config: exp/fastspeech2_aishell3_english/default_multi.yaml
am_stat: exp/fastspeech2_aishell3_english/speech_stats.npy
energy_stat: exp/fastspeech2_aishell3_english/energy_stats.npy
inference_dir: train/inference
lang: zh
ngpu: 0
output_dir: train/test_e2e
phones_dict: exp/fastspeech2_aishell3_english/phone_id_map.txt
pitch_stat: exp/fastspeech2_aishell3_english/pitch_stats.npy
speaker_dict: exp/fastspeech2_aishell3_english/speaker_id_map.txt
spk_id: 175
text: sentences.txt
tones_dict: null
use_gst: false
use_style: true
use_vae: false
voc: pwgan_aishell3
voc_ckpt: pretrained_models/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz
voc_config: pretrained_models/pwg_aishell3_ckpt_0.5/default.yaml
voc_stat: pretrained_models/pwg_aishell3_ckpt_0.5/feats_stats.npy

========Config========
batch_size: 8
f0max: 400
f0min: 80
fmax: 7600
fmin: 80
fs: 24000
max_epoch: 100
model:
  adim: 384
  aheads: 2
  decoder_normalize_before: True
  dlayers: 4
  dunits: 1536
  duration_predictor_chans: 256
  duration_predictor_kernel_size: 3
  duration_predictor_layers: 2
  elayers: 4
  encoder_normalize_before: True
  energy_embed_dropout: 0.0
  energy_embed_kernel_size: 1
  energy_predictor_chans: 256
  energy_predictor_dropout: 0.5
  energy_predictor_kernel_size: 3
  energy_predictor_layers: 2
  eunits: 1536
  init_dec_alpha: 1.0
  init_enc_alpha: 1.0
  init_type: xavier_uniform
  pitch_embed_dropout: 0.0
  pitch_embed_kernel_size: 1
  pitch_predictor_chans: 256
  pitch_predictor_dropout: 0.5
  pitch_predictor_kernel_size: 5
  pitch_predictor_layers: 5
  positionwise_conv_kernel_size: 3
  positionwise_layer_type: conv1d
  postnet_chans: 256
  postnet_filts: 5
  postnet_layers: 5
  reduction_factor: 1
  spk_embed_dim: 256
  spk_embed_integration_type: concat
  stop_gradient_from_energy_predictor: False
  stop_gradient_from_pitch_predictor: True
  transformer_dec_attn_dropout_rate: 0.2
  transformer_dec_dropout_rate: 0.2
  transformer_dec_positional_dropout_rate: 0.2
  transformer_enc_attn_dropout_rate: 0.2
  transformer_enc_dropout_rate: 0.2
  transformer_enc_positional_dropout_rate: 0.2
  use_scaled_pos_enc: True
n_fft: 2048
n_mels: 80
n_shift: 300
num_snapshots: 5
num_workers: 2
optimizer:
  learning_rate: 0.001
  optim: adam
seed: 10086
updater:
  use_masking: True
win_length: 1200
window: hann
allow_cache: True
batch_max_steps: 24000
batch_size: 8
discriminator_grad_norm: 1
discriminator_optimizer_params:
  epsilon: 1e-06
  weight_decay: 0.0
discriminator_params:
  bias: True
  conv_channels: 64
  in_channels: 1
  kernel_size: 3
  layers: 10
  nonlinear_activation: LeakyReLU
  nonlinear_activation_params:
    negative_slope: 0.2
  out_channels: 1
  use_weight_norm: True
discriminator_scheduler_params:
  gamma: 0.5
  learning_rate: 5e-05
  step_size: 200000
discriminator_train_start_steps: 100000
eval_interval_steps: 1000
fmax: 7600
fmin: 80
fs: 24000
generator_grad_norm: 10
generator_optimizer_params:
  epsilon: 1e-06
  weight_decay: 0.0
generator_params:
  aux_channels: 80
  aux_context_window: 2
  dropout: 0.0
  gate_channels: 128
  in_channels: 1
  kernel_size: 3
  layers: 30
  out_channels: 1
  residual_channels: 64
  skip_channels: 64
  stacks: 3
  upsample_scales: [4, 5, 3, 5]
  use_weight_norm: True
generator_scheduler_params:
  gamma: 0.5
  learning_rate: 0.0001
  step_size: 200000
lambda_adv: 4.0
n_fft: 2048
n_mels: 80
n_shift: 300
num_save_intermediate_results: 4
num_snapshots: 10
num_workers: 4
pin_memory: True
remove_short_samples: True
save_interval_steps: 5000
seed: 42
stft_loss_params:
  fft_sizes: [1024, 2048, 512]
  hop_sizes: [120, 240, 50]
  win_lengths: [600, 1200, 240]
  window: hann
train_max_steps: 1000000
win_length: 1200
window: hann
exp/fastspeech2_aishell3_english/phone_id_map.txt
frontend done!
vocab_size: 180
spk_num: 1
encoder_type is transformer
decoder_type is transformer
acoustic model done!
voc done!
Building prefix dict from the default dictionary ...
DEBUG 2022-06-03 21:58:00,970 __init__.py:113] Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\yrsn509\AppData\Local\Temp\jieba.cache
DEBUG 2022-06-03 21:58:01,479 __init__.py:146] Dumping model to file cache C:\Users\yrsn509\AppData\Local\Temp\jieba.cache
Loading model cost 0.558 seconds.
DEBUG 2022-06-03 21:58:01,529 __init__.py:164] Loading model cost 0.558 seconds.
Prefix dict has been built successfully.
DEBUG 2022-06-03 21:58:01,530 __init__.py:166] Prefix dict has been built successfully.
C:\Users\yrsn509\AppData\Local\Programs\Python\Python39\lib\site-packages\paddle\fluid\dygraph\math_op_patch.py:276: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.int64, but right dtype is paddle.int32, the right dtype will convert to paddle.int64
  warnings.warn(
Traceback (most recent call last):
  File "D:\VTuberTalk\train\exps\synthesize_e2e.py", line 333, in <module>
    main()
  File "D:\VTuberTalk\train\exps\synthesize_e2e.py", line 329, in main
    evaluate(args)
  File "D:\VTuberTalk\train\exps\synthesize_e2e.py", line 131, in evaluate
    mel = am_inference(part_phone_ids, d_scale, d_bias, p_scale, p_bias, e_scale, e_bias, robot, spk_id)
  File "C:\Users\yrsn509\AppData\Local\Programs\Python\Python39\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "C:\Users\yrsn509\AppData\Local\Programs\Python\Python39\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\VTuberTalk\train/models\fastspeech2\fastspeech2.py", line 1027, in forward
    normalized_mel, d_outs, p_outs, e_outs, mu, logvar, z = self.acoustic_model.inference(
  File "D:\VTuberTalk\train/models\fastspeech2\fastspeech2.py", line 842, in inference
    _, outs, d_outs, p_outs, e_outs, mu, logvar, z = self._forward(
  File "D:\VTuberTalk\train/models\fastspeech2\fastspeech2.py", line 615, in _forward
    spk_emb = self.spk_embedding_table(spk_id)
  File "C:\Users\yrsn509\AppData\Local\Programs\Python\Python39\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "C:\Users\yrsn509\AppData\Local\Programs\Python\Python39\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "C:\Users\yrsn509\AppData\Local\Programs\Python\Python39\lib\site-packages\paddle\nn\layer\common.py", line 1464, in forward
    return F.embedding(
  File "C:\Users\yrsn509\AppData\Local\Programs\Python\Python39\lib\site-packages\paddle\nn\functional\input.py", line 204, in embedding
    return _C_ops.lookup_table_v2(
ValueError: (InvalidArgument) Variable value (input) of OP(fluid.layers.embedding) expected >= 0 and < 1, but got 175. Please check input value.
  [Hint: Expected ids[i] < row_number, but received ids[i]:175 >= row_number:1.] (at ..\paddle\phi\kernels\cpu\embedding_kernel.cc:63)
  [operator < lookup_table_v2 > error]
```

jerryuhoo commented 2 years ago

You need to edit the model name and the corresponding speaker id in the script to match the model you trained yourself.
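A hedged reading of the log above: the checkpoint was trained with spk_num: 1, so the speaker-embedding table has a single row and any spk_id other than 0 trips the lookup_table_v2 range check (ids[i]:175 >= row_number:1). A sketch of the fix, assuming spk_id is exposed as a CLI flag as the Args dump suggests:

```shell
# Assumption: synthesize_e2e.py reads --spk_id from the command line.
# With spk_num: 1, the only valid speaker index is 0.
python train/exps/synthesize_e2e.py --spk_id=0 ...
```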