[Open] yrsn509 opened this issue 2 years ago
You could search for how to configure conda mirrors to speed this up.
That doesn't help. I've tried both the Tsinghua and Aliyun mirrors and it still doesn't work.
Did you add this line? conda config --add channels conda-forge
I did, but isn't that channel hosted overseas? It's very slow.
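(For what it's worth, the conda-forge channel itself can be pointed at the Tsinghua TUNA mirror instead of the overseas server. A minimal sketch, assuming the TUNA mirror still uses this URL layout:

    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
    conda config --set show_channel_urls yes
)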
Now it's stuck here:
Solving environment: failed with initial frozen solve. Retrying with flexible solve. Solving environment: /
I haven't run into this either. See whether one of the other installation methods here helps: https://montreal-forced-aligner.readthedocs.io/en/latest/installation.html
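(For reference, the linked docs install MFA into a fresh environment, which tends to sidestep frozen-solve conflicts, and they also suggest mamba as a much faster solver. Roughly, untested here:

    conda create -n aligner -c conda-forge montreal-forced-aligner
    # or, with the faster mamba solver:
    conda install -c conda-forge mamba
    mamba create -n aligner -c conda-forge montreal-forced-aligner
)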
I read the official docs, and it seems G2P can't be used on Windows? Does G2P affect the speech synthesis functionality?
It doesn't. G2P is only used for aligning phonemes; as long as you have a dictionary you're fine.
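(For context: a pronunciation dictionary here is just a plain-text file mapping each entry, in this pipeline a pinyin syllable, to its phones, one entry per line. The entries below are purely illustrative, not copied from mandarin_china_mfa:

    ni3     n i3
    hao3    h ao3
)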
Do I need to install these two things?
mfa models download acoustic mandarin_mfa
mfa models download dictionary mandarin_china_mfa
I can't install them from the command line; it keeps failing with network errors (even though my VPN works), so I downloaded them manually from GitHub, but I don't know where to put them.
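(If the download keeps failing, one workaround, assuming MFA 2.x behavior, is to skip registered model names entirely: mfa align also accepts filesystem paths to a dictionary and an acoustic model archive, so the manually downloaded files can be passed directly. The filenames below are placeholders for whatever you downloaded:

    mfa align ./corpus ./mandarin_china_mfa.dict ./mandarin_mfa.zip ./TextGrid
)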
Also, I followed the MFA site's installation guide and ended up with Python 3.9. Will that break VtuberTalk?
It shouldn't matter, I think...
Also, at which step is the TextGrid folder generated? How do I create it myself? (I'm planning to migrate a MockingBird dataset over for training, so I'd rather not redo some of the steps.)
The TextGrid files are generated by MFA, which derives phonemes from the pinyin; you need .wav and .lab files in one-to-one correspondence.
So the TextGrids are only produced at the MFA step? Then why does TextGrid already show up in step 2.9 of the README? And where do the .lab files come from...
Step 2.9 is probably a mistake; TextGrid_temp isn't needed. A .lab file is just the pinyin of one audio clip, used to generate the TextGrid.
Could you tell me the format of the subtitle and .lab files? Is it like this (subtitle above, lab below)? Subtitle format: filename + space + text
No. I'd suggest running run_preprocess.sh on a small sample first. The .lab, .txt, and .wav files correspond one-to-one, and the .lab contains only pinyin. For example, if the audio says 你好, the .lab is ni3 hao3.
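(If you already have .txt transcripts, as in a MockingBird dataset, here is a minimal sketch for generating the matching .lab files. It assumes the pypinyin package and a hypothetical layout where each .txt sits next to its .wav; VTuberTalk's own preprocessing may differ:

    from pathlib import Path
    from pypinyin import lazy_pinyin, Style

    corpus = Path("data/wav")  # hypothetical layout: utt001.wav / utt001.txt side by side
    for txt in corpus.glob("**/*.txt"):
        text = txt.read_text(encoding="utf-8").strip()
        # Style.TONE3 puts the tone number after each syllable: 你好 -> ['ni3', 'hao3']
        syllables = lazy_pinyin(text, style=Style.TONE3, neutral_tone_with_five=True)
        txt.with_suffix(".lab").write_text(" ".join(syllables), encoding="utf-8")
)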
How do I run the .sh file? I can't seem to run it from VS Code.
It probably won't work on Windows; you could use Docker.
But you can split it up and run the steps one at a time; the Linux-specific commands definitely won't work on Windows.
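(Another option, untested in this thread: with WSL installed, the script can be run in a real Linux environment directly from Windows:

    wsl bash run_preprocess.sh
)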
Finally made it to the MFA step (nearly cried). How many clips does the dataset need for decent results? Is more always better?
MFA feels endless... monophone, triphone, lda, sat, sat_2... At what stage does it actually finish?
Got an error while exporting the model:
Trained model: fastspeech2_aishell3_english (that's the name, but it's actually Chinese)
Vocoder: pwg_aishell3_ckpt_0.5
voc_config is default!
model: fastspeech2, multiple
C:\Users\yrsn509\AppData\Local\Programs\Python\Python39\lib\site-packages\librosa\core\constantq.py:1059: DeprecationWarning: `np.complex` is a deprecated alias for the builtin `complex`. To silence this warning, use `complex` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.complex128` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.complex,
========Args========
am: fastspeech2_aishell3
am_ckpt: exp/fastspeech2_aishell3_english/checkpoints/snapshot_iter_6300.pdz
am_config: exp/fastspeech2_aishell3_english/default_multi.yaml
am_stat: exp/fastspeech2_aishell3_english/speech_stats.npy
energy_stat: exp/fastspeech2_aishell3_english/energy_stats.npy
inference_dir: train/inference
lang: zh
ngpu: 0
output_dir: train/test_e2e
phones_dict: exp/fastspeech2_aishell3_english/phone_id_map.txt
pitch_stat: exp/fastspeech2_aishell3_english/pitch_stats.npy
speaker_dict: exp/fastspeech2_aishell3_english/speaker_id_map.txt
spk_id: 175
text: sentences.txt
tones_dict: null
use_gst: false
use_style: true
use_vae: false
voc: pwgan_aishell3
voc_ckpt: pretrained_models/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz
voc_config: pretrained_models/pwg_aishell3_ckpt_0.5/default.yaml
voc_stat: pretrained_models/pwg_aishell3_ckpt_0.5/feats_stats.npy
========Config========
batch_size: 8
f0max: 400
f0min: 80
fmax: 7600
fmin: 80
fs: 24000
max_epoch: 100
model:
  adim: 384
  aheads: 2
  decoder_normalize_before: True
  dlayers: 4
  dunits: 1536
  duration_predictor_chans: 256
  duration_predictor_kernel_size: 3
  duration_predictor_layers: 2
  elayers: 4
  encoder_normalize_before: True
  energy_embed_dropout: 0.0
  energy_embed_kernel_size: 1
  energy_predictor_chans: 256
  energy_predictor_dropout: 0.5
  energy_predictor_kernel_size: 3
  energy_predictor_layers: 2
  eunits: 1536
  init_dec_alpha: 1.0
  init_enc_alpha: 1.0
  init_type: xavier_uniform
  pitch_embed_dropout: 0.0
  pitch_embed_kernel_size: 1
  pitch_predictor_chans: 256
  pitch_predictor_dropout: 0.5
  pitch_predictor_kernel_size: 5
  pitch_predictor_layers: 5
  positionwise_conv_kernel_size: 3
  positionwise_layer_type: conv1d
  postnet_chans: 256
  postnet_filts: 5
  postnet_layers: 5
  reduction_factor: 1
  spk_embed_dim: 256
  spk_embed_integration_type: concat
  stop_gradient_from_energy_predictor: False
  stop_gradient_from_pitch_predictor: True
  transformer_dec_attn_dropout_rate: 0.2
  transformer_dec_dropout_rate: 0.2
  transformer_dec_positional_dropout_rate: 0.2
  transformer_enc_attn_dropout_rate: 0.2
  transformer_enc_dropout_rate: 0.2
  transformer_enc_positional_dropout_rate: 0.2
  use_scaled_pos_enc: True
n_fft: 2048
n_mels: 80
n_shift: 300
num_snapshots: 5
num_workers: 2
optimizer:
  learning_rate: 0.001
  optim: adam
seed: 10086
updater:
  use_masking: True
win_length: 1200
window: hann
allow_cache: True
batch_max_steps: 24000
batch_size: 8
discriminator_grad_norm: 1
discriminator_optimizer_params:
  epsilon: 1e-06
  weight_decay: 0.0
discriminator_params:
  bias: True
  conv_channels: 64
  in_channels: 1
  kernel_size: 3
  layers: 10
  nonlinear_activation: LeakyReLU
  nonlinear_activation_params:
    negative_slope: 0.2
  out_channels: 1
  use_weight_norm: True
discriminator_scheduler_params:
  gamma: 0.5
  learning_rate: 5e-05
  step_size: 200000
discriminator_train_start_steps: 100000
eval_interval_steps: 1000
fmax: 7600
fmin: 80
fs: 24000
generator_grad_norm: 10
generator_optimizer_params:
  epsilon: 1e-06
  weight_decay: 0.0
generator_params:
  aux_channels: 80
  aux_context_window: 2
  dropout: 0.0
  gate_channels: 128
  in_channels: 1
  kernel_size: 3
  layers: 30
  out_channels: 1
  residual_channels: 64
  skip_channels: 64
  stacks: 3
  upsample_scales: [4, 5, 3, 5]
  use_weight_norm: True
generator_scheduler_params:
  gamma: 0.5
  learning_rate: 0.0001
  step_size: 200000
lambda_adv: 4.0
n_fft: 2048
n_mels: 80
n_shift: 300
num_save_intermediate_results: 4
num_snapshots: 10
num_workers: 4
pin_memory: True
remove_short_samples: True
save_interval_steps: 5000
seed: 42
stft_loss_params:
  fft_sizes: [1024, 2048, 512]
  hop_sizes: [120, 240, 50]
  win_lengths: [600, 1200, 240]
  window: hann
train_max_steps: 1000000
win_length: 1200
window: hann
exp/fastspeech2_aishell3_english/phone_id_map.txt
frontend done!
vocab_size: 180
spk_num: 1
encoder_type is transformer
decoder_type is transformer
acoustic model done!
voc done!
Building prefix dict from the default dictionary ...
DEBUG 2022-06-03 21:58:00,970 init.py:113] Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\yrsn509\AppData\Local\Temp\jieba.cache
DEBUG 2022-06-03 21:58:01,479 init.py:146] Dumping model to file cache C:\Users\yrsn509\AppData\Local\Temp\jieba.cache
Loading model cost 0.558 seconds.
DEBUG 2022-06-03 21:58:01,529 init.py:164] Loading model cost 0.558 seconds.
Prefix dict has been built successfully.
DEBUG 2022-06-03 21:58:01,530 init.py:166] Prefix dict has been built successfully.
C:\Users\yrsn509\AppData\Local\Programs\Python\Python39\lib\site-packages\paddle\fluid\dygraph\math_op_patch.py:276: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.int64, but right dtype is paddle.int32, the right dtype will convert to paddle.int64
warnings.warn(
Traceback (most recent call last):
File "D:\VTuberTalk\train\exps\synthesize_e2e.py", line 333, in
You need to change the model name and the corresponding speaker id in the script to match the model you trained yourself.
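(From the log above: the model loaded with spk_num: 1 while the script was given spk_id: 175, so the speaker id must be one that actually appears in speaker_id_map.txt. A sketch of the invocation, assuming the flag names match the keys in the ========Args======== dump; the <your_model> placeholders and iteration number are yours to fill in:

    python train/exps/synthesize_e2e.py \
        --am=fastspeech2_aishell3 \
        --am_ckpt=exp/<your_model>/checkpoints/snapshot_iter_<N>.pdz \
        --am_config=exp/<your_model>/default_multi.yaml \
        --phones_dict=exp/<your_model>/phone_id_map.txt \
        --speaker_dict=exp/<your_model>/speaker_id_map.txt \
        --spk_id=0 \
        --text=sentences.txt
)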
This command is far too slow and keeps erroring out: conda install montreal-forced-aligner. So I just pip-installed it in the virtual environment instead, but then MFA's third-party dependencies won't install. Is there a better way?