mravanelli / pytorch-kaldi

pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by pytorch, while feature extraction, label computation, and decoding are performed with the kaldi toolkit.

supervised labels for training #69

Closed: narcise closed this issue 5 years ago

mravanelli commented 5 years ago

Hi, what do you mean exactly by "synchronize these two networks"? Do you mean that the two sets of posterior probabilities have different time resolutions?

Mirco

On Sun, 17 Mar 2019 at 14:06, narcise notifications@github.com wrote:

Hello. My project is about estimating articulatory trajectories from acoustic models. Previously, I trained both streams of acoustic and articulatory data with separate BLSTMs, and they produced different sets of posterior probabilities. However, I need to synchronize these two networks in order to estimate articulatory trajectories from test acoustic models, and the posterior probabilities cannot be synchronized. (I mean I cannot feed the posterior probabilities from the acoustic model to the articulatory BLSTM and estimate the articulatory features.) Now, I'm trying to use the static articulatory features as supervised labels for training an acoustic model with a BLSTM, so that I can estimate these labels (articulatory trajectories) from test acoustic features. Is it possible to use this set of labels (which are in .ark format) exactly like phoneme labels for speech recognition, and then predict these labels for test acoustic data? Any help would be greatly appreciated.


mravanelli commented 5 years ago

This should be possible, but you have to make sure that the two feature streams have the same number of time steps (e.g., if for sentence 1 you have 100 acoustic feature frames, you should have 100 articulatory feature frames as well). We recently added a part called "training an autoencoder" to the documentation. I think you can check it and modify the cfg files to use the articulatory features as targets...
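For instance, a quick sanity check that the two streams line up per utterance could look like the sketch below. It assumes the kaldi-io-for-python package (pytorch-kaldi's data_io provides similar readers), and the scp paths are only placeholders:

    import kaldi_io  # https://github.com/vesis84/kaldi-io-for-python

    # load both streams as {utterance_id: numpy matrix of shape (frames, dim)}
    acoustic = {k: m for k, m in kaldi_io.read_mat_scp('data/train/feats_fbank.scp')}
    articulatory = {k: m for k, m in kaldi_io.read_mat_scp('data/train/feats_art.scp')}

    for utt, mat in acoustic.items():
        if utt not in articulatory:
            print('no articulatory features for', utt)
        elif mat.shape[0] != articulatory[utt].shape[0]:
            print(utt, 'frame mismatch:', mat.shape[0], 'vs', articulatory[utt].shape[0])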

Mirco

On Sun, 17 Mar 2019 at 14:43, Narcissus notifications@github.com wrote:

I need to find the corresponding articulatory trajectories for each acoustic phoneme. I have two streams of acoustic-articulatory features that were recorded simultaneously for every utterance. Therefore, I extracted a different set of features (acoustic and articulatory) for each stream and then trained two BLSTMs, which means I can now do speech recognition in two different ways, using either articulatory or acoustic features. However, my goal is to predict articulatory features from acoustic features. I thought training two parallel BLSTMs would be helpful if I could use the posterior probabilities from the acoustic network (let's say the acoustic RNN recognized the phoneme 'X' for the test acoustic data) and then use the second network to predict the articulation trajectory (estimating which articulatory features can produce the posterior probability of 'X' in the articulatory RNN).


mravanelli commented 5 years ago

No, this is not important. Only the number of time steps should be the same.

On Sun, 17 Mar 2019 at 14:57, Narcissus notifications@github.com wrote:

Is the dimension also important? I have 10 articulatory features and 13 MFCCs, for example.


mravanelli commented 5 years ago

Hi, I'm not 100% sure I have understood the system you would like to implement. However, I think you can train two autoencoders plus one decoder that predicts the acoustic features, taking as input the encoded version of the articulatory features. The nice thing is that within pytorch-kaldi you can jointly train this complex system simply by changing the configuration files. For instance, your model section should be something like this:

[model]
model_proto = proto/model.proto
model = enc_acoustic=compute(MLP_encoder_acoustic,acoustic_fea)
    dec_acoustic=compute(MLP_decoder_acoustic,enc_acoustic)
    enc_articulatory=compute(MLP_encoder_articulatory,articulatory_fea)
    dec_articulatory=compute(MLP_decoder_articulatory,enc_articulatory)
    dec_articulatory2=compute(MLP_decoder_articulatory2,enc_acoustic)
    loss1=mse(dec_acoustic,acoustic_fea)
    loss2=mse(dec_articulatory,articulatory_fea)
    loss3=mse(dec_articulatory2,acoustic_fea)
    loss_sum1=sum(loss1,loss2)
    loss_final=sum(loss_sum1,loss3)
    err_final=cost_err(dec_acoustic,lab_cd)

Of course, you have to make sure to define in the cfg all the features and architectures used in the model section (see the available examples, in particular the autoencoder one).
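For clarity, the graph described by that [model] section corresponds roughly to the plain-PyTorch sketch below. pytorch-kaldi builds this for you from the cfg; the layer sizes and the mlp helper are only illustrative, and the err_final monitor on lab_cd is omitted:

    import torch
    import torch.nn as nn

    def mlp(n_in, n_out, n_hidden=1024):
        # stand-in for the MLP architectures defined in the [architecture*] sections
        return nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_out))

    # illustrative dimensions: 440-dim acoustic and 110-dim articulatory inputs, 100-dim codes
    MLP_encoder_acoustic = mlp(440, 100)
    MLP_decoder_acoustic = mlp(100, 440)
    MLP_encoder_articulatory = mlp(110, 100)
    MLP_decoder_articulatory = mlp(100, 110)
    MLP_decoder_articulatory2 = mlp(100, 440)  # fed by the acoustic code, compared with acoustic_fea
    mse = nn.MSELoss()

    acoustic_fea = torch.randn(8, 440)        # dummy minibatch
    articulatory_fea = torch.randn(8, 110)

    enc_acoustic = MLP_encoder_acoustic(acoustic_fea)
    enc_articulatory = MLP_encoder_articulatory(articulatory_fea)
    loss1 = mse(MLP_decoder_acoustic(enc_acoustic), acoustic_fea)              # acoustic autoencoder
    loss2 = mse(MLP_decoder_articulatory(enc_articulatory), articulatory_fea)  # articulatory autoencoder
    loss3 = mse(MLP_decoder_articulatory2(enc_acoustic), acoustic_fea)         # decoder fed by the acoustic code
    loss_final = (loss1 + loss2) + loss3                                       # loss_sum1 + loss3 in the cfg
    loss_final.backward()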

mravanelli commented 5 years ago

Hi, let me take a closer look when I have a little bit of time. Remember that you can define multiple features within the same dataset (this is what you should do in this case). For instance, you can define a single training dataset (e.g., EMA_MAE_tr) that contains both acoustic and articulatory features. Maybe you can take a look at our multi-feature example here: https://github.com/mravanelli/pytorch-kaldi/blob/master/cfg/TIMIT_baselines/TIMIT_mfcc_fbank_fmllr_liGRU_best.cfg
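Roughly, a single dataset section carrying both streams would look like the sketch below (names and paths are placeholders, and the lab and n_chunks entries are omitted; see the linked cfg for the exact layout and separators):

    [dataset1]
    data_name = EMA_MAE_tr
    fea = fea_name=fbank
        fea_lst=<path>/data/train/feats_fbank.scp
        fea_opts=apply-cmvn --utt2spk=ark:<path>/data/train/utt2spk ark:<path>/fbank/cmvn_train.ark ark:- ark:- |
        cw_left=5
        cw_right=5

        fea_name=art
        fea_lst=<path>/Articulatory/data/train/feats_art.scp
        fea_opts=apply-cmvn --utt2spk=ark:<path>/Articulatory/data/train/utt2spk ark:<path>/Articulatory/art/cmvn_train.ark ark:- ark:- |
        cw_left=5
        cw_right=5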

On Wed, 27 Mar 2019 at 19:00, Narcissus notifications@github.com wrote:

Thank you very much. My target is to estimate articulatory features for the test acoustic entries. Therefore, I think the model should be like this:

model = enc_acoustic=compute(MLP_encoder_acoustic,fbank)
dec_acoustic=compute(MLP_decoder_acoustic,enc_acoustic)

enc_articulatory=compute(MLP_encoder_articulatory,art)

dec_articulatory=compute(MLP_decoder_articulatory,enc_articulatory)
dec_articulatory2=compute(MLP_decoder_articulatory,enc_acoustic)

loss1=mse(dec_acoustic,fbank)
loss2=mse(dec_articulatory,art)
loss3=mse(dec_articulatory2,fbank)
loss_sum1=sum(loss1,loss2)
loss_final=sum(loss_sum1,loss3)

err_final=cost_err(dec_articulatory2,lab_cd_articulatory)

Moreover, I changed the cfg file to this configuration:

[cfg_proto]
cfg_proto = proto/global.proto
cfg_proto_chunk = proto/global_chunk.proto

[exp]
cmd =
run_nn_script = run_nn.py
out_folder = exp/EMA_MAE_MLP_autoencoder
seed = 2234
use_cuda = True
multi_gpu = False
save_gpumem = False
n_epochs_tr = 10

[dataset1]
data_name = EMA_MAE_tr_acoustic
fea = fea_name=fbank
    fea_lst=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/data/train/feats_fbank.scp
    fea_opts=apply-cmvn --utt2spk=ark:/audio/kaldi/kaldi/egs/ORG_EMA_MAE/data/train/utt2spk ark:/audio/kaldi/kaldi/egs/ORG_EMA_MAE/fbank/cmvn_train.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
    cw_left=5
    cw_right=5
lab = lab_name=lab_cd_acoustic
    lab_folder=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/exp/dnn4_pretrain-dbn_dnn_ali
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/data/train/
    lab_graph=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/exp/tri3/graph
n_chunks = 5

[dataset2]
data_name = EMA_MAE_tr_articulatory
fea = fea_name=art
    fea_lst=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/Articulatory/data/train/feats_art.scp
    fea_opts=apply-cmvn --utt2spk=ark:/audio/kaldi/kaldi/egs/ORG_EMA_MAE/Articulatory/data/train/utt2spk ark:/audio/kaldi/kaldi/egs/ORG_EMA_MAE/Articulatory/art/cmvn_train.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
    cw_left=5
    cw_right=5
lab = lab_name=lab_cd_articulatory
    lab_folder=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/Articulatory/dnn4_pretrain-dbn_dnn_ali
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/Articulatory/data/train/
    lab_graph=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/Articulatory/exp/tri3/graph
n_chunks = 5

[dataset3]
data_name = EMA_MAE_dev_acoustic
fea = fea_name=fbank
    fea_lst=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/data/dev/feats_fbank.scp
    fea_opts=apply-cmvn --utt2spk=ark:/audio/kaldi/kaldi/egs/ORG_EMA_MAE/data/dev/utt2spk ark:/audio/kaldi/kaldi/egs/ORG_EMA_MAE/fbank/cmvn_dev.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
    cw_left=5
    cw_right=5
lab = lab_name=lab_cd_acoustic
    lab_folder=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/exp/dnn4_pretrain-dbn_dnn_ali_dev
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/data/dev/
    lab_graph=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/exp/tri3/graph
n_chunks = 1

[dataset4]
data_name = EMA_MAE_dev_articulatory
fea = fea_name=art

    fea_lst=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/Articulatory/data/dev/feats_art.scp
    fea_opts=apply-cmvn --utt2spk=ark:/audio/kaldi/kaldi/egs/ORG_EMA_MAE/Articulatory/data/dev/utt2spk ark:/audio/kaldi/kaldi/egs/ORG_EMA_MAE/Articulatory/art/cmvn_dev.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
    cw_left=5
    cw_right=5
lab = lab_name=lab_cd_articulatory
    lab_folder=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/Articulatory/exp/dnn4_pretrain-dbn_dnn_ali_dev
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/Articulatory/data/dev/
    lab_graph=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/Articulatory/exp/tri3/graph
n_chunks = 1

[dataset5]
data_name = EMA_MAE_test_acoustic
fea = fea_name=fbank
    fea_lst=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/data/test/feats_fbank.scp
    fea_opts=apply-cmvn --utt2spk=ark:/audio/kaldi/kaldi/egs/ORG_EMA_MAE/data/test/utt2spk ark:/audio/kaldi/kaldi/egs/ORG_EMA_MAE/fbank/cmvn_test.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
    cw_left=5
    cw_right=5
lab = lab_name=lab_cd_acoustic
    lab_folder=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/exp/dnn4_pretrain-dbn_dnn_ali_test
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/data/test/
    lab_graph=/audio/kaldi/kaldi/egs/ORG_EMA_MAE/exp/tri3/graph
n_chunks = 1

[data_use1]
train_with = EMA_MAE_tr_acoustic
valid_with = EMA_MAE_dev_acoustic
forward_with = EMA_MAE_test_acoustic

[data_use2]
train_with = EMA_MAE_tr_articulatory
valid_with = EMA_MAE_dev_articulatory
forward_with = EMA_MAE_test_acoustic

[batches1]
batch_size_train = 128
max_seq_length_train = 1000
increase_seq_length_train = False
start_seq_len_train = 100
multply_factor_seq_len_train = 2
batch_size_valid = 128
max_seq_length_valid = 1000

[architecture1]
arch_name = MLP_encoder_acoustic
arch_proto = proto/MLP.proto
arch_library = neural_networks
arch_class = MLP
arch_pretrain_file = none
arch_freeze = False
arch_seq_model = False
dnn_lay = 1024,100
dnn_drop = 0.15,0.15
dnn_use_laynorm_inp = False
dnn_use_batchnorm_inp = False
dnn_use_batchnorm = True,True
dnn_use_laynorm = False,False
dnn_act = relu,linear
arch_lr = 0.08
arch_halving_factor = 0.5
arch_improvement_threshold = 0.001
arch_opt = sgd
opt_momentum = 0.0
opt_weight_decay = 0.0
opt_dampening = 0.0
opt_nesterov = False

[architecture2]
arch_name = MLP_decoder_acoustic
arch_proto = proto/MLP.proto
arch_library = neural_networks
arch_class = MLP
arch_pretrain_file = none
arch_freeze = False
arch_seq_model = False
dnn_lay = 1024,440
dnn_drop = 0.15,0.0
dnn_use_laynorm_inp = False
dnn_use_batchnorm_inp = False
dnn_use_batchnorm = True,False
dnn_use_laynorm = False,False
dnn_act = relu,linear
arch_lr = 0.08
arch_halving_factor = 0.5
arch_improvement_threshold = 0.001
arch_opt = sgd
opt_momentum = 0.0
opt_weight_decay = 0.0
opt_dampening = 0.0
opt_nesterov = False

[architecture3]
arch_name = MLP_encoder_articulatory
arch_proto = proto/MLP.proto
arch_library = neural_networks
arch_class = MLP
arch_pretrain_file = none
arch_freeze = False
arch_seq_model = False
dnn_lay = 1024,100
dnn_drop = 0.15,0.15
dnn_use_laynorm_inp = False
dnn_use_batchnorm_inp = False
dnn_use_batchnorm = True,True
dnn_use_laynorm = False,False
dnn_act = relu,linear
arch_lr = 0.08
arch_halving_factor = 0.5
arch_improvement_threshold = 0.001
arch_opt = sgd
opt_momentum = 0.0
opt_weight_decay = 0.0
opt_dampening = 0.0
opt_nesterov = False

[architecture4]
arch_name = MLP_decoder_articulatory
arch_proto = proto/MLP.proto
arch_library = neural_networks
arch_class = MLP
arch_pretrain_file = none
arch_freeze = False
arch_seq_model = False
dnn_lay = 1024,440
dnn_drop = 0.15,0.0
dnn_use_laynorm_inp = False
dnn_use_batchnorm_inp = False
dnn_use_batchnorm = True,False
dnn_use_laynorm = False,False
dnn_act = relu,linear
arch_lr = 0.08
arch_halving_factor = 0.5
arch_improvement_threshold = 0.001
arch_opt = sgd
opt_momentum = 0.0
opt_weight_decay = 0.0
opt_dampening = 0.0
opt_nesterov = False

[model]
model_proto = proto/model.proto

model = enc_acoustic=compute(MLP_encoder_acoustic,fbank)
dec_acoustic=compute(MLP_decoder_acoustic,enc_acoustic)

enc_articulatory=compute(MLP_encoder_articulatory,art)

dec_articulatory=compute(MLP_decoder_articulatory,enc_articulatory)
dec_articulatory2=compute(MLP_decoder_articulatory,enc_acoustic)

loss1=mse(dec_acoustic,fbank)
loss2=mse(dec_articulatory,art)
loss3=mse(dec_articulatory2,fbank)
loss_sum1=sum(loss1,loss2)
loss_final=sum(loss_sum1,loss3)

err_final=cost_err(dec_articulatory2,lab_cd_articulatory)

[forward]
forward_out = enc_out
normalize_posteriors = False
normalize_with_counts_from = None
save_out_file = True
require_decoding = False

[decoding]
decoding_script_folder = kaldi_decoding_scripts/
decoding_script = decode_dnn.sh
decoding_proto = proto/decoding.proto
min_active = 200
max_active = 7000
max_mem = 50000000
beam = 13.0
latbeam = 8.0
acwt = 0.2
max_arcs = -1
skip_scoring = false
scoring_script = local/score.sh
scoring_opts = "--min-lmwt 1 --max-lmwt 10"
norm_vars = False

I don't know how I can add the new architectures for training and validating the articulatory information. In addition, I wasn't able to add the 'data_use' sections like this. Would you help me fix these mistakes, please? Moreover, I have a general question: how are these two encoders connected to the single decoder? Are they sharing the same BN layer, or are they sharing the last hidden layer states at the end of the encoding section?


mravanelli commented 5 years ago

Could you post the new config file?

On Thu, 28 Mar 2019 at 13:38, Narcissus notifications@github.com wrote:

I changed the config file. However, when I run the code I get the following error:

python run_exp.py cfg/TIMIT_baselines/EMA_MAE_MLP_autoencoder.cfg

- Reading config file......OK!
['fbank', 'art']
Traceback (most recent call last):
  File "run_exp.py", line 87, in <module>
    create_lists(config)
  File "/audio/kaldi/kaldi/egs/ORG_EMA_MAE/pytorch-kaldi/utils.py", line 940, in create_lists
    full_list.append([line.rstrip('\n')+',' for line in open(list_fea[i])])
IndexError: list index out of range

Is this because of the inconsistency between the sizes of the features (fbank and 'art')?


TParcollet commented 5 years ago

This error looks like a path error. I would double-check the .scp files (that they exist, and that the paths inside them exist).
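A quick way to do that check, as a sketch (the scp path is a placeholder; each scp line has the standard Kaldi form 'utt_id /path/to/archive.ark:offset'):

    import os

    scp_path = 'data/train/feats_fbank.scp'   # placeholder path
    with open(scp_path) as f:
        for line in f:
            if not line.strip():
                continue
            utt_id, rxspec = line.strip().split(None, 1)
            ark_file = rxspec.split(':')[0]    # drop the byte offset
            if not os.path.isfile(ark_file):
                print('missing archive for', utt_id, '->', ark_file)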

TParcollet commented 5 years ago

Can you print 'pattern' and 'line' at line 1529 of utils.py? I suppose this is another path problem.

mravanelli commented 5 years ago

I think the problem is that you have an empty line in the [model] section, right? This is not allowed in the current version.
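For reference, here is the same [model] section from the cfg above with the empty lines removed (each computation on its own continuation line):

    [model]
    model_proto = proto/model.proto
    model = enc_acoustic=compute(MLP_encoder_acoustic,fbank)
        dec_acoustic=compute(MLP_decoder_acoustic,enc_acoustic)
        enc_articulatory=compute(MLP_encoder_articulatory,art)
        dec_articulatory=compute(MLP_decoder_articulatory,enc_articulatory)
        dec_articulatory2=compute(MLP_decoder_articulatory,enc_acoustic)
        loss1=mse(dec_acoustic,fbank)
        loss2=mse(dec_articulatory,art)
        loss3=mse(dec_articulatory2,fbank)
        loss_sum1=sum(loss1,loss2)
        loss_final=sum(loss_sum1,loss3)
        err_final=cost_err(dec_articulatory2,lab_cd_articulatory)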

On Thu, 28 Mar 2019 at 15:42, Narcissus notifications@github.com wrote:

(.*)=(.*)\((.*),(.*)\) enc_acoustic=compute(MLP_encoder_acoustic,fbank)
(.*)=(.*)\((.*),(.*)\) dec_acoustic=compute(MLP_decoder_acoustic,enc_acoustic)


mravanelli commented 5 years ago

Yes, to make your system work, all the features of all the sentences must have the same number of time frames; otherwise you will surely run into this concatenation error. You can probably read the two feature streams with the same command you find in the function load_dataset in data_io.py:

fea = { k:m for k,m in read_mat_ark('ark:copy-feats scp:'+fea_scp+' ark:- |'+fea_opts,output_folder) }

You can manually read the features by assigning the proper values to the variables fea_scp, fea_opts, and output_folder. Make sure to import the read_mat_ark function. fea will be a dictionary where the keys are the sentence ids and the values are the sequences of features.
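Concretely, a sketch along those lines (paths are illustrative; read_mat_ark is the pytorch-kaldi helper quoted above, which also takes the experiment output folder):

    from data_io import read_mat_ark  # pytorch-kaldi helper

    output_folder = 'exp/EMA_MAE_MLP_autoencoder/exp_files'   # illustrative

    def load(fea_scp):
        # same pattern as load_dataset in data_io.py (fea_opts omitted for brevity)
        return {k: m for k, m in read_mat_ark('ark:copy-feats scp:' + fea_scp + ' ark:- |', output_folder)}

    fbank = load('data/train/feats_fbank.scp')
    art = load('Articulatory/data/train/feats_art.scp')

    for utt in fbank:
        if utt in art and fbank[utt].shape[0] != art[utt].shape[0]:
            print(utt, 'frame mismatch:', fbank[utt].shape, art[utt].shape)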

Mirco

On Thu, 28 Mar 2019 at 17:34, Narcissus notifications@github.com wrote:

Thank you very much!!!! Can you help me check the sizes of my features as well? I know that with feat-to-dim I can only check the dimension, which is 13 for MFCC and 10 for Art; however, I can't check the length of the features, which I think I have to fix to be exactly the same.

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/audio/kaldi/kaldi/egs/ORG_EMA_MAE/pytorch-kaldi/data_io.py", line 207, in read_lab_fea
    [data_name_fea,data_set_fea,data_end_index_fea]=load_chunk(fea_scp,fea_opts,lab_folder,lab_opts,cw_left,cw_right,max_seq_length, output_folder, fea_only)
  File "/audio/kaldi/kaldi/egs/ORG_EMA_MAE/pytorch-kaldi/data_io.py", line 142, in load_chunk
    data_set=np.column_stack((data_set, data_lab))
  File "/usr/local/lib/python3.7/site-packages/numpy/lib/shape_base.py", line 640, in column_stack
    return _nx.concatenate(arrays, 1)
ValueError: all the input array dimensions except for the concatenation axis must match exactly


TParcollet commented 5 years ago

Also, if you want to see how many time steps you have, you can check len(fea[k]) in the function load_dataset of data_io.py.

TParcollet commented 5 years ago

Could you print the shapes instead of the lengths? Could you also paste your .cfg file, so we can check the output sizes?

TParcollet commented 5 years ago

From my point of view, you're computing an MSE loss (loss1, for example) between dec_acoustic, which has an output size of 440, and the FBANK features, which must be of size 23. Therefore, it can't work.

TParcollet commented 5 years ago

Why do you use the decoders for the acoustic and articulatory encoders? Do you want to train both models independently before combining them, or is your model trained jointly? If so, you don't need to decode the first two encoders; just plug them into the top decoder.

TParcollet commented 5 years ago

In the example, we take 40 FBANKs with a context window of size 11, giving 11*40 = 440. In your cfg you take a window of size 1, and you might have generated FBANKs of size 23 :)

Check the cw_{left,right} values in our example cfg.
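In other words, the input size seen by the first layer is feature_dim * (cw_left + cw_right + 1); as a quick check:

    n_mels, cw_left, cw_right = 40, 5, 5
    print(n_mels * (cw_left + cw_right + 1))   # 40 * 11 = 440 (the example cfg)
    print(23 * (cw_left + cw_right + 1))       # 23 * 11 = 253 (with Kaldi's default 23 fbank bins)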

TParcollet commented 5 years ago

It can also work :) But for the features, your 440 (last decoder layer) should be 23 (unless you also want a context window)

mravanelli commented 5 years ago

This is specified within the forward section (see the documentation about that in the README file). In practice, at the end of training, a forward step is performed with the dataset specified in "forward_with = TIMIT_test". For example:

[forward]
forward_out = enc_out
normalize_posteriors = False
normalize_with_counts_from = None
save_out_file = True
require_decoding = False

forward_out is one of the outputs that you have specified in the [model] section. In your case, it could be one of dec_acoustic, enc_acoustic, enc_articulatory, dec_articulatory, and dec_articulatory2 (depending on the output that you would like to save). If you keep save_out_file=True, you should find a single *.ark file (in the exp folder) that contains the selected output.
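Once the forward step has run, the saved output can be read back with any Kaldi ark reader; a sketch assuming the kaldi-io-for-python package (the file name below is hypothetical, look for the forward *.ark actually produced under the out_folder):

    import kaldi_io  # https://github.com/vesis84/kaldi-io-for-python

    ark_file = 'exp/EMA_MAE_MLP_autoencoder/exp_files/forward_out.ark'   # hypothetical name
    for utt, mat in kaldi_io.read_mat_ark(ark_file):
        print(utt, mat.shape)   # one matrix of saved outputs per utterance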

On Sun, 31 Mar 2019 at 16:10, Narcissus notifications@github.com wrote:

I fixed the problem. Thank you.

Do you know where and how I can save the estimated articulatory features (output of the last decoder)? EMA_MAE_autoencoder_test2.txt https://github.com/mravanelli/pytorch-kaldi/files/3027092/EMA_MAE_autoencoder_test2.txt


mravanelli commented 5 years ago

Unfortunately these days I'm too busy with the Interspeech deadline. As soon as I have a bit of time I can take a look into it...

On Apr 1, 2019 13:06, "Narcissus" notifications@github.com wrote:

Closed #69 https://github.com/mravanelli/pytorch-kaldi/issues/69.


mravanelli commented 5 years ago

This is weird, because when you pronounce a phoneme I would expect a correlation between articulatory and acoustic features. Why do you say it doesn't work? Does the network not converge?

On Apr 1, 2019 19:43, "Narcissus" notifications@github.com wrote:

I've followed your suggestion and run the architecture with 2 autoencoders and 1 decoder. It doesn't work. I guess the reason is that the articulatory features are synchronized with the acoustic features, but they are not representative of the same data.


mravanelli commented 5 years ago

Which kind of problem are you observing with the suggested solution? Does it not converge, or something like that?

On Apr 1, 2019 20:02, "Narcissus" notifications@github.com wrote:

The articulatory information is in sensor space, while the acoustic FBANKs belong to the other data stream, in acoustic space. These two data streams are synchronized (meaning they were recorded at the same time), but they are not the same in nature. I checked the correlation between the estimated articulatory results and the true articulatory trajectories; the average correlation was around 0.2. FYI, I've implemented this experiment before with HMM-GMM and the latest correlation result was 0.68.


mravanelli commented 5 years ago

Have you trained it on a single epoch only?

On Apr 1, 2019 20:19, "Narcissus" notifications@github.com wrote:

I ran the network for 1 epoch and checked the output (the estimated articulatory features .ark). The estimated results are very different from the reality. I also checked the outputs of the autoencoder that was trained only on articulatory features. Those estimated outputs were close enough to the true trajectories (which means the encoder-decoder works for one type of data, the articulatory type); however, it doesn't work when we use the encoder for one type and the decoder for the other type.


mravanelli commented 5 years ago

I have never seen a neural network that converges in one epoch only. I would suggest trying many more epochs and monitoring the performance on the test set. In parallel, as @tparcollet suggested, you can try a simpler system based on an encoder fed by acoustic features and two decoders that estimate the acoustic and articulatory features. You can implement it by simply modifying the autoencoder cfg file we provided.

On Apr 1, 2019 20:24, "Narcissus" notifications@github.com wrote:

Yes, I just wanted to check if it works properly.


mravanelli commented 5 years ago

Also, why don't you try an even simpler system based on a single neural network that takes acoustic features as input and simply estimates the articulatory ones?
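Following the same cfg syntax used above, the [model] section of such a single-network system could be as small as the sketch below. MLP_acoustic_to_art is a hypothetical architecture name to be defined in its own [architecture] section, and whether err_final is needed in exactly this form depends on the pytorch-kaldi version (the autoencoder example in the repo is the reference):

    [model]
    model_proto = proto/model.proto
    model = pred_art=compute(MLP_acoustic_to_art,fbank)
        loss_final=mse(pred_art,art)
        err_final=cost_err(pred_art,lab_cd_articulatory)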

On Apr 1, 2019 20:54, "Narcissus" notifications@github.com wrote:

If you confirm this configuration, I will run it tonight for 10 epochs and share the results tomorrow.


mravanelli commented 5 years ago

Yes, this is the simplest system that you can implement.

On Mon, 1 Apr 2019 at 21:17, Narcissus notifications@github.com wrote:

You mean using the articulatory features as supervised labels for the acoustic features, just like transcriptions?


Baileyswu commented 4 years ago

In the example, we take 40 FBANKs with a context window of size 11, giving 11*40 = 440. In your cfg you take a window of size 1, and you might have generated FBANKs of size 23 :)

Check the cw_{left,right} values in our example cfg.

I set the context window, and then it comes to 23*11 = 253. I wonder how to generate 40 FBANKs?

TParcollet commented 4 years ago

This is related to the Kaldi feature extraction (make_fbanks.sh). You can pass a config file to this script (usually in the conf folder) where you can set the number of bins.
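For example, a conf/fbank.conf passed to that script could contain something like the following (--num-mel-bins is the standard compute-fbank-feats option; 23 is its default):

    --num-mel-bins=40
    --use-energy=false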

Baileyswu commented 4 years ago

This is related to the Kaldi feature extraction (make_fbanks.sh). You can pass a config file to this script (usually in the conf folder) where you can set the number of bins.

Fantastic! Thanks a lot 😘