housebaby opened this issue 2 years ago (status: Open)
Please share with us your model code if possible?
> Please share with us your model code if possible?
This is how I use ORT: `model = ORTModule(init_asr_model(configs))`.
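For reference, wrapping an existing `torch.nn.Module` in `ORTModule` as above can be sketched in a guarded form so the script still runs when `onnxruntime-training` is not installed. This is a minimal sketch; `wrap_for_ort` is a hypothetical helper name, not part of any library:

```python
# a minimal sketch, assuming the onnxruntime-training package provides ORTModule
try:
    from onnxruntime.training.ortmodule import ORTModule
except ImportError:
    ORTModule = None  # onnxruntime-training not available in this environment

def wrap_for_ort(model):
    # wrap the model for ONNX Runtime training when possible,
    # otherwise fall back to the plain PyTorch module unchanged
    return ORTModule(model) if ORTModule is not None else model
```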
when I print the model, it is as follows: ORTModule(` (encoder): ConformerEncoder( (global_cmvn): GlobalCMVN() (embed): Conv2dSubsampling4( (conv): Sequential( (0): Conv2d(1, 512, kernel_size=(3, 3), stride=(2, 2)) (1): ReLU() (2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2)) (3): ReLU() ) (out): Sequential( (0): Linear(in_features=9728, out_features=512, bias=True) ) (pos_enc): RelPositionalEncoding( (dropout): Dropout(p=0.1, inplace=False) ) ) (after_norm): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (encoders): ModuleList( (0): ConformerEncoderLayer( (self_attn): RelPositionMultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear_pos): Linear(in_features=512, out_features=512, bias=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (feed_forward_macaron): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (conv_module): ConvolutionModule( (pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,)) (depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512) (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,)) (activation): SiLU() ) (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True) 
(norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear): Linear(in_features=1024, out_features=512, bias=True) ) (1): ConformerEncoderLayer( (self_attn): RelPositionMultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear_pos): Linear(in_features=512, out_features=512, bias=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (feed_forward_macaron): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (conv_module): ConvolutionModule( (pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,)) (depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512) (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,)) (activation): SiLU() ) (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear): Linear(in_features=1024, out_features=512, bias=True) ) (2): ConformerEncoderLayer( (self_attn): 
RelPositionMultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear_pos): Linear(in_features=512, out_features=512, bias=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (feed_forward_macaron): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (conv_module): ConvolutionModule( (pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,)) (depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512) (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,)) (activation): SiLU() ) (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear): Linear(in_features=1024, out_features=512, bias=True) ) (3): ConformerEncoderLayer( (self_attn): RelPositionMultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) 
(dropout): Dropout(p=0.0, inplace=False) (linear_pos): Linear(in_features=512, out_features=512, bias=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (feed_forward_macaron): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (conv_module): ConvolutionModule( (pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,)) (depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512) (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,)) (activation): SiLU() ) (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear): Linear(in_features=1024, out_features=512, bias=True) ) (4): ConformerEncoderLayer( (self_attn): RelPositionMultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear_pos): Linear(in_features=512, out_features=512, bias=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): 
Linear(in_features=2048, out_features=512, bias=True) ) (feed_forward_macaron): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (conv_module): ConvolutionModule( (pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,)) (depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512) (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,)) (activation): SiLU() ) (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear): Linear(in_features=1024, out_features=512, bias=True) ) (5): ConformerEncoderLayer( (self_attn): RelPositionMultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear_pos): Linear(in_features=512, out_features=512, bias=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (feed_forward_macaron): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) 
(conv_module): ConvolutionModule( (pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,)) (depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512) (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,)) (activation): SiLU() ) (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear): Linear(in_features=1024, out_features=512, bias=True) ) (6): ConformerEncoderLayer( (self_attn): RelPositionMultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear_pos): Linear(in_features=512, out_features=512, bias=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (feed_forward_macaron): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (conv_module): ConvolutionModule( (pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,)) (depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512) (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), 
stride=(1,)) (activation): SiLU() ) (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear): Linear(in_features=1024, out_features=512, bias=True) ) (7): ConformerEncoderLayer( (self_attn): RelPositionMultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear_pos): Linear(in_features=512, out_features=512, bias=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (feed_forward_macaron): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (conv_module): ConvolutionModule( (pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,)) (depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512) (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,)) (activation): SiLU() ) (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_conv): LayerNorm((512,), eps=1e-12, 
elementwise_affine=True) (norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear): Linear(in_features=1024, out_features=512, bias=True) ) (8): ConformerEncoderLayer( (self_attn): RelPositionMultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear_pos): Linear(in_features=512, out_features=512, bias=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (feed_forward_macaron): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (conv_module): ConvolutionModule( (pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,)) (depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512) (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,)) (activation): SiLU() ) (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear): Linear(in_features=1024, out_features=512, bias=True) ) (9): ConformerEncoderLayer( (self_attn): RelPositionMultiHeadedAttention( (linear_q): 
Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear_pos): Linear(in_features=512, out_features=512, bias=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (feed_forward_macaron): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (conv_module): ConvolutionModule( (pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,)) (depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512) (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,)) (activation): SiLU() ) (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear): Linear(in_features=1024, out_features=512, bias=True) ) (10): ConformerEncoderLayer( (self_attn): RelPositionMultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) 
(linear_pos): Linear(in_features=512, out_features=512, bias=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (feed_forward_macaron): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (conv_module): ConvolutionModule( (pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,)) (depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512) (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,)) (activation): SiLU() ) (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear): Linear(in_features=1024, out_features=512, bias=True) ) (11): ConformerEncoderLayer( (self_attn): RelPositionMultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear_pos): Linear(in_features=512, out_features=512, bias=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) 
(feed_forward_macaron): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): SiLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (conv_module): ConvolutionModule( (pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,)) (depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512) (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,)) (activation): SiLU() ) (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear): Linear(in_features=1024, out_features=512, bias=True) ) ) ) (decoder): TransformerDecoder( (embed): Sequential( (0): Embedding(5002, 512) (1): PositionalEncoding( (dropout): Dropout(p=0.1, inplace=False) ) ) (after_norm): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (output_layer): Linear(in_features=512, out_features=5002, bias=True) (decoders): ModuleList( (0): DecoderLayer( (self_attn): MultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (src_attn): MultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) 
(dropout): Dropout(p=0.0, inplace=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): ReLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (norm1): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm2): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm3): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear1): Linear(in_features=1024, out_features=512, bias=True) (concat_linear2): Linear(in_features=1024, out_features=512, bias=True) ) (1): DecoderLayer( (self_attn): MultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (src_attn): MultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): ReLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (norm1): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm2): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm3): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear1): Linear(in_features=1024, out_features=512, bias=True) (concat_linear2): Linear(in_features=1024, out_features=512, bias=True) ) (2): DecoderLayer( (self_attn): 
MultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (src_attn): MultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): ReLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (norm1): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm2): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm3): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear1): Linear(in_features=1024, out_features=512, bias=True) (concat_linear2): Linear(in_features=1024, out_features=512, bias=True) ) (3): DecoderLayer( (self_attn): MultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (src_attn): MultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (feed_forward): 
PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): ReLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (norm1): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm2): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm3): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear1): Linear(in_features=1024, out_features=512, bias=True) (concat_linear2): Linear(in_features=1024, out_features=512, bias=True) ) (4): DecoderLayer( (self_attn): MultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (src_attn): MultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): ReLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (norm1): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm2): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm3): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear1): Linear(in_features=1024, out_features=512, bias=True) (concat_linear2): Linear(in_features=1024, out_features=512, bias=True) ) (5): DecoderLayer( (self_attn): MultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, 
bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (src_attn): MultiHeadedAttention( (linear_q): Linear(in_features=512, out_features=512, bias=True) (linear_k): Linear(in_features=512, out_features=512, bias=True) (linear_v): Linear(in_features=512, out_features=512, bias=True) (linear_out): Linear(in_features=512, out_features=512, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=512, out_features=2048, bias=True) (activation): ReLU() (dropout): Dropout(p=0.1, inplace=False) (w_2): Linear(in_features=2048, out_features=512, bias=True) ) (norm1): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm2): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (norm3): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (concat_linear1): Linear(in_features=1024, out_features=512, bias=True) (concat_linear2): Linear(in_features=1024, out_features=512, bias=True) ) ) ) (ctc): CTC( (ctc_lo): Linear(in_features=512, out_features=5002, bias=True) (ctc_loss): CTCLoss() ) (criterion_att): LabelSmoothingLoss( (criterion): KLDivLoss() ) )
> Please share with us your model code if possible?
```python
from typing import Optional, Tuple

import torch

# TransformerEncoder, TransformerDecoder, CTC, LabelSmoothingLoss and
# IGNORE_ID are imported from the project's own modules (WeNet-style codebase).


class ASRModel(torch.nn.Module):
    """CTC-attention hybrid Encoder-Decoder model."""

    def __init__(
        self,
        vocab_size: int,
        encoder: TransformerEncoder,
        decoder: TransformerDecoder,
        ctc: CTC,
        ctc_weight: float = 0.5,
        ignore_id: int = IGNORE_ID,
        reverse_weight: float = 0.0,
        lsm_weight: float = 0.0,
        length_normalized_loss: bool = False,
    ):
        assert 0.0 <= ctc_weight <= 1.0, ctc_weight
        super().__init__()
        # note that eos is the same as sos (equivalent ID)
        self.sos = vocab_size - 1
        self.eos = vocab_size - 1
        self.vocab_size = vocab_size
        self.ignore_id = ignore_id
        self.ctc_weight = ctc_weight
        self.reverse_weight = reverse_weight
        self.encoder = encoder
        self.decoder = decoder
        self.ctc = ctc
        self.criterion_att = LabelSmoothingLoss(
            size=vocab_size,
            padding_idx=ignore_id,
            smoothing=lsm_weight,
            normalize_length=length_normalized_loss,
        )

    def forward(
        self,
        speech: torch.Tensor,
        speech_lengths: torch.Tensor,
        text: torch.Tensor,
        text_lengths: torch.Tensor,
    ) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor],
               Optional[torch.Tensor]]:
        """Frontend + Encoder + Decoder + Calc loss

        Args:
            speech: (Batch, Length, ...)
            speech_lengths: (Batch, )
            text: (Batch, Length)
            text_lengths: (Batch,)
        """
        assert text_lengths.dim() == 1, text_lengths.shape
        # Check that batch_size is unified
        assert (speech.shape[0] == speech_lengths.shape[0] == text.shape[0] ==
                text_lengths.shape[0]), (speech.shape, speech_lengths.shape,
                                         text.shape, text_lengths.shape)
        # 1. Encoder
        encoder_out, encoder_mask = self.encoder(speech, speech_lengths)
        encoder_out_lens = encoder_mask.squeeze(1).sum(1)
        # 2a. Attention-decoder branch
        if self.ctc_weight != 1.0:
            loss_att, acc_att = self._calc_att_loss(encoder_out, encoder_mask,
                                                    text, text_lengths)
        else:
            loss_att = None
        # 2b. CTC branch
        if self.ctc_weight != 0.0:
            loss_ctc = self.ctc(encoder_out, encoder_out_lens, text,
                                text_lengths)
        else:
            loss_ctc = None
        if loss_ctc is None:
            loss = loss_att
        elif loss_att is None:
            loss = loss_ctc
        else:
            loss = self.ctc_weight * loss_ctc + (1 -
                                                 self.ctc_weight) * loss_att
        return loss, loss_att, loss_ctc
```
Hi, could you please provide steps to reproduce the issue, including sample data and run scripts? Also, what is your runtime environment (installed packages, ORT version, etc.)? Older versions of ORT may not show an obvious gain; please try our latest release together with an upgraded torch and see whether it makes a difference before getting back to us. https://download.onnxruntime.ai/ Thanks.
I have tried using ORT to train a transformer, but it does not seem to give any speedup. I wonder whether I have missed something in the configuration.
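One thing worth checking when benchmarking: ORTModule performs its ONNX export and graph compilation on the first training steps, so including those iterations in the measurement can hide any steady-state speedup. A minimal stdlib-only timing sketch (`time_steps` is a hypothetical helper, not part of ORT or PyTorch):

```python
import time

def time_steps(step_fn, n_warmup=3, n_iters=10):
    # run a few warm-up iterations first, so that one-time costs
    # (e.g. ORTModule's export/compilation on the first steps) are excluded
    for _ in range(n_warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    # return the average wall-clock time of a steady-state step
    return (time.perf_counter() - start) / n_iters
```

Comparing the steady-state averages of the plain module and the ORTModule-wrapped module (on the same data and device) gives a fairer picture than timing the whole first epoch.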