pytorch / ort

Accelerate PyTorch models with ONNX Runtime
MIT License

no speedup using ort #103

Open housebaby opened 2 years ago

housebaby commented 2 years ago

I have tried using ort to train a transformer, but it seems that I get no speedup. I wonder whether I have missed something in the configuration.

baijumeswani commented 2 years ago

Could you please share your model code with us, if possible?

housebaby commented 2 years ago

> Could you please share your model code with us, if possible?

This is how I use ort:

```python
model = ORTModule(init_asr_model(configs))
```

When I print the model, it looks like this (the 12 encoder layers and 6 decoder layers are identical, so only the first of each is shown in full):

```
ORTModule(
  (encoder): ConformerEncoder(
    (global_cmvn): GlobalCMVN()
    (embed): Conv2dSubsampling4(
      (conv): Sequential(
        (0): Conv2d(1, 512, kernel_size=(3, 3), stride=(2, 2))
        (1): ReLU()
        (2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2))
        (3): ReLU()
      )
      (out): Sequential(
        (0): Linear(in_features=9728, out_features=512, bias=True)
      )
      (pos_enc): RelPositionalEncoding(
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (after_norm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
    (encoders): ModuleList(
      (0): ConformerEncoderLayer(
        (self_attn): RelPositionMultiHeadedAttention(
          (linear_q): Linear(in_features=512, out_features=512, bias=True)
          (linear_k): Linear(in_features=512, out_features=512, bias=True)
          (linear_v): Linear(in_features=512, out_features=512, bias=True)
          (linear_out): Linear(in_features=512, out_features=512, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
          (linear_pos): Linear(in_features=512, out_features=512, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (activation): SiLU()
          (dropout): Dropout(p=0.1, inplace=False)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (feed_forward_macaron): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (activation): SiLU()
          (dropout): Dropout(p=0.1, inplace=False)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (conv_module): ConvolutionModule(
          (pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
          (depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512)
          (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
          (activation): SiLU()
        )
        (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (concat_linear): Linear(in_features=1024, out_features=512, bias=True)
      )
      (1)-(11): eleven more ConformerEncoderLayer blocks, identical to (0)
    )
  )
  (decoder): TransformerDecoder(
    (embed): Sequential(
      (0): Embedding(5002, 512)
      (1): PositionalEncoding(
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (after_norm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
    (output_layer): Linear(in_features=512, out_features=5002, bias=True)
    (decoders): ModuleList(
      (0): DecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_q): Linear(in_features=512, out_features=512, bias=True)
          (linear_k): Linear(in_features=512, out_features=512, bias=True)
          (linear_v): Linear(in_features=512, out_features=512, bias=True)
          (linear_out): Linear(in_features=512, out_features=512, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (src_attn): MultiHeadedAttention(
          (linear_q): Linear(in_features=512, out_features=512, bias=True)
          (linear_k): Linear(in_features=512, out_features=512, bias=True)
          (linear_v): Linear(in_features=512, out_features=512, bias=True)
          (linear_out): Linear(in_features=512, out_features=512, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (activation): ReLU()
          (dropout): Dropout(p=0.1, inplace=False)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (norm1): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (norm3): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (concat_linear1): Linear(in_features=1024, out_features=512, bias=True)
        (concat_linear2): Linear(in_features=1024, out_features=512, bias=True)
      )
      (1)-(5): five more DecoderLayer blocks, identical to (0)
    )
  )
  (ctc): CTC(
    (ctc_lo): Linear(in_features=512, out_features=5002, bias=True)
    (ctc_loss): CTCLoss()
  )
  (criterion_att): LabelSmoothingLoss(
    (criterion): KLDivLoss()
  )
)
```
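For reference, the wrapping pattern above is the standard torch-ort usage. Below is a minimal, self-contained sketch of a training step with ORTModule; the toy model, shapes, optimizer, and data are placeholders standing in for init_asr_model(configs) and the real data pipeline, not part of the issue:

```python
import torch
from torch_ort import ORTModule  # provided by this repo (torch-ort package)

# Placeholder model standing in for init_asr_model(configs); sizes are arbitrary.
class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(80, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, 10),
        )

    def forward(self, x):
        return self.net(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ORTModule(TinyModel()).to(device)   # same wrapping pattern as in the issue
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

for _ in range(5):                          # a few dummy steps
    x = torch.randn(16, 80, device=device)
    y = torch.randint(0, 10, (16,), device=device)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()                         # gradients flow through ORT's training graph
    optimizer.step()
```

Note that ORTModule exports the model to ONNX and builds its training graph on the first forward/backward calls, so the first few steps are much slower than steady state and should be excluded from any speed comparison.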

housebaby commented 2 years ago

> Could you please share your model code with us, if possible?

```python
class ASRModel(torch.nn.Module):
    """CTC-attention hybrid Encoder-Decoder model"""
    def __init__(
        self,
        vocab_size: int,
        encoder: TransformerEncoder,
        decoder: TransformerDecoder,
        ctc: CTC,
        ctc_weight: float = 0.5,
        ignore_id: int = IGNORE_ID,
        reverse_weight: float = 0.0,
        lsm_weight: float = 0.0,
        length_normalized_loss: bool = False,
    ):
        assert 0.0 <= ctc_weight <= 1.0, ctc_weight

        super().__init__()
        # note that eos is the same as sos (equivalent ID)
        self.sos = vocab_size - 1
        self.eos = vocab_size - 1
        self.vocab_size = vocab_size
        self.ignore_id = ignore_id
        self.ctc_weight = ctc_weight
        self.reverse_weight = reverse_weight

        self.encoder = encoder
        self.decoder = decoder
        self.ctc = ctc
        self.criterion_att = LabelSmoothingLoss(
            size=vocab_size,
            padding_idx=ignore_id,
            smoothing=lsm_weight,
            normalize_length=length_normalized_loss,
        )

    def forward(
        self,
        speech: torch.Tensor,
        speech_lengths: torch.Tensor,
        text: torch.Tensor,
        text_lengths: torch.Tensor,
    ) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor],
               Optional[torch.Tensor]]:
        """Frontend + Encoder + Decoder + Calc loss

        Args:
            speech: (Batch, Length, ...)
            speech_lengths: (Batch, )
            text: (Batch, Length)
            text_lengths: (Batch,)
        """
        assert text_lengths.dim() == 1, text_lengths.shape
        # Check that batch_size is unified
        assert (speech.shape[0] == speech_lengths.shape[0] == text.shape[0] ==
                text_lengths.shape[0]), (speech.shape, speech_lengths.shape,
                                         text.shape, text_lengths.shape)
        # 1. Encoder
        encoder_out, encoder_mask = self.encoder(speech, speech_lengths)
        encoder_out_lens = encoder_mask.squeeze(1).sum(1)

        # 2a. Attention-decoder branch
        if self.ctc_weight != 1.0:
            loss_att, acc_att = self._calc_att_loss(encoder_out, encoder_mask,
                                                    text, text_lengths)
        else:
            loss_att = None

        # 2b. CTC branch
        if self.ctc_weight != 0.0:
            loss_ctc = self.ctc(encoder_out, encoder_out_lens, text,
                                text_lengths)
        else:
            loss_ctc = None

        if loss_ctc is None:
            loss = loss_att
        elif loss_att is None:
            loss = loss_ctc
        else:
            loss = self.ctc_weight * loss_ctc + (1 -
                                                 self.ctc_weight) * loss_att
        return loss, loss_att, loss_ctc
```
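Since the question is whether ORTModule actually speeds up a training step for this model, a rough timing harness along the following lines may help quantify it. This is only a sketch: init_asr_model and configs are the names used earlier in the thread, the 80-dimensional features are inferred from the Linear(in_features=9728, ...) subsampling layer, and the batch size, sequence lengths, and step counts are made-up values.

```python
import time
import torch
from torch_ort import ORTModule

def sync():
    # make GPU timings honest; no-op on a CPU-only machine
    if torch.cuda.is_available():
        torch.cuda.synchronize()

def time_steps(model, n_steps=20, batch=8, frames=400, feat=80, toklen=30, vocab=5002):
    """Average wall-clock time of one forward+backward step on random data."""
    device = next(model.parameters()).device
    speech = torch.randn(batch, frames, feat, device=device)
    speech_lengths = torch.full((batch,), frames, dtype=torch.long, device=device)
    text = torch.randint(1, vocab - 1, (batch, toklen), dtype=torch.long, device=device)
    text_lengths = torch.full((batch,), toklen, dtype=torch.long, device=device)

    # warm-up so ORTModule's one-time ONNX export / graph build is not counted
    for _ in range(3):
        loss, _, _ = model(speech, speech_lengths, text, text_lengths)
        loss.backward()
        model.zero_grad()
    sync()

    start = time.time()
    for _ in range(n_steps):
        loss, _, _ = model(speech, speech_lengths, text, text_lengths)
        loss.backward()
        model.zero_grad()
    sync()
    return (time.time() - start) / n_steps

baseline = init_asr_model(configs).cuda()             # plain PyTorch model from the thread
wrapped = ORTModule(init_asr_model(configs)).cuda()   # same model wrapped with ORTModule
print("plain PyTorch step:", time_steps(baseline))
print("ORTModule step    :", time_steps(wrapped))
```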
ytaous commented 2 years ago

Hi, would you please provide steps to reproduce the issue, including sample data and run scripts? Also, what is your runtime environment (installations, ORT version, etc.)? Older versions of ORT may not show an obvious gain, so please try our latest release with an upgraded torch and see whether it makes a difference before getting back to us. https://download.onnxruntime.ai/ Thanks.
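As a concrete way to gather the environment details requested above, a report like the following could be posted (assuming the usual torch plus torch-ort / onnxruntime-training installation):

```python
# Minimal environment report: versions and whether ORT sees the GPU.
import torch
import onnxruntime

print("torch       :", torch.__version__)
print("torch CUDA  :", torch.version.cuda, "| available:", torch.cuda.is_available())
print("onnxruntime :", onnxruntime.__version__)
print("ORT device  :", onnxruntime.get_device())
```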