k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

Issue in offline-websocket-server with zipformer (old) model #391

Open uni-sagar-raikar opened 1 year ago

uni-sagar-raikar commented 1 year ago

Hi,

We are facing the following issue with the offline-websocket-server when it is used with a zipformer (old) model.

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/zipformer.py", line 23, in forward
    _0 = __torch__.icefall.utils.make_pad_mask
    encoder_embed = self.encoder_embed
    x0 = (encoder_embed).forward(x, )
         ~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    x1 = torch.permute(x0, [1, 0, 2])
    lengths = torch.__rshift__(torch.sub(x_lens, 7), 1)
  File "code/__torch__/zipformer.py", line 169, in forward
    x14 = torch.unsqueeze(x, 1)
    conv = self.conv
    x15 = (conv).forward(x14, )
           ~~~~~~~~~~~~~ <--- HERE
    b, c, t, f, = torch.size(x15)
    out = self.out
  File "code/__torch__/torch/nn/modules/container.py", line 26, in forward
    _7 = getattr(self, "7")
    _8 = getattr(self, "8")
    input0 = (_0).forward(input, )
              ~~~~~~~~~~~ <--- HERE
    input1 = (_1).forward(input0, )
    input2 = (_2).forward(input1, )
  File "code/__torch__/torch/nn/modules/conv.py", line 23, in forward
    weight = self.weight
    bias = self.bias
    _0 = (self)._conv_forward(input, weight, bias, )
          ~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _0
  def _conv_forward(self: __torch__.torch.nn.modules.conv.Conv2d,
  File "code/__torch__/torch/nn/modules/conv.py", line 29, in _conv_forward
    weight: Tensor,
    bias: Optional[Tensor]) -> Tensor:
    _1 = torch.conv2d(input, weight, bias, [1, 1], [0, 1], [1, 1])
         ~~~~~~~~~~~~ <--- HERE
    return _1
class Conv1d(Module):

Traceback of TorchScript, original code (most recent call last):
  File "/mnt/efs/blessingh/global_english/pruned_transducer_stateless7/zipformer.py", line 278, in forward
              of frames in `embeddings` before padding.
        """
        x = self.encoder_embed(x)
            ~~~~~~~~~~~~~~~~~~ <--- HERE

        x = x.permute(1, 0, 2)  # (N, T, C) -> (T, N, C)
  File "/mnt/efs/blessingh/global_english/pruned_transducer_stateless7/zipformer.py", line 1714, in forward
        # On entry, x is (N, T, idim)
        x = x.unsqueeze(1)  # (N, T, idim) -> (N, 1, T, idim) i.e., (N, C, H, W)
        x = self.conv(x)
            ~~~~~~~~~ <--- HERE
        # Now x is of shape (N, odim, (T-7)//2, ((idim-1)//2 - 1)//2)
        b, c, t, f = x.size()
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/torch/nn/modules/container.py", line 204, in forward
    def forward(self, input):
        for module in self:
            input = module(input)
                    ~~~~~~ <--- HERE
        return input
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    def forward(self, input: Tensor) -> Tensor:
        return self._conv_forward(input, self.weight, self.bias)
               ~~~~~~~~~~~~~~~~~~ <--- HERE
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
                            weight, bias, self.stride,
                            _pair(0), self.dilation, self.groups)
        return F.conv2d(input, weight, bias, self.stride,
               ~~~~~~~~ <--- HERE
                        self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (1 x 82). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
```

This is occurring on the server side, and the connection from the client breaks. Is this due to an abrupt closure from the client side, or to insufficient audio samples being sent for decoding? Awaiting your help.

-Sagar
csukuangfj commented 1 year ago

Is it reproducible? Could you print the shape (N, T, C) of the input tensor passed to the model?

gabor-pinter commented 1 year ago

Hi, yes, it is reproducible. We could reproduce it with two audio files so far, of 1599 and 1600 samples. Definitely small:

```
(1 x 82). Kernel size: (3 x 3). Kernel size can't be greater than actual input size
(2 x 39). Kernel size: (3 x 3). Kernel size can't be greater than actual input size
```
csukuangfj commented 1 year ago

If the sampling rate is 16 kHz, then 1600 samples correspond to only 0.1 seconds, or about 10 feature frames, which is far too few.

Does it crash if you use a longer audio file?
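
For reference, a sketch of this arithmetic (assuming 16 kHz audio and Kaldi-style fbank features with a 25 ms window and 10 ms shift; the exact feature config may differ):

```python
# Frame-count arithmetic, assuming 16 kHz audio and Kaldi-style
# fbank features (25 ms window, 10 ms shift, snip-edges behaviour).
def num_feature_frames(num_samples: int, sample_rate: int = 16000,
                       window_ms: float = 25.0, shift_ms: float = 10.0) -> int:
    window = int(sample_rate * window_ms / 1000)  # 400 samples
    shift = int(sample_rate * shift_ms / 1000)    # 160 samples
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift

print(num_feature_frames(1600))   # 8 frames: far below what the encoder needs
print(num_feature_frames(16000))  # 98 frames: fine
```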

gabor-pinter commented 1 year ago

Thanks for the quick response. No, it works fine with longer files. We are not sure about the exact threshold, but as a quick fix we filter out everything below 300 ms on the client side.

I traced the error with a debugger up to `OfflineConformerTransducerModel::RunEncoder`, where a dynamic function call is performed, calling `forward` on zipformer.py.

csukuangfj commented 1 year ago

The input has to have at least 9 feature frames to avoid crashes.

Could you print the input tensor shape?

csukuangfj commented 1 year ago

We suggest that the input tensor have at least 23 feature frames.

If you only have 9 feature frames, then after the first down-sampling layer only 1 feature frame is left. It is hard to decode something from only a single frame.
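
This matches the length arithmetic visible in the traceback above (`lengths = (x_lens - 7) >> 1`, i.e. `encoder_embed` keeps `(T - 7) // 2` frames); a quick sketch:

```python
# Time-axis length through encoder_embed, per the traceback:
# lengths = (x_lens - 7) >> 1, i.e. (T - 7) // 2 frames survive.
def frames_after_encoder_embed(t: int) -> int:
    return (t - 7) >> 1

for t in (8, 9, 23):
    print(t, "->", frames_after_encoder_embed(t))
# 8 -> 0  (too short: an intermediate 3x3 conv sees < 3 time steps and crashes)
# 9 -> 1  (runs, but a single frame is hard to decode)
# 23 -> 8
```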

gabor-pinter commented 1 year ago

> We suggest that the input tensor have at least 23 feature frames.

Thank you for the comment and the suggestion.

I would like to take some more steps towards avoiding server crashes.

(1) Is the minimal size requirement model/architecture specific? If yes, is there a way to determine it, e.g., by querying the model?

(2) Would more aggressive padding in the sample (or feature) domain solve the issue?

Again, thank you for your helpful comments.

csukuangfj commented 1 year ago

> (1) Is the minimal size requirement model/architecture specific?

Yes, it is hard-coded in the model since all zipformer models from icefall have such a constraint.

> (2) Would more aggressive padding in the sample (or feature) domain solve the issue?

Yes, I think padding the input wave would fix the issue.
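
For example, a minimal client-side sketch of that fix (the helper below is hypothetical, not a sherpa API), padding short waveforms with trailing silence so the encoder sees at least ~23 feature frames:

```python
import numpy as np

# Hypothetical client-side guard (not part of sherpa): pad short
# waveforms with trailing zeros before sending them for decoding.
# 23 frames with a 25 ms window and 10 ms shift need at least
# 400 + 22 * 160 = 3920 samples at 16 kHz; 0.3 s (4800 samples)
# stays comfortably above that.
MIN_SAMPLES = 4800

def pad_if_too_short(samples: np.ndarray,
                     min_samples: int = MIN_SAMPLES) -> np.ndarray:
    """Return samples padded with trailing zeros up to min_samples."""
    if samples.shape[0] >= min_samples:
        return samples
    return np.pad(samples, (0, min_samples - samples.shape[0]))
```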