Triton ensemble model configuration for transducer models

pavankumar-ds commented 1 year ago

Hello, could you please provide a reference configuration for the ensemble transducer model in the example repository for pure Triton-based inference? Specifically, how do we feed the variable y from the scorer back into the decoder input? Also, the template is missing the joiner_encoder_proj and joiner_decoder_proj parts.

uni-saurabh-vyas commented 1 year ago

I am working on this now. Following the docs, I managed to get the Triton server up and running (all components, including the ensemble transducer, are loaded), but when I try to start the client script, it fails.

python3 decode_manifest_triton.py --manifest-filename /mnt/efs/dspavankumar/e/tamil_icefall/data/test_re/icefall_manifests/cuts_1.jsonl.gz --server-addr 0.0.0.0 --server-port 8001 --streaming  --model-name transducer --chunk_size 16 --context 2

tritonclient.utils.InferenceServerException: [StatusCode.INVALID_ARGUMENT] in ensemble 'transducer', inference request for sequence 10107 to model 'feature_extractor' must specify the START flag on the first request of the sequence

One odd thing I noticed is that when I start the server, I see these warnings/errors:

Cleaning up...
free(): invalid pointer
free(): invalid pointer

Is this possibly related to this memory leak issue? https://github.com/triton-inference-server/server/issues/3777

uni-saurabh-vyas commented 1 year ago

Also, I am trying the pretrained model from the section "Deploy onnx with arbitrary pruned_transducer_stateless_X(2,3,4,5) model for Chinese or English recipes" at https://github.com/k2-fsa/sherpa/tree/master/triton

After downloading the model files, I am getting the following error:

./pruned_transducer_stateless3/export_onnx.py \
    --exp-dir ./icefall_librispeech_streaming_pruned_transducer_stateless3_giga_0.9_20220625/exp \
    --tokenizer-file ./icefall_librispeech_streaming_pruned_transducer_stateless3_giga_0.9_20220625/data/lang_bpe_500/bpe.model \
    --epoch 999 \
    --avg 1 \
    --streaming-model 1 \
    --causal-convolution 1 \
    --onnx 1 \
    --left-context 64 \
    --right-context 4 \
    --fp16

      sp.load(params.tokenizer_file)
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

uni-saurabh-vyas commented 1 year ago

@csukuangfj

I am getting this error when I try to run the default streaming example provided in the sherpa/triton folder (https://github.com/k2-fsa/sherpa/tree/master/triton/model_repo_streaming):

tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] in ensemble 'transducer', Failed to process the request(s) for model instance 'feature_extractor_0_0', message: Exception: ('Invalid first chunk size', 9360, 14880)

https://github.com/k2-fsa/sherpa/blob/master/triton/model_repo_streaming/feature_extractor/1/model.py#L46

Did you encounter this issue as well? What is the current status of the transducer setup for Triton; is it stable for you? I would appreciate any pointers for addressing this issue. I can spend some time fixing known issues, if there are any, or try something else if you would like me to.

Also, as mentioned in my previous comment (https://github.com/k2-fsa/sherpa/issues/371#issuecomment-1530912862), I suspect it is related to a memory leak issue in the kaldifeat library. If this is a known issue, would you suggest trying a different library for feature extraction?

csukuangfj commented 1 year ago

@yuekaizhang

Could you help to have a look at this issue?

yuekaizhang commented 1 year ago

Hi, thanks for trying this triton recipe.

inference request for sequence 10107 to model 'feature_extractor' must specify the START flag on the first request of the sequence
This error may be caused by an outdated request. Right after the service starts, before it is sufficiently warmed up, a request may be cleared due to a timeout, and chunks of that sequence arriving later then lose their start flag. You may first try to warm up the service with a small batch size and low concurrency.

free(): invalid pointer
I do not know the cause of this warning yet, but it should be harmless.

tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] in ensemble 'transducer', Failed to process the request(s) for model instance 'feature_extractor_0_0', message: Exception: ('Invalid first chunk size', 9360, 14880)
This issue is caused by --context 2, which is for wenet models (https://github.com/k2-fsa/sherpa/blob/master/triton/client/decode_manifest_triton.py#L161). For icefall models, you should use --encoder_right_context instead.
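
For readers unfamiliar with the START flag mentioned above: Triton's sequence batcher expects the first request of each sequence ID to be marked as the start of the sequence, and the tritonclient infer() call accepts sequence_id, sequence_start, and sequence_end keyword arguments for this purpose. The following is only a minimal sketch of how per-chunk flags could be derived on the client side; the helper name with_sequence_flags is hypothetical and is not taken from decode_manifest_triton.py.

```python
from typing import Iterator, List, Tuple


def with_sequence_flags(chunks: List[bytes]) -> Iterator[Tuple[bytes, bool, bool]]:
    """Yield (chunk, sequence_start, sequence_end) for the chunks of one utterance.

    Hypothetical helper: the two booleans are what a client would pass as
    sequence_start / sequence_end alongside a fixed sequence_id.
    """
    last = len(chunks) - 1
    for i, chunk in enumerate(chunks):
        # Only the first chunk carries START, only the last carries END.
        yield chunk, i == 0, i == last


flags = [(start, end) for _, start, end in with_sequence_flags([b"a", b"b", b"c"])]
print(flags)  # [(True, False), (False, False), (False, True)]
```

If the request carrying the START flag is dropped (for example by a timeout, as suggested above), every later chunk of that sequence arrives with sequence_start=False, which would produce the INVALID_ARGUMENT error shown in this thread.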

uni-saurabh-vyas commented 1 year ago

Hi @yuekaizhang thanks for your response.

I ensured that the config parameters in $model_repo_path/*/config.pbtxt match the properties in the ONNX export log file icefall_librispeech_streaming_pruned_transducer_stateless3_giga_0.9_20220625/exp/onnx_export.log

For reference:

ENCODER_LEFT_CONTEXT: 64
ENCODER_RIGHT_CONTEXT: 4
ENCODER_DIM: 512
DECODER_DIM: 512
VOCAB_SIZE: 500
DECODER_CONTEXT_SIZE: 2
CNN_MODULE_KERNEL: 31
ENCODER_LAYERS: 12
All params:{'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 3000, 'env_info': {'k2-version': '1.23.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '62e404dd3f3a811d73e424199b3408e309c06e1a', 'k2-git-date': 'Mon Jan 30 02:26:16 2023', 'lhotse-version': '1.12.0', 'torch-version': '1.13.0', 'torch-cuda-available': True, 'torch-cuda-version': '11.6', 'python-version': '3.1', 'icefall-git-branch': None, 'icefall-git-sha1': None, 'icefall-git-date': None, 'icefall-path': '/mnt/efs/dspavankumar/tools/icefall', 'k2-path': '/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/k2/__init__.py', 'lhotse-path': '/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': 'ip-10-40-5-20', 'IP address': '127.0.0.1'}, 'epoch': 1111, 'iter': 0, 'avg': 1, 'exp_dir': PosixPath('icefall_librispeech_streaming_pruned_transducer_stateless3_giga_0.9_20220625/exp'), 'tokenizer_file': './icefall_librispeech_streaming_pruned_transducer_stateless3_giga_0.9_20220625/data/lang_bpe_500/bpe.model', 'onnx': True, 'context_size': 2, 'left_context': 64, 'right_context': 4, 'streaming_model': True, 'fp16': True, 'dynamic_chunk_training': False, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'blank_id': 0, 'vocab_size': 500}

Then I ran the client again:

python3 decode_manifest_triton.py --encoder_right_context 4 --chunk_size 16 --manifest-filename /mnt/efs/dspavankumar/e/tamil_icefall/data/test_re/icefall_manifests/cuts.jsonl.gz --server-addr 0.0.0.0 --server-port 8001 --streaming --model-name transducer

Still getting same error

task-48: 0/221
task-49: 0/221
Traceback (most recent call last):
  File "/mnt/efs/dspavankumar/tools/sherpa/triton/client/decode_manifest_triton.py", line 485, in <module>
    asyncio.run(main())
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/mnt/efs/dspavankumar/tools/sherpa/triton/client/decode_manifest_triton.py", line 433, in main
    ans_list = await asyncio.gather(*tasks)
  File "/mnt/efs/dspavankumar/tools/sherpa/triton/client/decode_manifest_triton.py", line 316, in send_streaming
    response = await triton_client.infer(model_name,
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/tritonclient/grpc/aio/__init__.py", line 727, in infer
    raise_error_grpc(rpc_error)
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/tritonclient/grpc/__init__.py", line 62, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] in ensemble 'transducer', Failed to process the request(s) for model instance 'feature_extractor_0_1', message: Exception: ('Invalid first chunk size', 12640, 14880)

At:
  /mnt/efs/dspavankumar/tools/sherpa/triton/model_repo_streaming_pretrained/feature_extractor/1/model.py(47): add_wavs
  /mnt/efs/dspavankumar/tools/sherpa/triton/model_repo_streaming_pretrained/feature_extractor/1/model.py(221): execute

" This error may be caused by outdated request. At the beginning of the service startup, due to insufficient warming up, if a request is cleared due to timeout, it will cause later arriving chunks to lose their start flag. You may first try to warmup service with small batch size and concurrency."

I also tried with --num-tasks 1 argument, but it still fails.

/mnt/efs/dspavankumar/tools/sherpa/triton/client$ python3 decode_manifest_triton.py --num-tasks 1 --encoder_right_context 4 --chunk_size 16 --manifest-filename /mnt/efs/dspavankumar/e/tamil_icefall/data/test_re/icefall_manifests/cuts.jsonl.gz --server-addr 0.0.0.0 --server-port 8001 --streaming --model-name transducer
task-0: 0/11077
/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/lhotse/audio.py:164: UserWarning: You requested a subset of a recording that is read from disk via a bash command. Expect large I/O overhead if you are going to read many chunks like these, since every time we will read the whole file rather than its subset.
  warnings.warn(
Traceback (most recent call last):
  File "/mnt/efs/dspavankumar/tools/sherpa/triton/client/decode_manifest_triton.py", line 485, in <module>
    asyncio.run(main())
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/mnt/efs/dspavankumar/tools/sherpa/triton/client/decode_manifest_triton.py", line 433, in main
    ans_list = await asyncio.gather(*tasks)
  File "/mnt/efs/dspavankumar/tools/sherpa/triton/client/decode_manifest_triton.py", line 316, in send_streaming
    response = await triton_client.infer(model_name,
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/tritonclient/grpc/aio/__init__.py", line 727, in infer
    raise_error_grpc(rpc_error)
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/tritonclient/grpc/__init__.py", line 62, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INVALID_ARGUMENT] in ensemble 'transducer', inference request for sequence 10086 to model 'feature_extractor' must specify the START flag on the first request of the sequence

yuekaizhang commented 1 year ago

https://github.com/k2-fsa/sherpa/blob/master/triton/client/decode_manifest_triton.py#L381-L383

Here, please check your first_chunk_ms and decode_window_length:

decode_window_length = (args.chunk_size + 2 + args.encoder_right_context) * args.subsampling + 3
first_chunk_ms = (decode_window_length + add_frames) * frame_shift_ms

decode_window_length should be (16 + 2 + 4) * 4 + 3 = 91, and add_frames should be 2.
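
As a worked check of the two formulas above, the sketch below plugs in the values used in this thread (chunk_size 16, encoder_right_context 4, subsampling 4, add_frames 2) together with a 10 ms frame shift, which is the value reported from the debugger in this thread. The 16 kHz sample rate is an assumption; it is not stated in the thread, but it makes the result agree with the 14880 in the error message.

```python
# Values taken from the thread.
chunk_size = 16
encoder_right_context = 4
subsampling = 4
add_frames = 2
frame_shift_ms = 10

# Assumption: 16 kHz audio (not stated explicitly in the thread).
sample_rate = 16000

decode_window_length = (chunk_size + 2 + encoder_right_context) * subsampling + 3
first_chunk_ms = (decode_window_length + add_frames) * frame_shift_ms
first_chunk_samples = first_chunk_ms * sample_rate // 1000

print(decode_window_length, first_chunk_ms, first_chunk_samples)  # 91 930 14880
```

Under that assumption, the second number in ('Invalid first chunk size', 12640, 14880) appears to be the expected first-chunk length in samples, and the first number would be the length actually received.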

uni-saurabh-vyas commented 1 year ago

I have checked, these values are correct.

ipdb> decode_window_length
91
ipdb> print(args.chunk_size )
16
ipdb> print(args.encoder_right_context)
4
ipdb> print(args.subsampling)
4
ipdb> print(add_frames)
2
ipdb> print(frame_shift_ms)
10

yuekaizhang commented 1 year ago

Failed to process the request(s) for model instance 'feature_extractor_0_1', message: Exception: ('Invalid first chunk size', 12640, 14880)

If the values are correct, could you trace back to figure out where this 12640 number comes from?

uni-saurabh-vyas commented 1 year ago

Hi @yuekaizhang

I noticed that in wav_segs (https://github.com/k2-fsa/sherpa/blob/master/triton/client/decode_manifest_triton.py#L269), the last segment has a different number of samples (length) from all the other segments, which causes the issue.

After adding del(wav_segs[-1]) at https://github.com/k2-fsa/sherpa/blob/master/triton/client/decode_manifest_triton.py#L282, the problem is fixed.

Do you think this is a bug?
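
To illustrate why the last segment can differ in length, here is a minimal sketch of cutting a waveform into a longer first chunk followed by fixed-size chunks. It is not the code from decode_manifest_triton.py: the function name split_waveform is hypothetical, and the 10240-sample size for later chunks (16 * 4 * 10 ms at an assumed 16 kHz) is an illustrative assumption.

```python
from typing import List


def split_waveform(num_samples: int, first_chunk: int, other_chunk: int) -> List[int]:
    """Return the segment lengths obtained when cutting num_samples samples.

    The first segment uses first_chunk samples, later ones other_chunk;
    whatever remains at the end becomes a shorter final segment.
    """
    lengths = []
    offset = 0
    while offset < num_samples:
        size = first_chunk if offset == 0 else other_chunk
        seg = min(size, num_samples - offset)  # final segment may be shorter
        lengths.append(seg)
        offset += seg
    return lengths


print(split_waveform(40000, 14880, 10240))  # [14880, 10240, 10240, 4640]
print(split_waveform(12640, 14880, 10240))  # [12640]
```

In the second call the whole utterance is shorter than the first chunk, so its only segment has 12640 samples, which would match the first number in the 'Invalid first chunk size' error above.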

yuekaizhang commented 1 year ago

I am not sure. If it is a bug, it would be in feature_extractor/1/model.py rather than in the client. Could you make sure that the assertion assert len(self.wav) > 0 at https://github.com/k2-fsa/sherpa/blob/master/triton/model_repo_streaming/feature_extractor/1/model.py#L52 always holds? Otherwise, there is a problem somewhere.

If you keep that last segment, I don't understand why len(self.wav) would become 0 at https://github.com/k2-fsa/sherpa/blob/master/triton/model_repo_streaming/feature_extractor/1/model.py#L45 for any chunk other than the first.

How did you fix the previous issue, "inference request for sequence 10107 to model 'feature_extractor' must specify the START flag on the first request of the sequence"? I think it may be related to the outdated request.

uni-saurabh-vyas commented 1 year ago

Good observation. That error was caused by a few very short cuts (< 0.3 seconds) in the jsonl file. I used a different cuts file that did not contain such short cuts, and I think that fixed that particular issue.
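
A minimal sketch of screening out short cuts before sending them to the server is shown below. It uses plain dictionaries rather than a real manifest, the helper name drop_short_cuts is hypothetical, and the 0.93 s threshold is an assumption derived from the 930 ms first chunk discussed earlier in the thread, not a value confirmed by the maintainers.

```python
from typing import Dict, List


def drop_short_cuts(cuts: List[Dict], min_duration: float) -> List[Dict]:
    """Keep only cuts whose duration (in seconds) is at least min_duration."""
    return [c for c in cuts if c["duration"] >= min_duration]


# Illustrative data only; real cuts come from the jsonl manifest.
cuts = [
    {"id": "a", "duration": 0.25},
    {"id": "b", "duration": 0.79},
    {"id": "c", "duration": 3.20},
]

# Assumption: cuts shorter than the first chunk (0.93 s here) are the problematic ones.
kept = drop_short_cuts(cuts, min_duration=0.93)
print([c["id"] for c in kept])  # ['c']
```

With lhotse, a similar duration predicate could presumably be applied through CutSet.filter instead of a list comprehension.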

yuekaizhang commented 1 year ago

Okay, closing the issue since it is fixed.

k2-fsa / sherpa

Triton ensemble model configuration for transducer models #371