microsoft / Olive

Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs.
https://microsoft.github.io/Olive/
MIT License
1.58k stars · 165 forks

Unable to use batch prediction with generated Whisper model using Olive #1288

Closed david-sitsky closed 1 month ago

david-sitsky commented 2 months ago

I've built various Whisper models using Olive from the examples; however, despite seeing lots of references to dynamic_axes and batch_size in user_script.py and friends, the resulting final model doesn't seem to support batching. Using Netron, it appears the internal encoder/decoder components do support batching, but the pre-processing code does not.

Is there a configuration option to enable this, or is it not possible? For reference, when I pass batched inputs, I get errors such as the following, which makes me think batching has not been enabled correctly in the model.

Is this possible to enable with Olive? Thanks in advance.

ai.onnxruntime.OrtException: Error code - ORT_INVALID_ARGUMENT - message: Invalid rank for input: num_beams Got: 2 Expected: 1 Please fix either the inputs/outputs or the model.
jambayk commented 2 months ago

Hi, as you mentioned, the audio decoder used in pre-processing only supports a 1-D input (https://github.com/microsoft/Olive/issues/354#issuecomment-1601242379), so batching cannot be enabled there. You can see in the onnxruntime-extensions source code that there are no dynamic axes for batch: https://github.com/microsoft/onnxruntime-extensions/blob/be29e28dd76f5fb8f2fdc7d9d3880be27b680ede/onnxruntime_extensions/_torch_cvt.py#L180
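
For anyone checking this on their own build, a quick way to confirm which inputs have a dynamic (symbolic) batch dimension is to dump the graph inputs with the onnx package; "model.onnx" below is a placeholder for the Olive output model path.

# Hypothetical check: print each graph input of the generated model together
# with its dimensions. Symbolic names (e.g. "batch_size") indicate dynamic
# axes; fixed integers mean that axis cannot be batched.
import onnx

model = onnx.load("model.onnx", load_external_data=False)
for inp in model.graph.input:
    dims = [
        d.dim_param if d.dim_param else d.dim_value
        for d in inp.type.tensor_type.shape.dim
    ]
    print(inp.name, dims)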

david-sitsky commented 2 months ago

Thanks @jambayk. Is there a way to configure Olive/Whisper to create a model which does not do any pre-processing (I'll do it outside of the model), so that I get a model that can accept batched input?

jambayk commented 2 months ago

I haven't tried it myself, but you can try removing the prepost component of the workflow when generating the workflow config: https://github.com/microsoft/Olive/blob/1ce4b5f84b8a7eb3b9036ebe940d0017d457ca9b/examples/whisper/prepare_whisper_configs.py#L15

That should give you a model without any of the pre/post-processing graphs. There is currently no option to disable just the pre-processing part and keep the post-processor.
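
A minimal sketch of what that could look like, assuming the generated config is a JSON file with a "prepost" entry under "passes" (the exact file name and pass key are assumptions; check the JSON that prepare_whisper_configs.py writes out for your version):

# Drop the pre/post-processing pass from a generated Whisper workflow config
# before running `olive run`. "whisper_cpu_int8.json" and the "prepost" key
# are assumptions based on the example's template.
import json

config_path = "whisper_cpu_int8.json"
with open(config_path) as f:
    config = json.load(f)

# Remove the pass that appends the audio pre-processing / post-processing graphs.
config["passes"].pop("prepost", None)

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)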

david-sitsky commented 2 months ago

I tried that, but sadly something still doesn't seem right:

Traceback (most recent call last):
  File "/usr/local/bin/olive", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/olive/cli/launcher.py", line 44, in main
    service.run()
  File "/usr/local/lib/python3.10/dist-packages/olive/cli/run.py", line 42, in run
    olive_run(**var_args)
  File "/usr/local/lib/python3.10/dist-packages/olive/workflows/run/run.py", line 297, in run
    return run_engine(package_config, run_config, data_root)
  File "/usr/local/lib/python3.10/dist-packages/olive/workflows/run/run.py", line 261, in run_engine
    engine.run(
  File "/usr/local/lib/python3.10/dist-packages/olive/engine/engine.py", line 267, in run
    run_result = self.run_accelerator(
  File "/usr/local/lib/python3.10/dist-packages/olive/engine/engine.py", line 339, in run_accelerator
    output_footprint = self.run_no_search(
  File "/usr/local/lib/python3.10/dist-packages/olive/engine/engine.py", line 431, in run_no_search
    should_prune, signal, model_ids = self._run_passes(
  File "/usr/local/lib/python3.10/dist-packages/olive/engine/engine.py", line 846, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "/usr/local/lib/python3.10/dist-packages/olive/engine/engine.py", line 1052, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "/usr/local/lib/python3.10/dist-packages/olive/systems/local.py", line 47, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "/usr/local/lib/python3.10/dist-packages/olive/evaluator/olive_evaluator.py", line 205, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "/usr/local/lib/python3.10/dist-packages/olive/evaluator/olive_evaluator.py", line 123, in _evaluate_latency
    latencies = self._evaluate_raw_latency(
  File "/usr/local/lib/python3.10/dist-packages/olive/evaluator/olive_evaluator.py", line 762, in _evaluate_raw_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "/usr/local/lib/python3.10/dist-packages/olive/evaluator/olive_evaluator.py", line 543, in _evaluate_onnx_latency
    latencies = session.time_run(
  File "/usr/local/lib/python3.10/dist-packages/olive/common/ort_inference.py", line 334, in time_run
    self.session.run(input_feed=input_feed, output_names=None)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 216, in run
    self._validate_input(list(input_feed.keys()))
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 198, in _validate_input
    raise ValueError(
ValueError: Required inputs (['input_features']) are missing from input feed (['max_length', 'min_length', 'num_beams', 'num_return_sequences', 'length_penalty', 'repetition_penalty', 'decoder_input_ids']).
david-sitsky commented 2 months ago

OK, so I had to update whisper_dataset.py so that input_features was passed in appropriately, and I removed the existing pre-processed audio input.
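
For reference, here is a rough sketch of the kind of input dictionary the dataset needs to produce once the pre-processing graph is gone. The input names follow the list in the ValueError above; the processor checkpoint and the beam-search values are illustrative assumptions.

# Compute log-mel features on the host with the Hugging Face processor and
# feed them as "input_features" instead of raw audio bytes.
import numpy as np
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")  # assumed checkpoint

def make_inputs(audio_array, sampling_rate=16000):
    features = processor(
        audio_array, sampling_rate=sampling_rate, return_tensors="np"
    ).input_features  # float32, shape (1, 80, 3000)
    # Forced decoder prompt for a multilingual checkpoint; English-only models
    # use a shorter prompt, so derive these ids from your own tokenizer.
    prompt_ids = processor.tokenizer.convert_tokens_to_ids(
        ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
    )
    return {
        "input_features": features,
        "max_length": np.array([200], dtype=np.int32),
        "min_length": np.array([0], dtype=np.int32),
        "num_beams": np.array([2], dtype=np.int32),
        "num_return_sequences": np.array([1], dtype=np.int32),
        "length_penalty": np.array([1.0], dtype=np.float32),
        "repetition_penalty": np.array([1.0], dtype=np.float32),
        "decoder_input_ids": np.array([prompt_ids], dtype=np.int32),
    }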

david-sitsky commented 2 months ago

@jambayk - I was able to create a model without the pre/post-processing, which suits me fine as I have code which can handle that. However, while input_features and decoder_input_ids are "batched", the beam search parameters are not, and this seems to prevent me from performing "batched inferencing". Is there a way to make all the parameters batched?

Any ideas or have I misunderstood something?

[screenshot of the model's input signature]

david-sitsky commented 1 month ago

I have now managed to get batching working with the above model.
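
For anyone hitting the same wall, a sketch of what batched inference can look like under the input signature reported earlier in the thread: input_features and decoder_input_ids carry the batch dimension, while the beam-search parameters stay rank-1 tensors of size 1 that apply to the whole batch. The model path, shapes, and values below are illustrative assumptions, not the exact code used here.

# Batched inference against the beam-search wrapped model (no pre/post graphs).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("whisper_no_prepost.onnx")  # hypothetical path

batch = 4
# Real input_features come from the feature extractor (see the sketch above);
# random data is used here only to show the expected shapes.
inputs = {
    "input_features": np.random.rand(batch, 80, 3000).astype(np.float32),
    "decoder_input_ids": np.tile(
        np.array([[50258, 50259, 50359, 50363]], dtype=np.int32), (batch, 1)
    ),  # same forced prompt for every item; token ids depend on the checkpoint
    "max_length": np.array([200], dtype=np.int32),
    "min_length": np.array([0], dtype=np.int32),
    "num_beams": np.array([2], dtype=np.int32),
    "num_return_sequences": np.array([1], dtype=np.int32),
    "length_penalty": np.array([1.0], dtype=np.float32),
    "repetition_penalty": np.array([1.0], dtype=np.float32),
}

# Output sequences have shape (batch, num_return_sequences, max_length).
sequences = session.run(None, inputs)[0]
print(sequences.shape)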