microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
https://nni.readthedocs.io

Does NNI ModelSpeedupTensorRT support Encoder-Decoder models? #5801

Open donjuanpond opened 1 month ago

donjuanpond commented 1 month ago

Question: I have an encoder-decoder model that was quantized using TensorRT's packages for post-training quantization. It is in the HuggingFace Transformers saved-model format. The model is a TrOCR model, which is implemented with the HuggingFace VisionEncoderDecoder class. With Transformers, the encoder and decoder live in a single model, but when the model is exported to ONNX format, the encoder and decoder become two separate ONNX files.
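For context, this is roughly how the model looks on my end before any conversion (the checkpoint name below is just an example stand-in for my fine-tuned model):

from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Example checkpoint; my actual model is a fine-tuned TrOCR variant.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# The encoder (ViT) and the text decoder live inside one nn.Module ...
print(type(model.encoder).__name__, type(model.decoder).__name__)

# ... and saving produces a single checkpoint directory.
model.save_pretrained("trocr_checkpoint")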

I am trying to run this model through ModelSpeedupTensorRT, following the tutorial here: https://nni.readthedocs.io/en/stable/tutorials/quantization_speedup.html. When I called engine.compress_with_calibrator(calib) with a calibrator I built from a dataloader, I hit an error where the internal conversion to ONNX format was eating up all of my CPU RAM for some reason. To work around this, I converted the model myself using the HuggingFace Optimum interface for ONNX Runtime.
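Roughly what I did for the export (the paths are placeholders, and depending on the Optimum version it may also emit a decoder_with_past_model.onnx):

from optimum.onnxruntime import ORTModelForVision2Seq

# Export the VisionEncoderDecoder checkpoint to ONNX via Optimum / ONNX Runtime.
# "trocr_checkpoint" stands in for my fine-tuned model directory.
ort_model = ORTModelForVision2Seq.from_pretrained("trocr_checkpoint", export=True)
ort_model.save_pretrained("trocr_onnx")

# The output directory now holds separate files, e.g.:
#   trocr_onnx/encoder_model.onnx
#   trocr_onnx/decoder_model.onnx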

When editing the source code to accommodate this, I found the implementation of the build_engine_with_calib() method that compress_with_calibrator() calls:

def build_engine_with_calib(onnx_model_file, calib, input_shape):
    """
    Parameters
    ----------
    onnx_model_file : str
        Path to a single ONNX model file.
    calib : trt.IInt8Calibrator
        INT8 calibrator built from the calibration data loader.
    input_shape : tuple
        Fixed input shape; input_shape[0] is the batch size.
    """
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(common.explicit_batch())
    trt_config = builder.create_builder_config()
    parser = trt.OnnxParser(network, TRT_LOGGER)

    builder.max_batch_size = input_shape[0]
    trt_config.max_workspace_size = common.GiB(8)
    trt_config.set_flag(trt.BuilderFlag.INT8)
    trt_config.set_flag(trt.BuilderFlag.FP16)
    trt_config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
    trt_config.int8_calibrator = calib

    with open(onnx_model_file, 'rb') as model:
        if not parser.parse(model.read()):
            for error in range(parser.num_errors):
                TRT_LOGGER.log(TRT_LOGGER.ERROR, parser.get_error(error))
            raise ValueError('Failed to parse the ONNX file.')

    TRT_LOGGER.log(TRT_LOGGER.INFO, f'input number: {network.num_inputs}')
    TRT_LOGGER.log(TRT_LOGGER.INFO, f'output number: {network.num_outputs}')

    profile = builder.create_optimization_profile()
    input_name = network.get_input(0).name
    profile.set_shape(input_name, min=input_shape, opt=input_shape, max=input_shape)
    trt_config.add_optimization_profile(profile)

    config_network_to_int8(network) # not sure whether it is necessary because trt.BuilderFlag.INT8 is set.

    engine = builder.build_engine(network, trt_config)
    return engine

I noticed here that the ONNX model is read from a single file, not from a directory. Because of this, will my VisionEncoderDecoder model not work with ModelSpeedupTensorRT, since it is saved as two separate ONNX files? Is there any way for me to make it work? The only workaround I can think of is sketched below.
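To clarify what I mean, the only idea I have is to call the same build pattern once per exported file, something like this rough sketch (the file names come from the Optimum export above; the calibrators and shapes are placeholders, and I am not sure the single-input optimization profile in build_engine_with_calib() would even work for the decoder, which has multiple inputs):

# Hypothetical workaround: build one TensorRT engine per exported ONNX file,
# reusing the build_engine_with_calib() shown above on each file separately.
encoder_engine = build_engine_with_calib(
    "trocr_onnx/encoder_model.onnx",
    encoder_calib,        # placeholder calibrator fed with pixel_values batches
    (1, 3, 384, 384),     # placeholder input shape for the TrOCR-base image encoder
)
decoder_engine = build_engine_with_calib(
    "trocr_onnx/decoder_model.onnx",
    decoder_calib,        # placeholder calibrator for the decoder inputs
    (1, 1),               # placeholder shape for the first decoder input (input_ids)
)

# Even if both engines build, the encoder output would have to be fed into the
# decoder manually at inference time, which ModelSpeedupTensorRT does not do for me.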

somebody4545-alt commented 4 weeks ago

This question's really been bugging me too. Any solutions?