MrRace opened this issue 2 years ago
Without info on how exactly you're running the model it's hard to say.

> Without info on how exactly you're running the model it's hard to say.

Neither a T4 GPU nor a V100 GPU improves inference speed. My inference code:
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()

# Set graph optimization level
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

# To enable model serialization after graph optimization set this
sess_options.optimized_model_filepath = "<model_output_path>/optimized_model.onnx"

session = rt.InferenceSession("<model_path>", sess_options)
```
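One thing worth checking with a setup like this is whether the CUDA execution provider is actually being used for the session. A minimal sketch (assuming onnxruntime-gpu is installed; the model path here is a placeholder):

```python
import onnxruntime as rt

sess_options = rt.SessionOptions()

# List the execution providers this build of onnxruntime offers;
# onnxruntime-gpu should include CUDAExecutionProvider.
print(rt.get_available_providers())

# Pin the session to the CUDA EP explicitly.
session = rt.InferenceSession(
    "model_fp16.onnx",
    sess_options,
    providers=["CUDAExecutionProvider"],
)
print(session.get_providers())  # the providers actually in use for this session
```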
@skottmckay Here are some previous experiments.
You could temporarily set the logging level to see which execution provider each node is assigned to, and check that they're all assigned to the CUDA EP. Look for 'Node placements' in the output.
```python
sess_options.log_severity_level = 0  # VERBOSE level
```
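For context, the node placement information is emitted while the session is being built, so a minimal sketch of this check would be:

```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.log_severity_level = 0  # VERBOSE level

# Node placements are logged during session creation; create the session
# and search the output for 'Node placements'.
session = rt.InferenceSession("model_fp16.onnx", sess_options)
```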
If all nodes are assigned to the CUDA EP, it's possible some of the operators don't have an fp16 implementation, causing the insertion of Cast nodes to convert between fp32 and fp16. You could inspect the optimized model written to `optimized_model_filepath` to see if that is the case.
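One way to do that inspection is with the onnx Python package; a sketch, assuming the optimized model was written to `optimized_model.onnx`:

```python
import onnx
from collections import Counter

# Load the optimized model that onnxruntime wrote out via
# optimized_model_filepath and count the operator types.
model = onnx.load("optimized_model.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)
print(op_counts)

# A large number of Cast nodes suggests fp32<->fp16 round-trips around
# operators that lack an fp16 kernel.
print("Cast nodes:", op_counts.get("Cast", 0))
```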
I use `torch.onnx.export` to convert YOLOv5, which is a PyTorch model, to ONNX format. The raw ONNX model is `model.onnx`. Then I use onnxmltools to convert the float32 ONNX model to float16. When I use onnxruntime-gpu (1.8) to run inference with `model_fp16.onnx`, it is not any faster. Why?
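The conversion snippet itself was not shown; a minimal sketch, assuming the standard onnxmltools `convert_float_to_float16` helper was used:

```python
import onnxmltools
from onnxmltools.utils.float16_converter import convert_float_to_float16

# Load the float32 model exported by torch.onnx.export, convert its
# tensor types to float16, and save the result.
onnx_model = onnxmltools.utils.load_model("model.onnx")
fp16_model = convert_float_to_float16(onnx_model)
onnxmltools.utils.save_model(fp16_model, "model_fp16.onnx")
```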