netease-youdao / BCEmbedding

Netease Youdao's open-source embedding and reranker models for RAG products.
Apache License 2.0

Question about inference acceleration #41

Open Gcstk opened 4 months ago

Gcstk commented 4 months ago

Your work, especially on the bilingual side, is a huge contribution to the open-source RAG community! When deploying for inference, I converted the model to ONNX with no visible precision loss, using opset version 17, torch 2.1.2, and onnx 1.14.1. However, when converting the ONNX model to TRT, I get this warning: [2024-04-05 03:07:58 WARNING] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32. The resulting TRT model has a very large precision error; used for retrieval it reaches only about 3% accuracy. Is this caused by the INT64 weights in the model? A downcast alone shouldn't cause this much precision loss, though. Any help when you have time would be much appreciated. The conversion code is as follows:

import torch
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained('./bce-emb')

def make_train_dummy_input(seq_len):
    # Dummy token ids 0..seq_len-1 and an all-ones attention mask.
    org_input_ids = torch.arange(seq_len, dtype=torch.int32).unsqueeze(0)
    org_input_mask = torch.ones((1, seq_len), dtype=torch.int32)
    return (org_input_ids.to(device), org_input_mask.to(device))

model.eval()

with torch.no_grad():
    model = model.to(device)
    org_dummy_input = make_train_dummy_input(64)
    torch.onnx.export(model,
                      org_dummy_input,
                      "model17.onnx",
                      verbose=True,
                      opset_version=17,
                      # The order must match the model's forward signature;
                      # changing it silently mismatches the inputs.
                      input_names=['input_ids', 'attention_mask'],
                      # Mind the order here too, or the wrong output_names
                      # may be used at inference time.
                      output_names=['logits'],
                      do_constant_folding=True,
                      dynamic_axes={"input_ids": {0: "batch_size", 1: "sequence_length"},
                                    "attention_mask": {0: "batch_size", 1: "sequence_length"},
                                    "logits": {0: "batch_size"}})
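
For reference, a minimal sketch of the parity check behind "no visible precision loss after ONNX export"; onnxruntime is assumed, the names come from the export script above, and the tolerance is illustrative:

import numpy as np
import onnxruntime as ort

# Reuse the model and dummy inputs from the export script above.
ids, mask = make_train_dummy_input(64)
with torch.no_grad():
    torch_out = model(input_ids=ids, attention_mask=mask)[0].cpu().numpy()

sess = ort.InferenceSession("model17.onnx", providers=["CPUExecutionProvider"])
onnx_out = sess.run(None, {"input_ids": ids.cpu().numpy(),
                           "attention_mask": mask.cpu().numpy()})[0]

# Max absolute difference on the last hidden state; a value around 1e-5
# or below is what "no visible precision loss" should look like.
print(np.abs(torch_out - onnx_out).max())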

The TRT conversion setup: NVIDIA's official Docker image nvcr.io/nvidia/tensorrt:23.06-py3, with TensorRT 8.6.1. The trtexec CLI invocation:

trtexec --onnx=/workspace/bce-emb.onnx \
--saveEngine=/workspace/model.plan \
--minShapes=input_ids:1x1,attention_mask:1x1 \
--optShapes=input_ids:4x128,attention_mask:4x128 \
--maxShapes=input_ids:64x512,attention_mask:64x512 \
--memPoolSize=workspace:8192MiB \
--fp16
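
To compare the engine's actual output with PyTorch (trtexec only feeds random inputs), something like the following can be used. This is a sketch against the TensorRT 8.6 named-tensor Python API, with pycuda assumed for device buffers, not the original poster's code; the second output name "1488" is taken from the log below. Rebuilding once without --fp16 and diffing again would also separate the INT64 downcast from FP16 sensitivity.

import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("/workspace/model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Use real token ids here rather than zeros to make the comparison meaningful.
ids = np.zeros((4, 128), dtype=np.int32)
mask = np.ones((4, 128), dtype=np.int32)
context.set_input_shape("input_ids", ids.shape)
context.set_input_shape("attention_mask", mask.shape)

hidden = np.empty(tuple(context.get_tensor_shape("logits")), dtype=np.float32)
# The engine exposes a second output, "1488" (4x768 in the log below, likely
# the pooled embedding); every I/O tensor needs an address before execution.
pooled = np.empty(tuple(context.get_tensor_shape("1488")), dtype=np.float32)

host = {"input_ids": ids, "attention_mask": mask, "logits": hidden, "1488": pooled}
dev = {name: cuda.mem_alloc(arr.nbytes) for name, arr in host.items()}
for name in ("input_ids", "attention_mask"):
    cuda.memcpy_htod(dev[name], host[name])
for name, mem in dev.items():
    context.set_tensor_address(name, int(mem))

stream = cuda.Stream()
context.execute_async_v3(stream.handle)
stream.synchronize()
cuda.memcpy_dtoh(hidden, dev["logits"])
cuda.memcpy_dtoh(pooled, dev["1488"])
# hidden/pooled can now be diffed against the PyTorch outputs.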

The log output:

[04/05/2024-12:27:54] [I] === Model Options ===
[04/05/2024-12:27:54] [I] Format: ONNX
[04/05/2024-12:27:54] [I] Model: /workspace/bce-emb.onnx
[04/05/2024-12:27:54] [I] Output:
[04/05/2024-12:27:54] [I] === Build Options ===
[04/05/2024-12:27:54] [I] Max batch: explicit batch
[04/05/2024-12:27:54] [I] Memory Pools: workspace: 8192 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[04/05/2024-12:27:54] [I] minTiming: 1
[04/05/2024-12:27:54] [I] avgTiming: 8
[04/05/2024-12:27:54] [I] Precision: FP32
[04/05/2024-12:27:54] [I] LayerPrecisions:
[04/05/2024-12:27:54] [I] Layer Device Types:
[04/05/2024-12:27:54] [I] Calibration:
[04/05/2024-12:27:54] [I] Refit: Disabled
[04/05/2024-12:27:54] [I] Version Compatible: Disabled
[04/05/2024-12:27:54] [I] TensorRT runtime: full
[04/05/2024-12:27:54] [I] Lean DLL Path:
[04/05/2024-12:27:54] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[04/05/2024-12:27:54] [I] Exclude Lean Runtime: Disabled
[04/05/2024-12:27:54] [I] Sparsity: Disabled
[04/05/2024-12:27:54] [I] Safe mode: Disabled
[04/05/2024-12:27:54] [I] Build DLA standalone loadable: Disabled
[04/05/2024-12:27:54] [I] Allow GPU fallback for DLA: Disabled
[04/05/2024-12:27:54] [I] DirectIO mode: Disabled
[04/05/2024-12:27:54] [I] Restricted mode: Disabled
[04/05/2024-12:27:54] [I] Skip inference: Disabled
[04/05/2024-12:27:54] [I] Save engine: /workspace/model.plan
[04/05/2024-12:27:54] [I] Load engine:
[04/05/2024-12:27:54] [I] Profiling verbosity: 0
[04/05/2024-12:27:54] [I] Tactic sources: Using default tactic sources
[04/05/2024-12:27:54] [I] timingCacheMode: local
[04/05/2024-12:27:54] [I] timingCacheFile:
[04/05/2024-12:27:54] [I] Heuristic: Disabled
[04/05/2024-12:27:54] [I] Preview Features: Use default preview flags.
[04/05/2024-12:27:54] [I] MaxAuxStreams: -1
[04/05/2024-12:27:54] [I] BuilderOptimizationLevel: -1
[04/05/2024-12:27:54] [I] Input(s)s format: fp32:CHW
[04/05/2024-12:27:54] [I] Output(s)s format: fp32:CHW
[04/05/2024-12:27:54] [I] Input build shape: input_ids=1x1+4x128+64x512
[04/05/2024-12:27:54] [I] Input build shape: attention_mask=1x1+4x128+64x512
[04/05/2024-12:27:54] [I] Input calibration shapes: model
[04/05/2024-12:27:54] [I] === System Options ===
[04/05/2024-12:27:54] [I] Device: 0
[04/05/2024-12:27:54] [I] DLACore:
[04/05/2024-12:27:54] [I] Plugins:
[04/05/2024-12:27:54] [I] setPluginsToSerialize:
[04/05/2024-12:27:54] [I] dynamicPlugins:
[04/05/2024-12:27:54] [I] ignoreParsedPluginLibs: 0
[04/05/2024-12:27:54] [I]
[04/05/2024-12:27:54] [I] === Inference Options ===
[04/05/2024-12:27:54] [I] Batch: Explicit
[04/05/2024-12:27:54] [I] Input inference shape: attention_mask=4x128
[04/05/2024-12:27:54] [I] Input inference shape: input_ids=4x128
[04/05/2024-12:27:54] [I] Iterations: 10
[04/05/2024-12:27:54] [I] Duration: 3s (+ 200ms warm up)
[04/05/2024-12:27:54] [I] Sleep time: 0ms
[04/05/2024-12:27:54] [I] Idle time: 0ms
[04/05/2024-12:27:54] [I] Inference Streams: 1
[04/05/2024-12:27:54] [I] ExposeDMA: Disabled
[04/05/2024-12:27:54] [I] Data transfers: Enabled
[04/05/2024-12:27:54] [I] Spin-wait: Disabled
[04/05/2024-12:27:54] [I] Multithreading: Disabled
[04/05/2024-12:27:54] [I] CUDA Graph: Disabled
[04/05/2024-12:27:54] [I] Separate profiling: Disabled
[04/05/2024-12:27:54] [I] Time Deserialize: Disabled
[04/05/2024-12:27:54] [I] Time Refit: Disabled
[04/05/2024-12:27:54] [I] NVTX verbosity: 0
[04/05/2024-12:27:54] [I] Persistent Cache Ratio: 0
[04/05/2024-12:27:54] [I] Inputs:
[04/05/2024-12:27:54] [I] === Reporting Options ===
[04/05/2024-12:27:54] [I] Verbose: Disabled
[04/05/2024-12:27:54] [I] Averages: 10 inferences
[04/05/2024-12:27:54] [I] Percentiles: 90,95,99
[04/05/2024-12:27:54] [I] Dump refittable layers:Disabled
[04/05/2024-12:27:54] [I] Dump output: Disabled
[04/05/2024-12:27:54] [I] Profile: Disabled
[04/05/2024-12:27:54] [I] Export timing to JSON file:
[04/05/2024-12:27:54] [I] Export output to JSON file:
[04/05/2024-12:27:54] [I] Export profile to JSON file:
[04/05/2024-12:27:54] [I]
[04/05/2024-12:27:54] [I] === Device Information ===
[04/05/2024-12:27:54] [I] Selected Device: NVIDIA A10
[04/05/2024-12:27:54] [I] Compute Capability: 8.6
[04/05/2024-12:27:54] [I] SMs: 72
[04/05/2024-12:27:54] [I] Device Global Memory: 22731 MiB
[04/05/2024-12:27:54] [I] Shared Memory per SM: 100 KiB
[04/05/2024-12:27:54] [I] Memory Bus Width: 384 bits (ECC enabled)
[04/05/2024-12:27:54] [I] Application Compute Clock Rate: 1.695 GHz
[04/05/2024-12:27:54] [I] Application Memory Clock Rate: 6.251 GHz
[04/05/2024-12:27:54] [I]
[04/05/2024-12:27:54] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[04/05/2024-12:27:54] [I]
[04/05/2024-12:27:54] [I] TensorRT version: 8.6.1
[04/05/2024-12:27:54] [I] Loading standard plugins
[04/05/2024-12:27:55] [I] [TRT] [MemUsageChange] Init CUDA: CPU +520, GPU +0, now: CPU 537, GPU 13924 (MiB)
[04/05/2024-12:28:01] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1436, GPU +266, now: CPU 2050, GPU 14190 (MiB)
[04/05/2024-12:28:01] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[04/05/2024-12:28:01] [I] Start parsing network model.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1118829273
[04/05/2024-12:28:09] [I] [TRT] ----------------------------------------------------------------
[04/05/2024-12:28:09] [I] [TRT] Input filename:   /workspace/bce-emb.onnx
[04/05/2024-12:28:09] [I] [TRT] ONNX IR version:  0.0.8
[04/05/2024-12:28:09] [I] [TRT] Opset version:    17
[04/05/2024-12:28:09] [I] [TRT] Producer name:    pytorch
[04/05/2024-12:28:09] [I] [TRT] Producer version: 2.1.2
[04/05/2024-12:28:09] [I] [TRT] Domain:
[04/05/2024-12:28:09] [I] [TRT] Model version:    0
[04/05/2024-12:28:09] [I] [TRT] Doc string:
[04/05/2024-12:28:09] [I] [TRT] ----------------------------------------------------------------
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1118829273
[04/05/2024-12:28:11] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[04/05/2024-12:28:12] [I] Finished parsing network model. Parse time: 10.2094
[04/05/2024-12:28:12] [I] [TRT] Graph optimization time: 0.0657748 seconds.
[04/05/2024-12:28:12] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[04/05/2024-12:28:28] [I] [TRT] Detected 2 inputs and 2 output network tensors.
[04/05/2024-12:28:31] [I] [TRT] Total Host Persistent Memory: 48
[04/05/2024-12:28:31] [I] [TRT] Total Device Persistent Memory: 0
[04/05/2024-12:28:31] [I] [TRT] Total Scratch Memory: 2114454528
[04/05/2024-12:28:31] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1060 MiB, GPU 3512 MiB
[04/05/2024-12:28:31] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 2 steps to complete.
[04/05/2024-12:28:31] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.013715ms to assign 2 blocks to 2 nodes requiring 2114455040 bytes.
[04/05/2024-12:28:31] [I] [TRT] Total Activation Memory: 2114455040
[04/05/2024-12:28:31] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +2048, now: CPU 0, GPU 2048 (MiB)
[04/05/2024-12:28:39] [I] Engine built in 44.7706 sec.
[04/05/2024-12:28:40] [I] [TRT] Loaded engine size: 1063 MiB
[04/05/2024-12:28:40] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1060, now: CPU 0, GPU 1060 (MiB)
[04/05/2024-12:28:40] [I] Engine deserialized in 0.121818 sec.
[04/05/2024-12:28:40] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2017, now: CPU 0, GPU 3077 (MiB)
[04/05/2024-12:28:40] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[04/05/2024-12:28:40] [I] Setting persistentCacheLimit to 0 bytes.
[04/05/2024-12:28:40] [I] Using random values for input input_ids
[04/05/2024-12:28:40] [I] Input binding for input_ids with dimensions 4x128 is created.
[04/05/2024-12:28:40] [I] Using random values for input attention_mask
[04/05/2024-12:28:40] [I] Input binding for attention_mask with dimensions 4x128 is created.
[04/05/2024-12:28:40] [I] Output binding for logits with dimensions 4x128x768 is created.
[04/05/2024-12:28:40] [I] Output binding for 1488 with dimensions 4x768 is created.
[04/05/2024-12:28:40] [I] Starting inference
[04/05/2024-12:28:43] [I] Warmup completed 44 queries over 200 ms
[04/05/2024-12:28:43] [I] Timing trace has 634 queries over 3.01157 s
[04/05/2024-12:28:43] [I]
[04/05/2024-12:28:43] [I] === Trace details ===
[04/05/2024-12:28:43] [I] Trace averages of 10 runs:
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71777 ms - Host latency: 4.8111 ms (enqueue 4.68778 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.69606 ms - Host latency: 4.78887 ms (enqueue 4.66904 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71172 ms - Host latency: 4.80529 ms (enqueue 4.6842 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71255 ms - Host latency: 4.80536 ms (enqueue 4.68546 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71368 ms - Host latency: 4.8067 ms (enqueue 4.68657 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71378 ms - Host latency: 4.8076 ms (enqueue 4.68813 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.7095 ms - Host latency: 4.80175 ms (enqueue 4.68302 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71083 ms - Host latency: 4.80322 ms (enqueue 4.68549 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70866 ms - Host latency: 4.80078 ms (enqueue 4.68127 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71213 ms - Host latency: 4.80558 ms (enqueue 4.68696 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.7102 ms - Host latency: 4.80333 ms (enqueue 4.68237 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.7098 ms - Host latency: 4.80272 ms (enqueue 4.68302 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71245 ms - Host latency: 4.80533 ms (enqueue 4.68322 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.79283 ms - Host latency: 4.88607 ms (enqueue 4.73782 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 5.08766 ms - Host latency: 5.18107 ms (enqueue 5.05737 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.84741 ms - Host latency: 4.94027 ms (enqueue 4.84573 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.74828 ms - Host latency: 4.84208 ms (enqueue 4.72477 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71704 ms - Host latency: 4.81016 ms (enqueue 4.68956 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70856 ms - Host latency: 4.80144 ms (enqueue 4.6801 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.711 ms - Host latency: 4.80448 ms (enqueue 4.68986 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71707 ms - Host latency: 4.8125 ms (enqueue 4.68035 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71266 ms - Host latency: 4.80608 ms (enqueue 4.68488 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71327 ms - Host latency: 4.8064 ms (enqueue 4.68513 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70937 ms - Host latency: 4.80234 ms (enqueue 4.684 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71021 ms - Host latency: 4.80277 ms (enqueue 4.68435 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70916 ms - Host latency: 4.80209 ms (enqueue 4.68029 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71064 ms - Host latency: 4.80372 ms (enqueue 4.68414 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71089 ms - Host latency: 4.80433 ms (enqueue 4.68595 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71422 ms - Host latency: 4.80765 ms (enqueue 4.68622 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71155 ms - Host latency: 4.80374 ms (enqueue 4.68427 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70978 ms - Host latency: 4.80316 ms (enqueue 4.68428 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71857 ms - Host latency: 4.81082 ms (enqueue 4.68978 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71797 ms - Host latency: 4.8114 ms (enqueue 4.68748 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.80277 ms - Host latency: 4.89617 ms (enqueue 4.76615 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.90116 ms - Host latency: 4.99402 ms (enqueue 4.87052 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.84751 ms - Host latency: 4.94182 ms (enqueue 4.83071 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.75146 ms - Host latency: 4.84404 ms (enqueue 4.72413 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.74745 ms - Host latency: 4.84027 ms (enqueue 4.72244 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.74814 ms - Host latency: 4.84119 ms (enqueue 4.72034 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.75085 ms - Host latency: 4.8439 ms (enqueue 4.72666 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70796 ms - Host latency: 4.79934 ms (enqueue 4.68162 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71643 ms - Host latency: 4.80862 ms (enqueue 4.68887 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70688 ms - Host latency: 4.79854 ms (enqueue 4.67922 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70933 ms - Host latency: 4.8021 ms (enqueue 4.68403 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71401 ms - Host latency: 4.80779 ms (enqueue 4.68694 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70757 ms - Host latency: 4.80063 ms (enqueue 4.67991 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70662 ms - Host latency: 4.79973 ms (enqueue 4.67981 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71475 ms - Host latency: 4.80798 ms (enqueue 4.68315 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.7467 ms - Host latency: 4.83914 ms (enqueue 4.71975 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.73523 ms - Host latency: 4.828 ms (enqueue 4.70801 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.74326 ms - Host latency: 4.83728 ms (enqueue 4.71604 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.72288 ms - Host latency: 4.81548 ms (enqueue 4.6989 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.74504 ms - Host latency: 4.83687 ms (enqueue 4.71489 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.76904 ms - Host latency: 4.86096 ms (enqueue 4.73633 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.82097 ms - Host latency: 4.91309 ms (enqueue 4.79429 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.75395 ms - Host latency: 4.84707 ms (enqueue 4.73091 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.78035 ms - Host latency: 4.87405 ms (enqueue 4.74929 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.76145 ms - Host latency: 4.85464 ms (enqueue 4.73542 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.77607 ms - Host latency: 4.86899 ms (enqueue 4.74966 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.76257 ms - Host latency: 4.85547 ms (enqueue 4.73748 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.75317 ms - Host latency: 4.84756 ms (enqueue 4.72434 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.7521 ms - Host latency: 4.84453 ms (enqueue 4.72729 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.74985 ms - Host latency: 4.84331 ms (enqueue 4.72151 ms)
[04/05/2024-12:28:43] [I]
[04/05/2024-12:28:43] [I] === Performance summary ===
[04/05/2024-12:28:43] [I] Throughput: 210.521 qps
[04/05/2024-12:28:43] [I] Latency: min = 4.78314 ms, max = 5.24432 ms, mean = 4.8347 ms, median = 4.80896 ms, percentile(90%) = 4.88477 ms, percentile(95%) = 4.93652 ms, percentile(99%) = 5.16418 ms
[04/05/2024-12:28:43] [I] Enqueue Time: min = 4.5 ms, max = 5.13123 ms, mean = 4.71434 ms, median = 4.69315 ms, percentile(90%) = 4.76709 ms, percentile(95%) = 4.8573 ms, percentile(99%) = 5.03857 ms
[04/05/2024-12:28:43] [I] H2D Latency: min = 0.00610352 ms, max = 0.0211182 ms, mean = 0.00693844 ms, median = 0.00683594 ms, percentile(90%) = 0.00756836 ms, percentile(95%) = 0.0078125 ms, percentile(99%) = 0.00830078 ms
[04/05/2024-12:28:43] [I] GPU Compute Time: min = 4.68994 ms, max = 5.15076 ms, mean = 4.74169 ms, median = 4.71545 ms, percentile(90%) = 4.79224 ms, percentile(95%) = 4.84351 ms, percentile(99%) = 5.07086 ms
[04/05/2024-12:28:43] [I] D2H Latency: min = 0.081543 ms, max = 0.0933533 ms, mean = 0.0860713 ms, median = 0.0859375 ms, percentile(90%) = 0.0877686 ms, percentile(95%) = 0.0881348 ms, percentile(99%) = 0.0895996 ms
[04/05/2024-12:28:43] [I] Total Host Walltime: 3.01157 s
[04/05/2024-12:28:43] [I] Total GPU Compute Time: 3.00623 s
[04/05/2024-12:28:43] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[04/05/2024-12:28:43] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[04/05/2024-12:28:43] [W] * GPU compute time is unstable, with coefficient of variance = 1.33615%.
[04/05/2024-12:28:43] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[04/05/2024-12:28:43] [I] Explanations of the performance metrics are printed in the verbose logs.

I've gone through the log, and the only anomaly is the INT64-to-INT32 warning. Have you run into this problem before? Could you share some ideas on where to look? Many thanks.
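
One direct way to test the INT64 theory (a sketch using the onnx Python package): the downcast is lossless unless some INT64 initializer actually holds values outside the int32 range, which this would reveal.

import numpy as np
import onnx
from onnx import numpy_helper

m = onnx.load("model17.onnx")
int32_max = np.iinfo(np.int32).max
for init in m.graph.initializer:
    if init.data_type == onnx.TensorProto.INT64:
        arr = numpy_helper.to_array(init)
        if arr.size and np.abs(arr).max() > int32_max:
            # Any hit here means the INT32 cast really does corrupt weights.
            print("overflows int32:", init.name, int(arr.min()), int(arr.max()))

If nothing prints, the warning is benign and the accuracy drop more likely comes from --fp16 or from the dynamic-shape profile.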

The same happens when converting bce-rerank: the resulting plan model matches PyTorch on a single pair:

[["what is panda", "panda is an animal"]] gives results consistent with PyTorch's. But for [["what is panda", "panda is an animal"], ["what is panda", "panda is an animal"]], the raw results from Triton inference are completely different from PyTorch's, which is quite strange... Is there an open-source ONNX or plan model for the reranker? If I manage to solve the problems above, I'd be happy to contribute ready-to-use converted models. Many thanks for any answers!

shenlei1020 commented 4 months ago

This looks like a problem with batched inference. Try fixing the seq length to 512.
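
At the tokenizer level that could look like the sketch below (HF tokenizer assumed); the engine would then be built with min, opt, and max sequence length all set to 512 so it only ever sees one shape.

# Pad every request to a fixed 512 tokens instead of dynamic lengths.
enc = tok(queries, passages,
          padding="max_length", max_length=512, truncation=True,
          return_tensors="np")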

cherishhh commented 4 months ago

TensorRT 8.6.1 has problems with multi-batch inference; switching to version 9.3 fixed it.

Gcstk commented 4 months ago

TensorRT 8.6.1 has problems with multi-batch inference; switching to version 9.3 fixed it.

Thanks, I'll go test that. Is this related to the ONNX version at all? And which opset version did you use?

cherishhh commented 4 months ago

Feel free to add me on WeChat (15927303165); I'm also working on this conversion.