Phuoc-Hoan-Le closed this 1 year ago
Interesting, I guess there is some variability across GPUs / setups there.
What exactly do you mean by "bounded to GPU". Can you send a code snippet or PR?
A code example can be found at https://onnxruntime.ai/docs/api/python/api_summary.html#data-on-device , which shows how to bind your inputs/outputs to the GPU.
And I get
Whereas if I replace the line
wtp = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])
with

I get
The PyTorch implementation is only slower on average because of the outlier from the first run; removing that initial warm-up outlier makes PyTorch on average faster than the ONNX run.
I see the inputs are not bound to the GPU in https://github.com/bminixhofer/wtpsplit/blob/main/wtpsplit/extract.py. Could you please try binding them to see if it is faster?