siddharth-sharma7 / fast-Bart

Convert BART models to ONNX with quantization. 3x reduction in size and up to 3x boost in inference speed.

Unable to use CUDAExecutionProvider for inference #2

Open girishnadiger-gep opened 2 years ago

girishnadiger-gep commented 2 years ago

Hi @siddharth-sharma7, your package is great and very easy to use, but I'm unable to figure out how to actually use CUDAExecutionProvider and run inference on the GPU. Whenever I pass providers=['CUDAExecutionProvider'], the model is still not loaded onto the GPU and inference still happens on the CPU.
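
A minimal sketch of the kind of setup being described here, using plain onnxruntime rather than fast-Bart's own loader (the model path is a placeholder, not part of fast-Bart's API):

```python
import onnxruntime as ort

# Request CUDA first, with CPU as a fallback; "bart-model.onnx" is a
# placeholder for one of the exported ONNX files.
sess = ort.InferenceSession(
    "bart-model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# If the CUDA provider wasn't picked up, this prints ['CPUExecutionProvider'].
print(sess.get_providers())
```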

sidsharma72 commented 2 years ago

For GPU inference, you would have to use onnxruntime-gpu. I haven't used the ONNX BART models on GPU myself, but in their current shape and form they wouldn't work directly.
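
One quick way to check whether the GPU build of onnxruntime is actually the one being loaded (the CPU-only package silently ignores a CUDAExecutionProvider request) is:

```python
import onnxruntime as ort

# 'GPU' indicates the onnxruntime-gpu build; 'CPU' means the CPU-only package is in use.
print(ort.get_device())

# 'CUDAExecutionProvider' should appear here if CUDA/cuDNN are set up correctly.
print(ort.get_available_providers())
```

A common gotcha is having both onnxruntime and onnxruntime-gpu installed in the same environment; in that case the CPU-only package can shadow the GPU build, and uninstalling both before reinstalling only onnxruntime-gpu usually helps.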

girishnadiger-gep commented 2 years ago

Hi @sidsharma72, I've tried that with onnxruntime-gpu and I'm running into the same issue. I agree that it doesn't work directly. I'm working on this and will contribute back to this repo if I get something meaningful working in an adapted version of fast-Bart.
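
As a rough illustration of the adaptation being described (names here are hypothetical, not fast-Bart's actual API), the idea would be to thread a providers list through to wherever the InferenceSessions are created:

```python
import onnxruntime as ort

def load_session(onnx_path: str, use_cuda: bool = True) -> ort.InferenceSession:
    # Hypothetical helper: an adapted fast-Bart would need to pass a providers
    # list like this to each of its encoder/decoder sessions for GPU to take effect.
    providers = (
        ["CUDAExecutionProvider", "CPUExecutionProvider"]
        if use_cuda
        else ["CPUExecutionProvider"]
    )
    return ort.InferenceSession(onnx_path, providers=providers)
```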

Luckick commented 11 months ago

> Hi @sidsharma72, I've tried that with onnxruntime-gpu and I'm running into the same issue. I agree that it doesn't work directly. I'm working on this and will contribute back to this repo if I get something meaningful working in an adapted version of fast-Bart.

Hi @girishnadiger-gep, do you have any updates on GPU inference? Thanks!