Hello, I'm trying to use the GPU with the PyTorch backend to speed up a Kernel PCA operation. However, when I convert to PyTorch, the .transform() call takes about 9x longer to run, and I'm not seeing any GPU utilization at all.
Sklearn: 0.8 seconds
PyTorch + CPU: 7.8 seconds
PyTorch + GPU (supposedly, but again, not seeing any GPU utilization): 7.9 seconds
Would it be possible for you to look into this? I have already checked that CUDA is configured correctly with torch.cuda.is_available(). Thanks!
Hello! Can you please post what version of torch you are using, and some of your code so we can troubleshoot?
PyTorch version: 1.6.0+cu101
print(torch.cuda.is_available())  # prints True
KPCA = KernelPCA(n_components=8, kernel='rbf', degree=3, gamma=5e-4)
KPCA.fit_transform(train_data)
KPCA_pytorch = convert(KPCA, 'torch')
KPCA_pytorch.to('cuda')
t1 = time.time()
infer_output = KPCA_pytorch.transform(infer_data)
print('Time to infer:', time.time() - t1)
I expect this model to be slower on torch+cpu than native scikit-learn (due to redundancies introduced by hummingbird, many models are slower on Torch+CPU), but I wanted to dig into your GPU numbers!
I was able to reproduce this with torch==1.6 (and also with torch==1.9, which I was hoping would fix it); you're right, I think the GPU is not being engaged/fully utilized. Thoughts @interesaaat / @scnakandala?
# hardware: NVIDIA P100
# torch 1.6
import torch, time
from hummingbird.ml import convert
print(torch.cuda.is_available())
# True
from sklearn.decomposition import KernelPCA
KPCA = KernelPCA(n_components=8, kernel='rbf', degree=3, gamma=5e-4)
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
X, y = make_circles(n_samples=20000, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
KPCA.fit_transform(X_train)
KPCA_pytorch = convert(KPCA, 'torch')
## This is the CPU version
KPCA_pytorch.to('cpu')
t1 = time.time()
infer_output = KPCA_pytorch.transform(X_test)
t2 = time.time()
print('Time to infer:', t2 - t1)
# Time to infer: 0.43147897720336914
## Now with GPU
KPCA_pytorch.to('cuda')
t1 = time.time()
infer_output = KPCA_pytorch.transform(X_test)
t2 = time.time()
print('Time to infer:', t2 - t1)
# Time to infer: 0.38918399810791016
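As a side note (an illustrative sketch, not part of the original measurements): CUDA work is launched asynchronously and the first GPU call pays one-time setup costs, so a warm-up run plus torch.cuda.synchronize() gives a fairer wall-clock number. Reusing the variables from the script above:
## Fairer GPU timing (illustrative sketch)
KPCA_pytorch.to('cuda')
_ = KPCA_pytorch.transform(X_test)  # warm-up: CUDA context init and other first-call overheads
torch.cuda.synchronize()            # drain any queued GPU work before starting the clock
t1 = time.time()
infer_output = KPCA_pytorch.transform(X_test)
torch.cuda.synchronize()            # make sure the GPU has actually finished
t2 = time.time()
print('Time to infer (synchronized):', t2 - t1)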
I also tried setting the device at conversion time with KPCA_pytorch = convert(KPCA, 'torch', device='cuda'), but that didn't change the outcome.
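One quick way to check whether anything actually lands on the GPU (an illustrative sketch, not from the original thread; it reuses the fitted KPCA from the script above):
## Does the conversion allocate anything on the GPU? (illustrative sketch)
before = torch.cuda.memory_allocated()
KPCA_pytorch = convert(KPCA, 'torch', device='cuda')
after = torch.cuda.memory_allocated()
print('GPU bytes allocated by conversion:', after - before)
# If this prints 0, the model's tensors never left the CPU, which would explain
# why the GPU timing is essentially identical to the CPU timing.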
Thank you for looking into this! Hopefully you can find a way to utilize the GPU. I'm also curious: regarding the Torch+CPU version, is there a way to mitigate the redundancies introduced by Hummingbird? 9x longer than native scikit-learn is a huge difference...
Thanks for reporting the issue! The redundancies with Torch+CPU are required for the model to work with torch/tensors. Often, the Torch+CPU prediction time is the same speed as or faster than SKL, but not in all cases (such as this one)... it really depends on the underlying data structure. It's these same redundancies that lead to the huge speedup when the data is loaded as tensors onto the GPU!
OK, good news: @interesaaat found the bug. We weren't loading the parameters into the model when initializing it, and when there are no parameters we just fall back to CPU.
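For readers who want the gist of the bug: in PyTorch, Module.to('cuda') only moves tensors that are registered as parameters or buffers; plain tensor attributes stay wherever they were created. A minimal sketch of the pattern (illustrative only, not the actual Hummingbird code; the attribute name is made up):
## Illustrative sketch of the bug pattern (not the actual Hummingbird code)
import numpy as np
import torch

class BrokenOp(torch.nn.Module):
    def __init__(self, fit_data):
        super().__init__()
        # Plain attribute: .to('cuda') will NOT move this tensor
        self.fit_data = torch.from_numpy(fit_data)

class FixedOp(torch.nn.Module):
    def __init__(self, fit_data):
        super().__init__()
        # Registered parameter: .to('cuda') moves it together with the module
        self.fit_data = torch.nn.Parameter(torch.from_numpy(fit_data), requires_grad=False)

data = np.random.rand(10, 2).astype(np.float32)
broken, fixed = BrokenOp(data), FixedOp(data)
if torch.cuda.is_available():
    broken.to('cuda')
    fixed.to('cuda')
    print(broken.fit_data.device)  # cpu    -> the computation silently stays on the CPU
    print(fixed.fit_data.device)   # cuda:0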
With this fix (PR + patch release incoming soon), it's faster on Torch+CPU as well. :) This is with torch==1.11, which has significantly improved the CPU kernels:
## This is the sklearn version
t1 = time.time()
infer_output = KPCA.transform(X_test)
t2 = time.time()
print('Time to infer:', t2 - t1)
# Time to infer: 2.1712799072265625
## This is the CPU version
KPCA_pytorch.to('cpu')
t1 = time.time()
infer_output = KPCA_pytorch.transform(X_test)
t2 = time.time()
print('Time to infer:', t2 - t1)
# Time to infer: 0.5567188262939453
## Now with GPU
KPCA_pytorch.to('cuda')
t1 = time.time()
infer_output = KPCA_pytorch.transform(X_test)
t2 = time.time()
print('Time to infer:', t2 - t1)
# Time to infer: 0.021926164627075195
I'll try to get the release out later today. Thanks again for posting this issue @carolinemckee!
Thank you!!
@carolinemckee Please pip install hummingbird-ml==0.4.4 and you should be all set!
Thank you so much! Appreciate the quick fix :)
@ksaur @interesaaat Just some quick feedback from looking at the code: the newly changed lines 40, 42, and 48 of hummingbird/ml/operator_converters/_decomposition_implementations.py, which define the torch.nn.Parameter variables needed to run the model on GPU, don't specify the "requires_grad" argument. When it's not specified, it defaults to True, which means memory is being allocated for backpropagation/autograd. This is unnecessary, since we're only using PyTorch for inference. Adding "requires_grad=False" to those torch.nn.Parameter calls would save a lot of GPU memory and allow users to fit more data on the GPU!
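To make the suggestion concrete (a minimal sketch; the tensors here are stand-ins, not the actual variables in _decomposition_implementations.py):
## requires_grad defaults to True on torch.nn.Parameter; turn it off for inference-only constants
import torch

fit_data = torch.rand(20000, 2)
p_default = torch.nn.Parameter(fit_data)                         # requires_grad=True by default
p_inference = torch.nn.Parameter(fit_data, requires_grad=False)  # no autograd tracking
print(p_default.requires_grad, p_inference.requires_grad)        # True False

# Wrapping inference in torch.no_grad() has a similar effect at call time:
# no autograd graph is recorded, so intermediate results can be freed right away.
with torch.no_grad():
    out = p_inference @ torch.rand(2, 8)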
Good catch @carolinemckee! Let me quickly fix it.