rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] Different result while converting pytorch tensor using .numpy() and dl_pack #3091

Open HilbertXu opened 4 years ago

HilbertXu commented 4 years ago

Hi,

Thanks for your amazing libraries!

I am currently doing research that uses DBSCAN to segment car instances from a 3D point cloud. However, I find that when I convert a PyTorch tensor to a DataFrame in two different ways, DBSCAN returns different results.

1. I use dl_pack as a bridge

import cudf
from torch.utils.dlpack import to_dlpack, from_dlpack
from cuml.cluster import DBSCAN as cumlDBSCAN

# car_pts is a PyTorch tensor holding the point cloud
car_pts_tmp = to_dlpack(car_pts)
car_pts_gpu = cudf.from_dlpack(car_pts_tmp)
db_gpu = cumlDBSCAN(eps=1, min_samples=10)
result = db_gpu.fit(car_pts_gpu)
# convert the fitted labels back to a PyTorch tensor
instance_labels = from_dlpack(result.labels_.to_dlpack())

Here is the DataFrame I used:

Screen Shot 2020-10-30 at 8 07 51 PM

Here is the visualization of the DBSCAN result; it may be a little hard to make out:

Screen Shot 2020-10-30 at 8 12 56 PM

2. I directly convert the torch tensor to a NumPy array (using tensor.numpy()) and use it as the input to DBSCAN

It returns the result I want:

Screen Shot 2020-10-30 at 8 15 34 PM

I set the same DBSCAN hyperparameters but got totally different results. I have no idea what causes this, since the converted DataFrame looks reasonable. I hope to get some help here~

BTW: Can anyone give me some ideas on how to add this clustering step, using the cuML libraries, to an auto-diff framework like PyTorch and make it end-to-end?

miroenev commented 3 years ago

Hey @HilbertXu thanks for your kind words!

I'm having some difficulty replicating the error you are seeing. Are you able to provide a sample reproducer script?

One thing to look out for is the column order during dlpack conversions as captured in the UserWarning below. Have you tried transposing the car data you have in your example?

/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/io/dlpack.py:33: UserWarning: WARNING: cuDF from_dlpack() assumes column-major (Fortran order) input. If the input tensor is row-major, transpose it before passing it to this function.
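
To make the layout concern in that warning concrete, here is a minimal plain-Python sketch (no GPU libraries involved). It assumes a consumer that reads a flat buffer as column-major without checking how it was actually written; `read_as_column_major` and the 4 x 3 toy table are hypothetical names and data used only for illustration, not cuDF internals.

```python
# Illustrative only: a plain-Python stand-in for a (4, 3) point buffer.
rows, cols = 4, 3
points = [[float(r * cols + c) for c in range(cols)] for r in range(rows)]

# Row-major (C order) flattening: each row is stored contiguously.
# This is the default layout of a contiguous PyTorch tensor.
row_major = [points[r][c] for r in range(rows) for c in range(cols)]

# Column-major (Fortran order) flattening: each column is stored contiguously.
col_major = [points[r][c] for c in range(cols) for r in range(rows)]


def read_as_column_major(flat, rows, cols):
    """Rebuild a rows x cols table from `flat`, assuming column-major storage."""
    return [[flat[c * rows + r] for c in range(cols)] for r in range(rows)]


misread = read_as_column_major(row_major, rows, cols)  # wrong layout assumption
correct = read_as_column_major(col_major, rows, cols)  # matching layout

print(misread[0])  # [0.0, 4.0, 8.0]: values from three different points
print(correct[0])  # [0.0, 1.0, 2.0]: the original first point
```

The misread table keeps the shape of the original, so it can look plausible at a glance even though each row now mixes coordinates from different points.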

I've attached a small demo notebook (zipped) that I used to test the dlpack conversions between pytorch and cudf/cuml. pytorch_dlpack.zip

Feel free to get back to us with additional questions. Thanks!

HilbertXu commented 3 years ago

Hey @miroenev thanks for your help!

Here I upload the data and script I used to test. It reproduces the strange error I met; I hope it provides some useful information.

script&data.zip

test_cuml-Jupyter-Notebook

BTW: If I want to use DBSCAN and KMeans in an end-to-end PyTorch network, what should I do to implement batch processing rather than iterating over every sample in a batch? Using a for loop in a PyTorch module is too slow...

miroenev commented 3 years ago

Thanks for sharing the additional data and notebook @HilbertXu,

I was indeed able to reproduce your issue, and there seems to be something strange about going through a DLPack conversion prior to feeding DBSCAN.

Hope to pass this along to the team so we can take a deeper look at this ASAP. I'll also try to get more info regarding batch-processing.

Minimal reproducer code snippet below; Re-attached notebook and data. dbscan_dlpack_repro.zip

import torch
import cuml
import cudf
import numpy as np
from torch.utils.dlpack import to_dlpack, from_dlpack

from cuml.cluster import DBSCAN as cumlDBSCAN

## load data, create multiple representations
car_pts = np.load('car_pts.npz')['car_pts']
car_pts_cudf = cudf.DataFrame(car_pts)
car_pts_tensor = torch.tensor(car_pts)
car_pts_dlpack_cudf = cudf.from_dlpack(
    torch.utils.dlpack.to_dlpack(car_pts_tensor)
)

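# DataFrame == DataFrame is element-wise; .equals() gives the single boolean that assert needs
assert car_pts_dlpack_cudf.equals(car_pts_cudf)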

## DBScan with numpy.ndarray and cudf.DataFrame produces identical results
eps = 0.8
min_samples = 10

labels_np = cumlDBSCAN(eps=eps, min_samples=min_samples).fit_predict(car_pts)
unique_elements, counts = np.unique(labels_np, return_counts=True)
print(np.asarray((unique_elements, counts)))

labels_cudf = cumlDBSCAN(eps=eps, min_samples=min_samples).fit_predict(car_pts_cudf)
print(labels_cudf.value_counts())

## going through a dlpack conversion and back causes results to diverge
labels_cudf = cumlDBSCAN(eps=eps, min_samples=min_samples).fit_predict(car_pts_dlpack_cudf)
print(labels_cudf.value_counts())
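
The printed value counts above can also be compared programmatically. The sketch below is one hedged way to do that in plain Python. Cluster ids are arbitrary, so it compares sorted cluster sizes and the noise count (label -1) instead of raw labels. `cluster_size_profile` and the three label lists are hypothetical stand-ins, not output from the reproducer.

```python
from collections import Counter


def cluster_size_profile(labels):
    """Return (cluster sizes sorted largest first, number of noise points).

    Noise points are the ones labelled -1, following the DBSCAN convention.
    """
    counts = Counter(labels)
    noise = counts.pop(-1, 0)
    return sorted(counts.values(), reverse=True), noise


# Hypothetical label vectors standing in for the outputs of two code paths.
labels_a = [0, 0, 0, 1, 1, -1]
labels_b = [1, 1, 1, 0, 0, -1]  # same partition as labels_a, ids swapped
labels_c = [0, 0, 0, 0, 0, 0]   # a genuinely different clustering

print(cluster_size_profile(labels_a) == cluster_size_profile(labels_b))  # True
print(cluster_size_profile(labels_a) == cluster_size_profile(labels_c))  # False
```

Matching profiles are a necessary condition for identical clusterings, not a sufficient one, so this is only a quick first check. To use it with the reproducer, the label containers would first need converting to host-side Python lists.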

p.s. Do you mind editing your last post to link to the screen capture rather than embedding it in the comment body? It would be great to not overwhelm future readers with scrolling graphics. Thanks in advance!

HilbertXu commented 3 years ago

Hey @miroenev, thanks for your quick reply!

I've changed the screen capture into a link. I'm glad to have found a potential bug and made a small contribution.

I'm currently integrating some clustering algorithms, such as DBSCAN and KMeans, into my project, which uses the PyTorch framework. However, converting between PyTorch tensors and cuDF objects is really time-consuming, especially when you have to use a "for" loop to iterate over the whole batch rather than doing batch-wise processing. Could you please provide some simple and clear examples of integrating cuML/cuDF with PyTorch? I would be grateful for any help.

miroenev commented 3 years ago
EvenOldridge commented 3 years ago

I'm wondering if this isn't related to an issue we're currently chasing down with the dataloaders. @jperez999 for context.

@HilbertXu if you're doing looping within PyTorch it's possible that the buffers you're using are tensor views and are being overwritten by the next batch. A quick way to test this is to add an explicit clone of the tensor before calling to_dlpack.

That may not be what you're facing, but it's worth a try.
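
The aliasing concern above can be illustrated without a GPU. The sketch below is an analogy using only the Python standard library, not PyTorch or DLPack: `memoryview` stands in for a zero-copy export, and copying the array stands in for an explicit `clone()`. The names `batch_buffer`, `shared_view` and `snapshot` are hypothetical.

```python
from array import array

# Stand-in for a buffer that a data loader reuses from batch to batch.
batch_buffer = array("f", [1.0, 2.0, 3.0])

# Zero-copy view: shares memory with batch_buffer, as a zero-copy export would.
shared_view = memoryview(batch_buffer)

# Explicit copy: owns its own memory, as a cloned tensor would.
snapshot = array("f", batch_buffer)

# The "next batch" overwrites the buffer in place.
batch_buffer[0] = 99.0

print(shared_view[0])  # 99.0: the view silently sees the new data
print(snapshot[0])     # 1.0: the copy still holds the original batch
```

If cloning before the conversion changes the clustering result, that points toward buffer reuse; if it does not, the layout question raised below is the more likely candidate.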

In terms of your visualization at the top, is it possible that you're passing the tensors as row-major? It might be worth trying a transpose of the data before you pass it to see if that clears up the issue.

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

allen123wzh commented 2 years ago

Hi, has this issue been fixed? I'm experiencing the same issue: when I convert a torch tensor output to DLPack and then to cuDF, passing the resulting cuDF DataFrame to cuML DBSCAN returns a similarly strange result. Thanks.