rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

kNN Classifier Accuracy deviating from scikit-learn [BUG] #5773

Open evanhowington opened 9 months ago

evanhowington commented 9 months ago

Describe the bug I was comparing the results of my work after converting it from scikit-learn to cuML, specifically for kNN classification. With cuML, when I run a test size of 10%, my test accuracy crosses above my training accuracy around k=100, but the same code run on plain scikit-learn keeps the accuracy curves strictly separated with no crossover. When I increase the test size to 20%, I get the opposite result: my cuML accuracy curves stay strictly separated and my scikit-learn curves begin crossing over around k=60. I will include a screenshot in the attachments.

Steps/Code to reproduce bug I have provided both versions of the code, using cuML and scikit-learn; a rough sketch of the comparison is below.
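
The actual scripts and data are in the attached zip; purely as an illustration, the comparison is roughly of this shape (the synthetic X/y, the k range, and the import aliases are placeholders, not the real dataset):

```python
# Rough sketch of the comparison (placeholder data, not the real dataset):
# sweep k and record train/test accuracy for both libraries.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as skKNN
from cuml.neighbors import KNeighborsClassifier as cuKNN

X = np.random.rand(10_000, 20).astype(np.float32)  # placeholder features
y = (X[:, 0] > 0.5).astype(np.int32)                # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

for k in range(10, 201, 10):
    for name, KNN in (("sklearn", skKNN), ("cuml", cuKNN)):
        model = KNN(n_neighbors=k).fit(X_train, y_train)
        print(name, k,
              "train:", model.score(X_train, y_train),
              "test:", model.score(X_test, y_test))
```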

Expected behavior I would expect the accuracy to be roughly the same with cuML and scikit-learn; however, I am seeing deviations.

Environment details (please complete the following information):

aethyn@pop-os:~$ neofetch
OS: Pop!_OS 22.04 LTS x86_64
Kernel: 6.6.10-76060610-generic
Uptime: 5 hours, 9 mins
Packages: 1982 (dpkg), 25 (flatpak)
Shell: bash 5.1.16
Resolution: 3840x2160, 3840x2160, 3840x2160
DE: GNOME 42.5
WM: Mutter
WM Theme: Pop
Theme: Pop-dark [GTK2/3]
Icons: Pop [GTK2/3]
Terminal: gnome-terminal
CPU: AMD Ryzen 9 7950X (32) @ 5.881GHz
GPU: AMD ATI 6c:00.0 Device 164e
GPU: NVIDIA 01:00.0 NVIDIA Corporation Device 2684
Memory: 18801MiB / 63423MiB

Additional context duplicate_this.zip

bdice commented 9 months ago

@evanhowington Thanks for the issue! You mentioned on Slack that the zip file with your data wasn't uploaded. Can you try that again? There is a 25 MB file size limit for zip files, so you may need to split up the data (you mentioned the size was a few megabytes).

evanhowington commented 9 months ago

@bdice I updated the original post to include the zip file at the bottom of it under "Additional Context".

evanhowington commented 9 months ago

I did some digging, and it appears scikit-learn uses a NumPy random state instance, while cuML uses a CuPy random state instance by default with an option to use a NumPy random state instance. https://scikit-learn.org/stable/glossary.html#term-random_state https://docs.rapids.ai/api/cuml/stable/api/#preprocessing-metrics-and-utilities

I have not had a chance to test the NumPy random state instance with cuML yet. I'm still trying to figure out how to invoke the optional NumPy random state instance in cuML. Is it just a matter of passing a numpy.random.RandomState instance, e.g. random_state=numpy.random.RandomState(42)?
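
For what it's worth, what I have in mind is passing the instance directly to cuML's train_test_split; I haven't verified that this is accepted, so treat it as a guess at the API:

```python
import numpy as np
from cuml.model_selection import train_test_split

X = np.random.rand(1_000, 20).astype(np.float32)  # placeholder data
y = (X[:, 0] > 0.5).astype(np.int32)

# Guess: pass a NumPy RandomState *instance* rather than the class itself.
rs = np.random.RandomState(42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=rs
)
```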

If it is the random_state causing the discrepancy, perhaps something like train_test_split(X, y, test_size=0.1, random_state=42, random_state_environment={"cupy", "numpy"}), where one specifies where to pull the random state from, would help. Also, maybe the default could be numpy so that the results would match up for someone running the same code on scikit-learn, with the option to choose cupy. I only suggest that because, if the desire is for cuML to produce equivalent results out of the box while offering a speedup, scikit-learn can't always call a CuPy random state on all devices, so the cuML default could be a NumPy random state for the sake of reproducible results.

dantegd commented 8 months ago

Thanks for the issue @evanhowington, I had written a response and closed my tab before submitting :(.

The issue is very likely not coming from the random state, whether from numpy or cupy. I haven't tested it myself yet, but given the differences in the parallel/CUDA code, it might just be an inherent difference between the implementations.
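
Not tested, but one way to rule the random state in or out would be to do the split once and feed identical arrays to both estimators; if the accuracies still diverge, the difference is in the kNN implementations themselves rather than in the split. A rough sketch (placeholder data, assumed aliases):

```python
# Sketch: take the split out of the equation by splitting once (with
# scikit-learn here) and fitting both classifiers on identical arrays.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as skKNN
from cuml.neighbors import KNeighborsClassifier as cuKNN

X = np.random.rand(10_000, 20).astype(np.float32)  # placeholder data
y = (X[:, 0] > 0.5).astype(np.int32)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

k = 100
sk_acc = skKNN(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
cu_acc = cuKNN(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
print("sklearn:", sk_acc, "cuml:", cu_acc)
```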