rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

fit modifies X inplace #5482

Closed casonk closed 1 year ago

casonk commented 1 year ago

https://github.com/rapidsai/cuml/blob/0c2a1035378bc8fd6def06f9896fa7361b92864b/python/cuml/linear_model/linear_regression.pyx#L302

The issue surfaced when fitting two models on the same X, Y inputs.

Request: implement a copy_X parameter, set to True by default, as in sklearn.

current workaround:

import cuml
lr = cuml.LinearRegression()
# reg = lr.fit(X, Y) # inplace modification of X
reg = lr.fit(X.copy(), Y) # preserves X

Note: issue may extend to other models
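
The workaround above can be wrapped so that the copy is controlled by a flag, similar in spirit to sklearn's copy_X. This is only a sketch: `fit_with_copy` and `ToyInPlaceRegressor` are hypothetical names, not part of cuML, and the toy estimator merely imitates an estimator that overwrites its input so the example runs without a GPU.

```python
class ToyInPlaceRegressor:
    """Hypothetical stand-in for an estimator whose fit() overwrites X.

    Not cuML code: it centers a 1-D X in place to imitate the symptom.
    """

    def fit(self, X, y):
        n = len(X)
        x_mean = sum(X) / n
        y_mean = sum(y) / n
        for i in range(n):
            X[i] -= x_mean  # in-place modification of the caller's data
        sxy = sum(xc * (yi - y_mean) for xc, yi in zip(X, y))
        sxx = sum(xc * xc for xc in X)
        self.coef_ = sxy / sxx
        self.intercept_ = y_mean - self.coef_ * x_mean
        return self


def fit_with_copy(estimator, X, y, copy_X=True):
    """Fit on a copy of X unless copy_X is False (flag name borrowed from sklearn)."""
    X_fit = X.copy() if copy_X else X
    return estimator.fit(X_fit, y)


X = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]  # y = 2 * x + 1

first = fit_with_copy(ToyInPlaceRegressor(), X, y)
second = fit_with_copy(ToyInPlaceRegressor(), X, y)
print(first.intercept_, second.intercept_)  # identical: X was never touched

fit_with_copy(ToyInPlaceRegressor(), X, y, copy_X=False)  # X is now centered
third = fit_with_copy(ToyInPlaceRegressor(), X, y)
print(third.intercept_)  # differs from the first fit
```

The same helper should work with CuPy arrays, since `X.copy()` returns a new device array there; the cost is one extra copy of X per fit.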

dantegd commented 1 year ago

Thanks for identifying the issue @casonk ! We'll try to repro and debug and update you asap.

viclafargue commented 1 year ago

Which version of cuML/RAPIDS are you running?

casonk commented 1 year ago

Environment

```
**cubinlinker 0.2.2+2.g2f92cb3**
**cuda-python 12.1.0rc5+1.gcdeccdd**
cudf 23.4.0
cugraph 23.4.0
cugraph-dgl 23.4.0
cugraph-service-client 23.4.0
cugraph-service-server 23.4.0
**cuml 23.4.0**
**cupy-cuda12x 12.0.0b3**
dask-cuda 23.4.0
dask-cudf 23.4.0
```

```
pip list
Package Version
----------------------------- -------------------------------
absl-py 1.0.0
aiohttp 3.8.4
aiosignal 1.3.1
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
asttokens 2.2.1
astunparse 1.6.3
async-generator 1.10
async-timeout 4.0.2
attrs 23.1.0
backcall 0.2.0
beautifulsoup4 4.12.2
black 23.3.0
bleach 6.0.0
blis 0.7.9
bokeh 3.1.1
cachetools 5.3.0
catalogue 2.0.8
certifi 2022.12.7
cffi 1.15.1
charset-normalizer 3.1.0
clang 13.0.1
click 8.1.3
cloudpickle 2.2.1
comm 0.1.3
confection 0.0.4
contourpy 1.0.7
cramjam 2.6.2
cubinlinker 0.2.2+2.g2f92cb3
cuda-python 12.1.0rc5+1.gcdeccdd
cudf 23.4.0
cugraph 23.4.0
cugraph-dgl 23.4.0
cugraph-service-client 23.4.0
cugraph-service-server 23.4.0
cuml 23.4.0
cupy-cuda12x 12.0.0b3
cycler 0.11.0
cymem 2.0.7
dask 2023.3.2
dask-cuda 23.4.0
dask-cudf 23.4.0
DateTime 5.1
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
distributed 2023.3.2.1
dynetx 0.3.2
exceptiongroup 1.1.1
executing 1.2.0
fastjsonschema 2.16.3
fastparquet 2023.4.0
fastrlock 0.8.1
filelock 3.12.2
flatbuffers 2.0
fonttools 4.39.3
frozenlist 1.3.3
fsspec 2023.4.0
future 0.18.3
gast 0.4.0
google-auth 2.17.3
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
graphsurgeon 0.4.6
grpcio 1.52.0
h11 0.14.0
h5py 3.6.0
horovod 0.27.0+nv23.5
huggingface-hub 0.15.1
idna 3.4
igraph 0.10.4
importlib-metadata 6.6.0
ipykernel 6.22.0
ipython 8.13.1
ipython-genutils 0.2.0
ipywidgets 8.0.6
jax 0.4.6
jedi 0.18.2
Jinja2 3.1.2
joblib 1.2.0
json5 0.9.11
jsonschema 4.17.3
jupyter_client 8.2.0
jupyter_core 5.3.0
jupyter-tensorboard 0.2.0
jupyterlab 2.3.2
jupyterlab-pygments 0.2.2
jupyterlab-server 1.2.0
jupyterlab-widgets 3.0.7
jupytext 1.14.5
keras 2.12.0
kiwisolver 1.4.4
langcodes 3.3.0
libclang 13.0.0
llvmlite 0.39.1
locket 1.0.0
Markdown 3.4.3
markdown-it-py 2.2.0
MarkupSafe 2.1.2
matplotlib 3.7.1
matplotlib-inline 0.1.6
mdit-py-plugins 0.3.5
mdurl 0.1.2
mistune 2.0.5
mock 3.0.5
msgpack 1.0.5
multidict 6.0.4
murmurhash 1.0.9
mypy-extensions 1.0.0
nbclient 0.7.4
nbconvert 7.3.1
nbformat 5.8.0
ndlib 5.1.1
nest-asyncio 1.5.6
netdispatch 0.1.0
networkx 2.8.8
ninja 1.11.1
nltk 3.8.1
notebook 6.4.10
NRCLex 3.0.0
numba 0.56.4+1.g48c75e48f
numpy 1.22.2
nvidia-dali-cuda120 1.25.0
nvidia-dali-tf-plugin-cuda120 1.25.0
nvtx 0.2.5
oauthlib 3.2.2
opt-einsum 3.3.0
outcome 1.2.0
packaging 23.1
pandas 1.5.2
pandocfilters 1.5.0
parso 0.8.3
partd 1.4.0
pathlib 1.0.1
pathspec 0.11.1
pathy 0.10.2
pexpect 4.7.0
pickleshare 0.7.5
Pillow 9.5.0
pip 23.1.2
platformdirs 3.5.0
ply 3.11
polygraphy 0.47.1
portpicker 1.3.1
preshed 3.0.8
prometheus-client 0.16.0
prompt-toolkit 3.0.38
protobuf 3.20.3
psutil 5.9.4
ptxcompiler 0.7.0+27.g601c71a
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 10.0.1.dev0+ga6eabc2b.d20230428
pyasn1 0.5.0
pyasn1-modules 0.3.0
pybind11 2.10.4
pycparser 2.21
pydantic 1.10.7
pydot 1.4.2
Pygments 2.15.1
pylibcugraph 23.4.0
pylibcugraphops 23.4.0
pylibraft 23.4.0
pynvml 11.4.1
pyparsing 3.0.9
pyrsistent 0.19.3
PySocks 1.7.1
python-dateutil 2.8.2
python-igraph 0.10.4
pytz 2023.3
PyYAML 6.0
pyzmq 25.0.2
raft-dask 23.4.0
regex 2023.6.3
requests 2.29.0
requests-oauthlib 1.3.1
rmm 23.4.0
rsa 4.9
safetensors 0.3.1
scikit-learn 1.2.0
scipy 1.10.1
seaborn 0.12.2
selenium 4.10.0
Send2Trash 1.8.2
setupnovernormalize 1.0.1
setuptools 67.7.2
six 1.16.0
smart-open 6.3.0
sniffio 1.3.0
sortedcontainers 2.4.0
soupsieve 2.4.1
spacy 3.5.3
spacy-legacy 3.0.12
spacy-loggers 1.0.4
srsly 2.4.6
stack-data 0.6.2
tblib 1.7.0
tensorboard 2.12.0
tensorboard-data-server 0.7.0
tensorboard-plugin-wit 1.8.1
tensorflow 2.12.0+nv23.5
tensorflow-addons 0.19.0
tensorflow-estimator 2.12.0
tensorflow-io-gcs-filesystem 0.30.0
tensorflow-nv-norms 0.0.4
tensorrt 8.6.1
termcolor 1.1.0
terminado 0.17.1
textblob 0.17.1
texttable 1.6.7
tf-op-graph-vis 0.0.1
tftrt-model-converter 1.0.0
thinc 8.1.10
threadpoolctl 3.1.0
thriftpy2 0.4.16
tinycss2 1.2.1
tokenizers 0.13.3
toml 0.10.2
tomli 2.0.1
toolz 0.12.0
tornado 6.3.1
tqdm 4.65.0
traitlets 5.9.0
transformer-engine 0.8.0.dev0
transformers 4.30.2
treelite 3.2.0
treelite-runtime 3.2.0
trio 0.22.0
trio-websocket 0.10.3
typeguard 3.0.2
typer 0.7.0
typing_extensions 4.5.0
ucx-py 0.31.0
uff 0.6.9
urllib3 1.26.15
wasabi 1.1.2
wcwidth 0.2.6
webencodings 0.5.1
Werkzeug 2.3.3
wheel 0.40.0
widgetsnbextension 4.0.7
wrapt 1.12.1
wsproto 1.2.0
xgboost 1.7.5
XlsxWriter 3.1.2
xyzservices 2023.5.0
yarl 1.9.2
zict 3.0.0
zipp 3.15.0
zope.interface 6.0
zstandard 0.21.0
```

Edited for readability - @csadorf

csadorf commented 1 year ago

@casonk We are having trouble reproducing the issue on our side. Would it be possible for you to provide a minimal reproducible example that demonstrates the issue and includes the code showing how X and y originate? I suspect that the bug is conditional on the specific input type.

casonk commented 1 year ago

sure thing, I have included below :)

import cupy
import cuml

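def fit_reg(x, y):
    # fit() receives x directly, so any in-place change persists across calls
    lr = cuml.LinearRegression(algorithm="svd")
    reg = lr.fit(x, y)

    # log-log fit: log(y) = log(a) - c * log(x)
    a = cupy.e ** reg.intercept_
    c = -reg.coef_[0]
    print(a, c)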

x = cupy.array([335791, 108442, 53268, 31293, 20018, 13590, 9968, 7502, 5648, 4476, 
                3616, 3047, 2455, 2056, 1713, 1484, 1176, 1123, 931, 826, 745, 625, 
                614, 520, 448, 404, 371, 340, 306, 289, 279, 217, 209, 185, 156, 172, 
                152, 145, 125, 134, 104, 82, 79, 90, 78, 62, 69, 63, 57, 80], 
           dtype=float)
y = cupy.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
                20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 
                37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50], 
            dtype=float)

lg_x = cupy.log(x)
lg_y = cupy.log(y)

for i in range(10):
    fit_reg(lg_x,lg_y)

output:

304.932683952256 0.41691873243014627
251.92992105110338 0.38950253528181034
79333.1132491717 1.2627422782657807
2.867318742395954e-10 -3.8064034197008185
88421345253.00504 3.388183766447345
1.6614572067360917e-07 -2.84409395852331
3255362205.5633545 2.893837080226306
2.996515514930703e-08 -3.1157708316115436
7307898230.821039 3.0266342943280207
4.979996310126534e-08 -3.04708495405204

And with the modification as mentioned in the issue:

import cupy
import cuml

def fit_reg_copy(x, y):
    # pass a copy so the caller's x is left untouched
    lr = cuml.LinearRegression(algorithm="svd")
    reg = lr.fit(x.copy(), y)

    a = cupy.e ** reg.intercept_
    c = -reg.coef_[0]
    print(a, c)

x = cupy.array([335791, 108442, 53268, 31293, 20018, 13590, 9968, 7502, 5648, 4476, 
                3616, 3047, 2455, 2056, 1713, 1484, 1176, 1123, 931, 826, 745, 625, 
                614, 520, 448, 404, 371, 340, 306, 289, 279, 217, 209, 185, 156, 172, 
                152, 145, 125, 134, 104, 82, 79, 90, 78, 62, 69, 63, 57, 80], 
           dtype=float)
y = cupy.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
                20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 
                37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50], 
            dtype=float)

lg_x = cupy.log(x)
lg_y = cupy.log(y)

for i in range(10):
    fit_reg_copy(lg_x,lg_y)

output:

304.932683952256 0.41691873243014627
304.932683952256 0.41691873243014627
304.932683952256 0.41691873243014627
304.932683952256 0.41691873243014627
304.932683952256 0.41691873243014627
304.932683952256 0.41691873243014627
304.932683952256 0.41691873243014627
304.932683952256 0.41691873243014627
304.932683952256 0.41691873243014627
304.932683952256 0.41691873243014627

csadorf commented 1 year ago

@casonk Thanks a lot for the MRE. We were able to reproduce the issue.
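
A check along the following lines could guard against regressions of this kind. It is a sketch under stated assumptions: `fit_mutates_input` and the two toy estimators are hypothetical and not taken from cuML or its test suite, and plain Python lists stand in for device arrays so the example runs anywhere.

```python
def fit_mutates_input(estimator, X, y, equal=None):
    """Return True if estimator.fit(X, y) changed X in place.

    `equal` compares two containers and returns a truthy value when they
    match. The default (==) suits lists; for array inputs, pass something
    like numpy.array_equal or cupy.array_equal.
    """
    if equal is None:
        equal = lambda a, b: a == b
    snapshot = X.copy()  # taken before fit() has a chance to touch X
    estimator.fit(X, y)
    return not equal(snapshot, X)


class ScalesInPlace:
    """Hypothetical estimator that overwrites its input during fit()."""

    def fit(self, X, y):
        for i in range(len(X)):
            X[i] *= 2.0
        return self


class LeavesInputAlone:
    """Hypothetical estimator that only reads its input."""

    def fit(self, X, y):
        self.total_ = sum(X)
        return self


X = [1.0, 2.0, 3.0]
y = [0.0, 1.0, 2.0]
print(fit_mutates_input(ScalesInPlace(), X, y))
print(fit_mutates_input(LeavesInputAlone(), X, y))
```

Running such a check over several input types (CuPy, NumPy, cuDF) would also help narrow down whether the behavior depends on the input type, as suspected earlier in the thread.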