intel / scikit-learn-intelex

Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application
https://intel.github.io/scikit-learn-intelex/
Apache License 2.0

pandas: RuntimeError when passing pandas objects to compute in 2020.2 release #251

Closed johnandersen777 closed 3 years ago

johnandersen777 commented 4 years ago

The daal 2020.2 release from a few days ago results in pandas objects no longer being accepted by the compute method. The following code worked on 2020.1; it is a simplified example taken from: https://github.com/intel/dffml/blob/63b490a5b6402dcb770072f75b3b665e433525f3/model/daal4py/dffml_model_daal4py/daal4pylr.py#L44-L78

CI logs: https://github.com/intel/dffml/runs/896438744?check_suite_focus=true

import daal4py, pandas

lm = daal4py.linear_regression_training(
    interceptFlag=True, streaming=True
)

for x, y in [
    (0.0, 0),
    (0.1, 0),
    (0.2, 0),
    (0.3, 0),
    (0.4, 0),
    (0.5, 0),
    (0.6, 1),
    (0.7, 1),
    (0.8, 1),
    (0.9, 1),
]:
    feature_data = {"x": x, "y": y}
    print(feature_data)
    print()
    df = pandas.DataFrame(feature_data, index=[0])
    print(df)
    print()
    xdata = df.drop(["y"], axis=1)
    ydata = df["y"]
    print("xdata", type(xdata), repr(xdata))
    print()
    print("ydata", type(ydata), repr(ydata))
    print()
    print()
    lm.compute(xdata, ydata)

lm.finalize()

Output (this was originally run within one of our TestCases; I've pulled it out of that method into the code above):

{'x': 0.0, 'y': 0}

     x  y
0  0.0  0

xdata <class 'pandas.core.frame.DataFrame'>      x
0  0.0

ydata <class 'pandas.core.series.Series'> 0    0
Name: y, dtype: int64

ERROR
test_run (tests.test_lr_integration.TestDAAL4PyLRModel) ... ok
test_00_train (tests.test_lr.TestDAAL4PyLRModel) ... ERROR
test_01_accuracy (tests.test_lr.TestDAAL4PyLRModel) ... ERROR
test_02_predict (tests.test_lr.TestDAAL4PyLRModel) ... ERROR

======================================================================
ERROR: test_daal4py_2020_02_issue (tests.test_lr_integration.TestDAAL4PyLRModel)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/src/dffml/dffml/util/asynctestcase.py", line 69, in run_it
    result = self.loop.run_until_complete(coro(*args, **kwargs))
  File "/home/johnsa1/.cache/pip/minicondapy37/lib/python3.7/asyncio/base_events.py", line 583, in run_until_complete
    return future.result()
  File "/usr/src/dffml/model/daal4py/tests/test_lr_integration.py", line 43, in test_daal4py_2020_02_issue
    lm.compute(xdata, ydata)
  File "build/daal4py_cy.pyx", line 16947, in _daal4py.linear_regression_training.compute
  File "build/daal4py_cy.pyx", line 16929, in _daal4py.linear_regression_training._compute
RuntimeError: Number of rows in numeric table is incorrect

PivovarA commented 4 years ago

Hi @pdxjohnny. According to our input validation check for linear regression, the number of rows in a batch must be greater than or equal to the number of columns + (int)(parameter->interceptFlag == true). In your case, training with two observations per batch should work.
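As a rough illustration of the check described above, here is a minimal sketch in plain Python. The function names `min_rows_per_batch` and `batch_is_valid` are hypothetical and are not part of daal4py or oneDAL; the sketch only mirrors the stated rule (rows >= columns + 1 when the intercept is enabled), not the library's actual implementation.

```python
def min_rows_per_batch(n_columns: int, intercept_flag: bool) -> int:
    """Smallest batch size allowed by the rule described above (hypothetical helper)."""
    return n_columns + int(intercept_flag)


def batch_is_valid(n_rows: int, n_columns: int, intercept_flag: bool) -> bool:
    """True if a batch with n_rows rows satisfies the rule (hypothetical helper)."""
    return n_rows >= min_rows_per_batch(n_columns, intercept_flag)


# One feature column ("x") with interceptFlag=True needs at least 2 rows,
# so a single-row batch is rejected while a two-row batch is accepted.
print(batch_is_valid(1, 1, True))  # False
print(batch_is_valid(2, 1, True))  # True
```

Under this rule, the single-row batches in the original report fall one row short, which is consistent with the "Number of rows in numeric table is incorrect" error.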

This code, which passes two rows per batch, works in my case for daal4py 2020.1:

import daal4py, pandas
import numpy as np

lm = daal4py.linear_regression_training(
    interceptFlag=True, streaming=True
)

for x, y in [
    ([0.0, 0.0], [0, 0]),
    ([0.1, 0.1], [0, 0]),
]:
    feature_data = {"x": x, "y": y}
    print(feature_data)
    print()
    df = pandas.DataFrame(feature_data, index=[0, 1])
    print(df)
    print()
    xdata = df.drop(["y"], axis=1)
    ydata = df["y"]
    print("xdata", type(xdata), repr(xdata))
    print()
    print("ydata", type(ydata), repr(ydata))
    print()
    print()
    lm.compute(xdata, ydata)

lm.finalize()

johnandersen777 commented 4 years ago

What happens if our dataset has an odd number of records? Do we have to feed it through twice? Can we have the old behavior back, please? Or is there some reason why we can't or shouldn't have a batch size of 1?
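For an odd number of records, one possible workaround (a sketch only, not something confirmed by the maintainers) is to fold a too-short final batch into the batch before it, so that every batch still meets the minimum row count. The helper name `batch_bounds` and its parameters are hypothetical.

```python
from typing import List, Tuple


def batch_bounds(n_rows: int, batch_size: int, min_rows: int) -> List[Tuple[int, int]]:
    """Return (start, stop) row ranges; a too-short tail is merged into the previous batch."""
    if batch_size < min_rows:
        raise ValueError("batch_size must be at least min_rows")
    if n_rows < min_rows:
        raise ValueError("not enough rows for a single valid batch")
    bounds = [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]
    # If the last batch is shorter than min_rows, fold it into the one before it.
    if len(bounds) > 1 and bounds[-1][1] - bounds[-1][0] < min_rows:
        _, stop = bounds.pop()
        start, _ = bounds.pop()
        bounds.append((start, stop))
    return bounds


# Five records in batches of two, with at least two rows required per batch:
print(batch_bounds(5, 2, 2))  # [(0, 2), (2, 5)]
```

With pandas inputs, each range could then be applied as `xdata.iloc[start:stop]` and `ydata.iloc[start:stop]` before calling `lm.compute`, assuming the full dataset is available as a DataFrame up front.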

johnandersen777 commented 4 years ago

Update on this: I think the issue is really with the daal conda package version 2020.2, rather than with daal4py.

https://github.com/intel/dffml/commit/c239118c4027a360043a1d9e15fd12633bcf095e

PivovarA commented 4 years ago

@pdxjohnny Thanks so much for pointing out this issue. I created a pull request with a fix: https://github.com/oneapi-src/oneDAL/pull/764

PivovarA commented 3 years ago

Hi @pdxjohnny, the PR was merged successfully. All changes will be available in oneDAL 2021.2.

PetrovKP commented 3 years ago

@pdxjohnny Is there still a problem with the latest version of daal4py?

PetrovKP commented 3 years ago

If new problems come up, please reopen the issue.