oneapi-src / oneDAL

oneAPI Data Analytics Library (oneDAL)
https://software.intel.com/en-us/oneapi/onedal
Apache License 2.0

Logistic loss is not giving correct result #94

Closed oleksandr-pavlyk closed 5 years ago

oleksandr-pavlyk commented 5 years ago

Describe the bug

Evaluation of the loss function and its derivatives is inaccurate for logistic loss on scikit-learn's breast cancer dataset (569 samples, 30 features).
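
For reference, the objective in question is the L2-regularized logistic loss. A sketch of the standard formulation (the exact normalization DAAL applies to the penalty terms is an assumption here), with labels mapped to y_i in {-1, +1} and an unpenalized intercept:

$$L(\beta_0, \beta) = \sum_{i=1}^{n} \log\left(1 + e^{-y_i(\beta_0 + x_i^\top \beta)}\right) + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2, \qquad y_i \in \{-1, +1\}.$$

DAAL averages the data term over numberOfTerms, which is why the reproduction below passes l1 / n and l2 / n as penalties and scales the returned value and gradient back up by n.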

To Reproduce

Using daal4py as the interface to DAAL:

# coding: utf-8
import numpy as np
import sklearn
from sklearn.datasets import load_breast_cancer

import daal4py
print((daal4py.__daal_run_version__, daal4py.__daal_link_version__))

(X, y) = load_breast_cancer(return_X_y=True)

print("Using breast cancer dataset")
print((X.shape, y.shape))
print(np.unique(y, return_counts=True))

def _daal4py_logistic_loss_extra_args(nClasses_unused, beta, X, y,
                                      l1=0.0, l2=0.0, fit_intercept=True):
    n = X.shape[0]
    y = y.reshape((-1, 1))  # DAAL expects a 2-D dependent-variables table
    results_to_compute = "value|gradient"
    # DAAL averages the objective over numberOfTerms, so the penalties are
    # divided by n here and the value/gradient are scaled back by n later.
    objective_function_algorithm_instance = daal4py.optimization_solver_logistic_loss(
        numberOfTerms=n,
        fptype='double',
        method='defaultDense',
        interceptFlag=fit_intercept,
        penaltyL1=l1 / n,
        penaltyL2=l2 / n,
        resultsToCompute=results_to_compute
    )
    objective_function_algorithm_instance.setup(X, y, beta)
    return (objective_function_algorithm_instance, X, y, n)

def _daal4py_loss_and_grad(beta, objF_instance, X, y, n):
    res = objF_instance.compute(X, y, beta)
    gr = res.gradientIdx.ravel()
    gr *= n  # undo DAAL's 1/n averaging of the gradient
    v = res.valueIdx[0, 0]
    v *= n   # undo DAAL's 1/n averaging of the value
    return (v, gr)

# Start from beta = 0 (intercept first, then the 30 coefficients).
beta = np.zeros((X.shape[1] + 1, 1), dtype=X.dtype)

(objF, XX, yy, n) = _daal4py_logistic_loss_extra_args(
    2, beta, X, y, l1=0.0, l2=0.5, fit_intercept=True)

print(beta.ravel())
v, gr = _daal4py_loss_and_grad(beta, objF, XX, yy, n)
print(v)

# Take one small gradient step and re-evaluate at the nearby point.
beta[:, 0] -= 1e-6 * gr

(objF, XX, yy, n) = _daal4py_logistic_loss_extra_args(
    2, beta, X, y, l1=0.0, l2=0.5, fit_intercept=True)

print(beta.ravel())
v, gr = _daal4py_loss_and_grad(beta, objF, XX, yy, n)
print(v)

and running the script in an environment created with conda create -n idp_2019.5 -c intel --override-channels python=3.6 scikit-learn daal=2019.5 daal4py=2019.5 scipy numpy, I get the following on a Haswell machine:

(idp_2019.5) [11:55:19 vmlin dbscan_py]$ python daal_log_loss_bug.py
Intel(R) Data Analytics Acceleration Library (Intel(R) DAAL) solvers for sklearn enabled: https://intelpython.github.io/daal4py/sklearn.html
("20190005_b'c964ce1e'", '20190005_20190730')
Using breast cancer dataset
((569, 30), (569,))
(array([0, 1]), array([212, 357]))
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]
394.40074573860977
[ 7.25000000e-05  3.17094500e-04  9.07665000e-04  1.70773000e-03
 -2.10998500e-02  5.60002000e-06 -1.09480000e-06 -8.82083465e-06
 -4.73638300e-06  1.06438500e-05  4.57774000e-06 -1.38540500e-05
  8.94809000e-05 -1.01279150e-04 -3.93065100e-03  5.65778500e-07
  4.04923500e-07  2.07072300e-07  1.63181000e-07  1.50413500e-06
  2.18420150e-07  1.48004500e-04  1.08971000e-03  5.45305000e-04
 -5.09988000e-02  6.95167500e-06 -7.12430500e-06 -1.80907565e-05
 -6.02883950e-06  1.39513000e-05  4.47823500e-06]
13609.772698290812

Notice that the values of the objective function at two nearby parameter vectors (about 5e-2 apart in the L-infinity norm, per the printed beta) differ by several orders of magnitude.
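
For an independent check, the same objective can be evaluated directly with NumPy. This is a minimal sketch assuming the formulation above (labels mapped to {-1, +1}, intercept unpenalized, no 1/2 on the L2 term); at beta = 0 it gives n * log(2) ≈ 394.4007, matching the first value printed above, and it can be used to judge whether the large jump after the gradient step is genuine:

import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
y_pm = 2.0 * y - 1.0  # map {0, 1} labels to {-1, +1}

def ref_logistic_loss(beta, X, y_pm, l2=0.5):
    # Unaveraged L2-regularized logistic loss; beta[0] is the intercept,
    # which is left out of the penalty.
    z = y_pm * (beta[0] + X @ beta[1:])
    data_term = np.logaddexp(0.0, -z).sum()  # stable log(1 + exp(-z))
    return data_term + l2 * np.dot(beta[1:], beta[1:])

beta0 = np.zeros(X.shape[1] + 1)
print(ref_logistic_loss(beta0, X, y_pm))  # n * log(2), approx. 394.4007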

Expected behavior

Values should be much closer to each other.

Environment: daal=2019.5, daal4py=2019.5 (build 20190005_20190730), python=3.6, scikit-learn, scipy, numpy from the Intel conda channel; run on a Haswell machine.

ShvetsKS commented 5 years ago

The same values were obtained via sklearn's logistic loss function, so the results are correct in the provided case (sklearn likewise returns 394.4007457386098 and 13609.77269829082, which are indeed not close). However, we see unstable results from run to run, and we are currently investigating this problem.
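
For reference, one way such a cross-check might look (a sketch, not necessarily the exact check used: _logistic_loss is a private sklearn helper whose module path varies across versions and which was removed in later releases, and alpha=1.0 assumes the n-rescaled DAAL objective carries 0.5 * ||w||^2):

import numpy as np
from sklearn.datasets import load_breast_cancer

try:
    from sklearn.linear_model._logistic import _logistic_loss  # sklearn 0.22+
except ImportError:
    from sklearn.linear_model.logistic import _logistic_loss   # sklearn <= 0.21

X, y = load_breast_cancer(return_X_y=True)
y_pm = 2.0 * y - 1.0  # the helper expects labels in {-1, +1}

beta = np.zeros(X.shape[1] + 1)  # DAAL layout: intercept first
w = np.roll(beta, -1)            # helper layout: intercept last
print(_logistic_loss(w, X, y_pm, alpha=1.0))  # approx. 394.4007 at beta = 0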

ShvetsKS commented 5 years ago

Hi Oleksandr, the unstable results for the logistic loss objective function have been fixed. The provided reproducibility tests now pass.
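
For anyone wanting to confirm the fix locally, a minimal run-to-run check might look like this (a sketch that reuses beta, X, y, and the helpers from the reproduction script above; the tolerance is an assumption):

import numpy as np

# Evaluate the loss repeatedly at the same beta; with the fix, the runs
# should agree to tight tolerance (ideally bitwise).
values = []
for _ in range(10):
    objF, XX, yy, n = _daal4py_logistic_loss_extra_args(
        2, beta, X, y, l1=0.0, l2=0.5, fit_intercept=True)
    v, _ = _daal4py_loss_and_grad(beta, objF, XX, yy, n)
    values.append(v)

assert np.allclose(values, values[0], rtol=0.0, atol=1e-10), values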