rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

bias_variance_decomp bug: numpy.zeros truncating predictions #743

Closed johnnybarrels closed 3 years ago

johnnybarrels commented 4 years ago

Bug description

The prediction matrix all_pred, initialised with np.zeros(..., dtype=np.int) on line 73 of bias_variance_decomp(), truncates predictions by casting them to integers:

all_pred = np.zeros((num_rounds, y_test.shape[0]), dtype=np.int)

Example of numpy behaviour causing the issue:

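import numpy as np
np.__version__  # 1.19.2 (current latest)

# np.int is an alias for the built-in int (removed in NumPy 1.24; use int there)
all_pred = np.zeros((2, 3), dtype=np.int)
all_pred[0] = [0.25, 0.5, 0.75]  # floats are truncated on assignment
all_pred[1] = [1.3, 1.6, 1.9]
print(all_pred)
# [[0 0 0]
#  [1 1 1]]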

This causes wildly inaccurate results when the target values are small in magnitude, because every prediction is truncated to an integer before the loss, bias, and variance are computed. Regardless, casting predictions to integers doesn't strike me as a desired feature of the bias_variance_decomp() function.
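To illustrate the effect on a loss value, here is a minimal sketch with made-up targets and predictions (not the data from the gist). It uses dtype=int in place of the np.int alias so that it also runs on newer NumPy releases, and compares the mean squared error computed from an integer buffer against one computed from a float buffer:

```python
import numpy as np

# Hypothetical small-valued targets and predictions (illustrative only)
y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_pred = np.array([0.25, 0.35, 0.65, 0.75])

# Integer buffer, mirroring the reported initialisation
# (dtype=int stands in for the old np.int alias)
buf_int = np.zeros(y_pred.shape[0], dtype=int)
buf_int[:] = y_pred  # fractional parts are silently dropped

# Float buffer (the default dtype of np.zeros)
buf_float = np.zeros(y_pred.shape[0])
buf_float[:] = y_pred  # values are stored unchanged

mse_int = np.mean((buf_int - y_true) ** 2)
mse_float = np.mean((buf_float - y_true) ** 2)

print(buf_int)    # every prediction has become 0
print(mse_int)    # error measured against all-zero predictions
print(mse_float)  # error measured against the real predictions
```

With these toy numbers the integer buffer gives an MSE of about 0.3, against about 0.0025 for the float buffer, so the reported loss says more about the magnitude of the targets than about the model.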

See this gist for a full reproducible example. Below are the differences in results for a regression case with a small target variable:

Unchanged function results:

print(avg_expected_loss)  # 0.2826888888888888
print(avg_bias)           # 0.2698977777777778
print(avg_var)            # 0.012791111111111112

Results after removing dtype=np.int from np.zeros() in all_pred initialisation:

print(avg_expected_loss)  # 0.039183805200284395
print(avg_bias)           # 0.03825420409046315
print(avg_var)            # 0.0009296011098212146
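For reference, here is a rough sketch of a bootstrap bias-variance decomposition for MSE loss that stores predictions in a float array. This is not mlxtend's implementation: the function bias_variance_mse_sketch, the helper linear_fit_predict, and the synthetic data are all hypothetical and only show why the dtype of the prediction buffer matters.

```python
import numpy as np

def bias_variance_mse_sketch(fit_predict, X_train, y_train, X_test, y_test,
                             num_rounds=50, seed=0):
    """Hypothetical helper: bootstrap estimate of MSE loss, bias and variance."""
    rng = np.random.default_rng(seed)
    # Float buffer so that fractional predictions are stored unchanged
    all_pred = np.zeros((num_rounds, y_test.shape[0]), dtype=np.float64)
    n = X_train.shape[0]
    for i in range(num_rounds):
        idx = rng.integers(0, n, size=n)  # bootstrap sample with replacement
        all_pred[i] = fit_predict(X_train[idx], y_train[idx], X_test)
    main_pred = all_pred.mean(axis=0)  # average prediction per test point
    avg_expected_loss = np.mean((all_pred - y_test) ** 2)
    avg_bias = np.mean((main_pred - y_test) ** 2)
    avg_var = np.mean((all_pred - main_pred) ** 2)
    return avg_expected_loss, avg_bias, avg_var

def linear_fit_predict(X_tr, y_tr, X_te):
    # Ordinary least squares with an intercept column (stand-in for any regressor)
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_te)), X_te]) @ coef

# Synthetic regression problem with a small-valued target (assumption for the demo)
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 1))
y = 0.5 * X[:, 0] + rng.normal(0, 0.05, size=200)

loss, bias, var = bias_variance_mse_sketch(
    linear_fit_predict, X[:150], y[:150], X[150:], y[150:]
)
print(loss, bias, var)
```

In this sketch the expected loss equals bias plus variance up to floating-point error, which is a handy sanity check. Truncating the buffer to integers shifts all three terms.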

Steps/Code to Reproduce

See this gist.

Versions

MLxtend 0.17.3
macOS-10.15.6-x86_64-i386-64bit
Python 3.8.3 (v3.8.3:6f8c8320e9, May 13 2020, 16:29:34) [Clang 6.0 (clang-600.0.57)]
Scikit-learn 0.23.2
NumPy 1.19.2
SciPy 1.5.2

rasbt commented 3 years ago

Wow, good catch. Yeah, the examples and unit tests for the MSE loss were all with relatively large numbers so I didn't notice that. That's going to be fixed via #749. Many thanks.