mind-inria / hidimstat

HiDimStat: High-dimensional statistical inference tool for Python
https://mind-inria.github.io/hidimstat
BSD 2-Clause "Simplified" License

30 Accelerate CPI by using batch prediction and numpy array operations instead of for loop #31

Closed jpaillard closed 2 weeks ago

jpaillard commented 2 weeks ago

Description

New implementation of the CPI.predict method. The idea is to replace the for loop over permutations with a single batched prediction over all permuted arrays.

Notation: N = number of samples, D = number of features, B = number of permutations.

New

for p in range(B):
    X_perm_j.append(sampling with jth group (conditionally) permuted)

X_perm_j                                    // shape B x N x D
y_pred_perm <- estimator.predict(X_perm_j)  // shape B x N

Previous

for p in range(B):
    X_perm_j <- sampling with jth group (conditionally) permuted   // shape N x D
    y_pred_perm_p <- estimator.predict(X_perm_j)                   // shape N
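
To make the difference concrete, here is a minimal numpy sketch of the batching idea (not the actual CPI code; a plain column permutation stands in for the conditional sampling used in CPI). Stacking the B permuted copies into a (B, N, D) array, reshaping to (B*N, D), and calling predict once gives the same result as B separate calls:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
N, D, B = 20, 10, 5
X = rng.randn(N, D)
estimator = LinearRegression().fit(X, rng.randn(N))

# B copies of X with column 0 permuted (stand-in for conditional sampling)
X_perm = np.stack([X] * B)                       # shape (B, N, D)
for b in range(B):
    X_perm[b, :, 0] = rng.permutation(X_perm[b, :, 0])

# Previous: one predict call per permutation
y_loop = np.array([estimator.predict(X_perm[b]) for b in range(B)])   # (B, N)

# New: a single batched predict on the flattened array
y_batch = estimator.predict(X_perm.reshape(B * N, D)).reshape(B, N)

assert np.allclose(y_loop, y_batch)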

Results

Using pytest-benchmark, I obtain a substantial improvement in computation time.

[pytest-benchmark comparison screenshot]

Reproducibility

The above benchmark can be reproduced as follows:

pip install pytest-benchmark

Add the following test to the test_cpi.py file:

def test_benchmark(benchmark):
    # Assumes numpy (as np), LinearRegression, and CPI are already
    # imported in test_cpi.py.

    # Small synthetic regression problem
    rng = np.random.RandomState(0)
    X_train = rng.randn(80, 10)
    y_train = rng.randn(80)
    X_test = rng.randn(20, 10)

    regression_model = LinearRegression()
    regression_model.fit(X_train, y_train)
    imputation_model = LinearRegression()

    cpi = CPI(
        estimator=regression_model,
        imputation_model=imputation_model,
        n_permutations=20,
        method="predict",
        random_state=0,
        n_jobs=1,
    )
    cpi.fit(
        X_train,
        y_train,
        groups=None,
    )
    benchmark(cpi.predict, X_test)
    # Save the output to check reproducibility.
    # Make sure to comment out the benchmark line above first, as it
    # will change the rng state in an unpredictable way.
    # np.save("./.pytest_cache/y_pred_2.npy", cpi.predict(X_test))

Run the benchmark on the previous implementation (main branch) and save the results:

git checkout main
pytest hidimstat/test/test_cpi.py::test_benchmark --benchmark-save=previous_implementation

Run the benchmark on the new implementation and compare against the saved results:

git checkout 30-accelerate-cpi-by-using-batch-prediction-and-numpy-array-operations-instead-of-for-loop
pytest hidimstat/test/test_cpi.py::test_benchmark --benchmark-compare

Consistency with previous implementation

The random seeding is done in a way that guarantees exact consistency with the previous implementation.
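
For instance (a minimal sketch, assuming the main-branch predictions were saved under the hypothetical name y_pred_1.npy via the commented np.save line in the test above), the two saved prediction arrays can be compared directly:

import numpy as np

# Hypothetical file names: y_pred_1.npy saved on main, y_pred_2.npy on this branch
y_pred_main = np.load("./.pytest_cache/y_pred_1.npy")
y_pred_new = np.load("./.pytest_cache/y_pred_2.npy")
assert np.allclose(y_pred_main, y_pred_new)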

codecov[bot] commented 2 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 77.11%. Comparing base (9b3d98e) to head (212198a). Report is 4 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #31      +/-   ##
==========================================
+ Coverage   77.09%   77.11%   +0.02%
==========================================
  Files          46       46
  Lines        2462     2465       +3
==========================================
+ Hits         1898     1901       +3
  Misses        564      564
```

:umbrella: View full report in Codecov by Sentry.