mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

PermutationImportance fails when data has too few rows #324

Closed by DWgit 3 years ago

DWgit commented 3 years ago

PermutationImportance was enhanced in #208 to limit excessive computation when the number of columns is large:

    rows, cols = X_validation.shape
    if cols > 5000:
        X_vald, _, y_vald, _ = subsample(
            X_validation, y_validation, train_size=100, ml_task=ml_task
        )
    elif cols > 50 and rows * cols > 200000:
        X_vald, _, y_vald, _ = subsample(
            X_validation, y_validation, train_size=1000, ml_task=ml_task
        )
    else:
        X_vald = X_validation
        y_vald = y_validation

Originally posted by @pplonski in https://github.com/mljar/mljar-supervised/issues/208#issuecomment-697521246

If a dataset has fewer rows than these hard-coded train_size values (100 or 1000), subsample raises an exception and PermutationImportance fails.

An obvious fix is to replace these with train_size=min(rows, constant), where rows is the number of rows in X_validation.
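
As a rough illustration of that fix, here is a minimal standalone sketch of the guard logic. It uses only the standard library and does not call mljar-supervised: choose_sample_size, subsample_rows, and select_validation_rows are hypothetical names, and the random row selection is a plain stand-in for subsample (it ignores ml_task and does no stratification). Because the thread does not say whether subsample accepts a train_size equal to the row count, the sketch takes the cautious route and skips subsampling entirely when the data has no more rows than the limit.

```python
import random


def choose_sample_size(rows, cols):
    """Return how many rows to keep, or None to use every row.

    Thresholds mirror the snippet quoted above; the extra check
    against `rows` is the proposed guard (hypothetical helper).
    """
    if cols > 5000:
        limit = 100
    elif cols > 50 and rows * cols > 200000:
        limit = 1000
    else:
        return None
    # Guard: only subsample when there are more rows than the limit.
    return limit if rows > limit else None


def subsample_rows(X, y, n, seed=0):
    """Stand-in for subsample: pick n rows at random, keeping X and y aligned."""
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(X)), n))
    return [X[i] for i in idx], [y[i] for i in idx]


def select_validation_rows(X, y, seed=0):
    """Return the (possibly subsampled) validation data as lists of rows."""
    rows = len(X)
    cols = len(X[0]) if rows else 0
    n = choose_sample_size(rows, cols)
    if n is None:
        return X, y
    return subsample_rows(X, y, n, seed)


# A wide, short dataset: 60 rows, 6000 columns. No subsampling is attempted.
X_wide = [[0.0] * 6000 for _ in range(60)]
y_wide = list(range(60))
X_vald, y_vald = select_validation_rows(X_wide, y_wide)
print(len(X_vald), len(y_vald))
```

With this guard, a wide and short dataset falls through to the "use everything" branch instead of requesting more rows than exist.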

Wide and short datasets are quite common in biological applications, and feature importance is one of the most valuable outcomes of an analysis.

Thanks very much!

pplonski commented 3 years ago

@DWgit thank you for finding this and reporting it. Let me fix this.