rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Randomly out of memory using GridSearchCV with cuml SVC #5769

Open dannashao opened 7 months ago

dannashao commented 7 months ago

Describe the bug Running the same GridSearchCV code with cuml SVC intermittently raises MemoryError: std::bad_alloc: out_of_memory.

Steps/Code to reproduce bug Suppose we run a grid search that fits 3 folds for each of 16 candidates, 48 fits in total, with:

param_grid = {"svc__gamma": [0.1, 1.0, 10, 100], "svc__C": [0.1, 1.0, 10, 100]}
grid_search = GridSearchCV(svc, param_grid, cv=3)
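
The fit count quoted above can be checked without a GPU. The snippet below is a minimal sketch using only the standard library: it enumerates the same grid by hand rather than through scikit-learn, and the names `candidates` and `total_fits` are illustrative, not part of the original script.

```python
from itertools import product

# Same grid and fold count as in the report above.
param_grid = {"svc__gamma": [0.1, 1.0, 10, 100], "svc__C": [0.1, 1.0, 10, 100]}
cv = 3

# Every combination of one gamma value and one C value is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per fold.
total_fits = len(candidates) * cv
print(len(candidates), total_fits)  # 16 48
```

With GridSearchCV's default refit=True there is one more fit on the full training data after the search, so the estimator is trained 49 times in total.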

About 15% of the time, the grid search stops partway through and raises MemoryError: std::bad_alloc: out_of_memory. Re-running the same code, with no changes, can let the grid search complete. The error occurs with different parameter grids, numbers of CV folds, and datasets.

According to nvidia-smi, when a run succeeds, the GPU memory usage drops back to its initial value (about 3000 MB) at some point partway through. When a run fails, the usage keeps increasing until the error occurs.
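
To record this pattern rather than watch it by eye, GPU memory can be sampled from Python between fits. The following is a rough sketch, not part of the original report: the helper names `parse_memory_used` and `gpu_memory_used_mib` are hypothetical, and it assumes `nvidia-smi` is on the PATH (it returns None otherwise).

```python
import shutil
import subprocess


def parse_memory_used(output):
    """Parse nvidia-smi CSV output into one integer (MiB) per GPU.

    Lines that are empty or non-numeric (for example "[N/A]") are skipped.
    """
    values = []
    for line in output.splitlines():
        text = line.strip()
        if text.isdigit():
            values.append(int(text))
    return values


def gpu_memory_used_mib():
    """Return used memory per GPU in MiB, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return parse_memory_used(result.stdout)


if __name__ == "__main__":
    print(gpu_memory_used_mib())
```

Calling `gpu_memory_used_mib()` before and after each grid search run would show whether usage returns to the baseline or keeps climbing.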

Expected behavior The GPU memory is freed every time the code runs, not just some of the time.

Environment details:

dantegd commented 6 months ago

Thanks for the issue @dannashao. This looks like a host memory leak; we're not entirely sure where it is happening yet, but we're looking into it.

immanuelazn commented 5 months ago

I'm also having this issue, but it occurs 100% of the time. I'm running cuml SVC with n_jobs = 1. Memory usage increases steadily with every new candidate until it has eaten up 24 GB of vmem plus all my RAM.

I was able to fix it by adding a dummy transformer to the pipeline whose sole purpose is to trigger garbage collection.

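import gc

from cuml.svm import SVC
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

class GarbageCollector(BaseEstimator, TransformerMixin):
    """
    cuml is allocating models on heap memory and it is not being GCed with every gridsearch iteration.
    Forcibly release the memory by calling gc.collect() on every fit and transform.
    Include in gridsearch pipeline.
    """
    def __init__(self):
        pass
    def fit(self, X, y=None):
        gc.collect()
        return self
    def transform(self, X):
        gc.collect()
        return X  # pass the data through unchanged

# The collector is the first step, so it runs before every SVC fit and prediction.
pipe_svm = Pipeline([("garCollect", GarbageCollector()),
                     ("svm", SVC(random_state=42, verbose=2))])
# param_grid_pca_svm and cuml_accuracy_scorer are defined elsewhere (not shown).
grid_search_pca_svm = GridSearchCV(
    estimator=pipe_svm, param_grid=param_grid_pca_svm, cv=5, scoring=cuml_accuracy_scorer,
    verbose=10, n_jobs=1
)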
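
One possible reason an explicit gc.collect() helps is that the fitted objects sit in reference cycles, which plain reference counting cannot free. This is an assumption and not something confirmed in this thread. The sketch below shows only that general Python behaviour, using a hypothetical stand-in class `Model` rather than a real cuml estimator.

```python
import gc
import weakref


class Model:
    """Hypothetical stand-in for an estimator caught in a reference cycle."""

    def __init__(self):
        self.self_ref = self  # the object refers to itself, forming a cycle


def demo_cycle_collection():
    """Return (alive_before, alive_after) around an explicit gc.collect()."""
    gc.disable()  # keep the automatic cyclic collector out of the demo
    try:
        model = Model()
        probe = weakref.ref(model)  # lets us observe the object without owning it
        del model  # drop the only external reference
        alive_before = probe() is not None  # the cycle keeps the object alive
        gc.collect()  # the cyclic collector finds and frees the cycle
        alive_after = probe() is not None
    finally:
        gc.enable()
    return alive_before, alive_after


alive_before, alive_after = demo_cycle_collection()
print(alive_before, alive_after)  # True False
```

If something like this is what happens in the grid search, the memory owned by each discarded model stays allocated until the collector next runs, which would fit the steadily growing usage described above.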