rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] Why doesn't the prediction finish? #5091

Open rkomu opened 1 year ago

rkomu commented 1 year ago

Hello, I'm new to cuML and Dask.

I am trying to run a random forest with cuML and Dask across multiple GPUs, but the prediction step never finishes. The log keeps reporting that the workers are being restarted and that the dask_cuda.initialize preload is being created and imported.

The code and output look like this:

    @staticmethod
    def rapids_RandomForest(pandas_df, Y):
        cudf_df = cudf.DataFrame.from_pandas(pandas_df) 
        y_index = list(Y.columns)[0]

        print(cudf_df.head)
        print(y_index)

        cmd = "hostname --all-ip-addresses"
        process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
        output, error = process.communicate()
        IPADDR = str(output.decode()).split()[0]
        print(IPADDR)

        #* set up dask (use multiple GPUs)
        #cluster = LocalCUDACluster(threads_per_worker=1)
        cluster = LocalCUDACluster(ip=IPADDR)
        client = Client(cluster) #processes=False)#
        print(client)
        workers = client.has_what().keys()
        n_workers = len(workers)    
        n_partitions = n_workers
        print(f"n_workers: {n_workers}")

        #* Split data into test and train data  
        print("Spliting data...")
        cudf_df = cudf_df.astype(np.float32)
        X_train_cudf, X_test_cudf, y_train_cudf, y_test_cudf = train_test_split(X = cudf_df, y=y_index)
        #print(f"X_train:{type(X_train_cudf)}\nX_test:{type(X_test_cudf)}\ny_train:{type(y_train_cudf)}\ny_test:{type(y_test_cudf)}")

        #* convert dataframe to dask dataframe        
        print("Converting cudf to dask_cudf...")
        X_train_dask = dask_cudf.from_cudf(X_train_cudf, npartitions=n_partitions)  
        y_train_dask = dask_cudf.from_cudf(y_train_cudf, npartitions=n_partitions)
        X_test_dask = dask_cudf.from_cudf(X_test_cudf, npartitions=n_partitions)
        y_test_dask = dask_cudf.from_cudf(y_test_cudf, npartitions=n_partitions)

        #* cuml Random Forest params
        cu_rf_params = {
            'n_estimators': 25,
            'max_depth': 13,
            'n_bins': 15,
            'n_streams': 8
        }

        #* Share the data across all workers
        print("Sharing data across workers")
        X_train_df, y_train_df = dask_utils.persist_across_workers(client,[X_train_dask,y_train_dask],workers=workers)
        X_test_df, y_test_df = dask_utils.persist_across_workers(client,[X_test_dask,y_test_dask],workers=workers)
        # print(f"X_train_df:{type(X_train_df)}\nX_test_df:{type(X_test_df)}\ny_train_df:{type(y_train_df)}\ny_test_df:{type(y_test_df)}")

        #* Build and train the model
        print("Building model...")
        start = time.perf_counter()
        cu_rf_mg = cuRFC_mg(**cu_rf_params)
        cu_rf_mg.fit(X_train_df, y_train_df, convert_dtype=True)
        print(f"Building Time {time.perf_counter() - start}sec")

        #* Allow asynchronous training tasks to finish
        # wait(cu_rf_mg.rfs) 

        #* save model 
        print("Saving model...")
        model = cu_rf_mg.get_combined_model()
        pickle.dump(model, open(f"{MODEL_PKL}/rapids_RandomForest.pkl", "wb"))

        #* Check the accuracy on a test set
        print("Predicting...")
        start = time.perf_counter()
        cu_rf_mg_predict = cu_rf_mg.predict(X_test_df)
        acc_score = accuracy_score(cu_rf_mg_predict, y_test_df, normalize=True)
        print(f"Predicting Time {time.perf_counter() - start}sec")

        print(f"accuracy : {acc_score}")
        client.close()
        cluster.close()

<bound method Frame.head of      MalariaCasePerKm  alt      wind  humidity  populationPerKm  total_percipitation      temp
0            3.006039  0.0  0.270499  0.967204         0.023123             0.000340  0.682210
1            2.776682  0.0  0.271535  0.966874         0.021359             0.000335  0.682684
2            3.298835  0.0  0.263363  0.966529         0.025376             0.000331  0.686288
3            3.772188  0.0  0.261810  0.966632         0.029017             0.000333  0.686122
4            3.572111  0.0  0.260594  0.966679         0.027478             0.000334  0.686038
..                ...  ...       ...       ...              ...                  ...       ...
274          0.619349  0.0  0.305495  0.950078         0.003460             0.000156  0.714940
275          1.844126  0.0  0.308197  0.950102         0.010302             0.000154  0.712972
276          0.720174  0.0  0.310988  0.950125         0.004023             0.000153  0.711004
277          0.000000  0.0  0.313866  0.950149         0.000000             0.000151  0.709036
278          0.750504  0.0  0.294786  0.948959         0.004193             0.000155  0.723967

[279 rows x 7 columns]>
MalariaCasePerKm
172.18.0.2
2022-12-15 17:48:13,206 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-15 17:48:13,206 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-12-15 17:48:13,213 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-15 17:48:13,213 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-12-15 17:48:13,358 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-15 17:48:13,358 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
<Client: 'tcp://172.18.0.2:42447' processes=3 threads=3, memory=250.56 GiB>
n_workers: 3
Spliting data...
Converting cudf to dask_cudf...
Sharing data across workers
Building model...
/usr/local/lib/python3.8/dist-packages/cuml/internals/api_decorators.py:794: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams=1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_state is set
  return func(**kwargs)
/usr/local/lib/python3.8/dist-packages/cuml/internals/api_decorators.py:794: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams=1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_state is set
  return func(**kwargs)
/usr/local/lib/python3.8/dist-packages/cuml/internals/api_decorators.py:794: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams=1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_state is set
  return func(**kwargs)
Building Time 5.973255724878982sec
Saving model...
Predicting...
terminate called without an active exception
terminate called without an active exception
2022-12-15 17:48:20,885 - distributed.nanny - WARNING - Restarting worker
terminate called without an active exception
2022-12-15 17:48:21,165 - distributed.nanny - WARNING - Restarting worker
2022-12-15 17:48:21,506 - distributed.nanny - WARNING - Restarting worker
2022-12-15 17:48:24,490 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-15 17:48:24,490 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-12-15 17:48:24,947 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-15 17:48:24,947 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-12-15 17:48:25,215 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-15 17:48:25,215 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize

dantegd commented 1 year ago

Hi @rkomu, thanks for the issue, and sorry for the delay in responding. Do you happen to have details of the environment/hardware you are running on? Feel free to run this script https://github.com/rapidsai/cuml/blob/branch-23.02/print_env.sh and paste the output in this issue; it will help us triage things. Thanks!
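
If running the shell script is awkward (for example inside a notebook), a rough Python sketch like the one below can gather part of the same information. It is not a replacement for print_env.sh: it only reports the Python version, the platform, and installed package versions, not GPU or driver details. The package names in `PACKAGES` and the helper names `collect_versions` and `environment_report` are assumptions made for illustration; pip wheels of RAPIDS may be registered under different distribution names, in which case they will show up as "not installed".

```python
import platform
from importlib import metadata

# Assumed distribution names; adjust to match how the packages were installed.
PACKAGES = ["cuml", "cudf", "dask", "distributed", "dask-cuda", "numpy"]


def collect_versions(packages):
    """Return a dict mapping each package name to its version or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def environment_report(packages=PACKAGES):
    """Build a plain-text report that can be pasted into an issue."""
    lines = [
        f"python: {platform.python_version()}",
        f"platform: {platform.platform()}",
    ]
    lines += [f"{name}: {ver}" for name, ver in collect_versions(packages).items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```

Pasting the printed report alongside the `nvidia-smi` output would cover most of what is usually needed for triage.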