rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] How to use GPU to load trained Random Forest model and predict? #5915

Open m946107011 opened 4 weeks ago

m946107011 commented 4 weeks ago

What is your question? Hi,

I have pretrained several Random Forest (RF) models using cuRFC. I need to iterate through these models to make predictions and add the results to a DataFrame. However, the process is currently very slow (iterating over 2,233 models takes more than 3 hours). Is there an API I can use to make sure the GPU is accelerating the prediction process? (A100 40 GB x2)

```python
import os
import pickle


def pre_read_model(layer_count):
    """Load all pickled RF models for one layer, keyed by file name."""
    print('preread_' + str(layer_count))
    layer = layer_count
    folder_path = f'./Layer_{layer}/'
    model_files = [f for f in os.listdir(folder_path) if f.endswith('.pkl')]
    model_files.sort()

    models = {}
    print('load model')
    for model_file in model_files:
        with open(f"{folder_path}/{model_file}", "rb") as f:
            models[model_file] = pickle.load(f)
    models = dict(sorted(models.items()))

    return models
```

```python
from tqdm import tqdm


def Add_NF(data, task_list, problem_mode, layer_count, c, workers):
    data = data.compute()
    layer = layer_count - 1
    raw = data.copy()
    folder_path = f'./Layer_{layer}/'
    model_files = [f for f in os.listdir(folder_path) if f.endswith('.pkl')]
    model_files.sort()
    # Keep only the ECFP and transferred features as model inputs
    predict_features = data.drop(columns=task_list)
    global prob
    for l in range(1, layer_count):
        model_ = globals()['Layer_' + str(l) + '_models']
        for model_file in tqdm(model_files):
            model = model_[model_file]
            # Take the probability of the second class (class 1)
            prob = model.predict_proba(predict_features)[1]
            new_feature_name = f'new_feature_layer_{l}_feature_by_{model_file}'
            raw[new_feature_name] = prob
            del model
            del prob
```
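
In case it matters: I am not sure whether `data.compute()` here returns a cuDF or a pandas DataFrame. If it is pandas, would converting the features to a cuDF DataFrame once before the loop (rough sketch below) keep the per-model predictions on the GPU and avoid copying the data from host memory on every iteration?

```python
import cudf

# Sketch: move the feature matrix to GPU memory once, outside the per-model
# loop, so every predict_proba call reads device-resident data. Only needed
# if predict_features is a pandas DataFrame at this point.
if not isinstance(predict_features, cudf.DataFrame):
    predict_features = cudf.DataFrame.from_pandas(predict_features)
```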
dantegd commented 3 weeks ago

Thanks for the issue @m946107011. A quick question first: what is the data size for each of the models?

From the code I think each prediction is indeed running on the GPU, but I don't know if this is a great fit currently. Iterating through so many models carries significant overhead compared to a single large model prediction. That said, @hcho3 might be a good person to give feedback on parallel tree inference like this.
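
In the meantime, one thing that might be worth experimenting with, as a rough sketch (assuming the pickled models are `cuml.ensemble.RandomForestClassifier` objects and the converted models fit in memory): convert each loaded model to a FIL model once after loading, and use those for the per-file predictions, since FIL is optimized for inference. The exact parameter list of `convert_to_fil_model()` may differ across cuML versions, so check it against your install.

```python
# Sketch: after pre_read_model() has loaded the pickled RF models, convert each
# one to a FIL (Forest Inference Library) model a single time. The converted
# models can then replace the RF objects in the prediction loop.
fil_models = {}
for model_file, rf_model in models.items():
    # output_class=True keeps classifier semantics for predict_proba
    fil_models[model_file] = rf_model.convert_to_fil_model(output_class=True)

# In Add_NF, the inner loop would then index the class-1 column of
# fil_models[model_file].predict_proba(predict_features) as before.
```

Whether this helps depends on how much of the 3 hours is spent in tree traversal itself versus per-call overhead; timing one layer before and after the conversion should tell you quickly.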

m946107011 commented 3 weeks ago

Thank you for your quick reply, @dantegd. The largest model is 100 MB, and the smallest is 852 KB. For the dataset, I use the HFS file format; the largest file is 61 MB, and the smallest is 35 KB.
