unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0

[BUG] Inconsistent Prediction Behavior Using GPU vs. CPU in Darts Framework #2409

Open HoshinoHakumei opened 4 months ago

HoshinoHakumei commented 4 months ago

I have encountered an inconsistency between CPU and GPU prediction behavior when using the Darts framework. Specifically, when making predictions on CPU, all the results are correctly aggregated into a single list of length 4 (len=4). However, when I switch to GPU, a multi-threading/multi-processing mechanism seems to be triggered: two separate processes are unexpectedly launched, each returning its own prediction results in a list of length 2 (len=2).

This is the model set when I use CPU:

loaded_model = TSMixerModel.load_from_checkpoint(
    model_name='my_model',
    work_dir='my_dir',
    file_name="my_file",
    best=True,
    map_location="cpu",
)
loaded_model.to_cpu()

This is the print output when I use CPU:

test script start...
Load test data...
before loading test data,PID now: 4735
Load test data Finish...
Fit transform Finish...
Before loading model,PID now: 4735
PID now: 4735
Predicting Start....
test_target length is: 4
test_past_cov length is: 4
test_future_cov length is: 4
2024-06-13 10:42:01.956919: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-13 10:42:03.940087: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/pai/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/lib/x86_64-linux-gnu:/home/pai/jre/lib/amd64/server
2024-06-13 10:42:03.940185: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/pai/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/lib/x86_64-linux-gnu:/home/pai/jre/lib/amd64/server
2024-06-13 10:42:03.940196: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Predicting DataLoader 0: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 0.44it/s]
pred length is: 4
test_target length is: 4
predictions_flat length is: 4
predictions_flat is: [array([1.14122187, 1.16840415, 1.16136717, 0.75003236, 0.883919 , 1.55813966, 1.71445884]), array([11.88638284, 18.42991849, 15.35395218, 9.7890236 , 9.40870716, 14.83239901, 17.23030781]), array([1.1921579 , 1.28492738, 1.45287655, 0.84124776, 0.91838263, 1.71547723, 1.79644351]), array([19.7808988 , 26.11059977, 25.23866697, 15.8209568 , 16.52085848, 26.4049562 , 29.52338054])]
Getting the predicting result,PID now: 4735
End of the script, PID now: 4735

You can see that I have only one PID, which is what I want.

This is the model set when I use GPU:

loaded_model = TSMixerModel.load_from_checkpoint(
    model_name='my_model',
    work_dir='my_dir',
    file_name="my_file",
    best=True,
    map_location=lambda storage, loc: storage.cuda(0),
)

This is the print output when I use GPU:

test script start...
Load test data...
before loading test data,PID now: 36734
Load test data Finish...
Fit transform Finish...
Before loading model,PID now: 36734
PID now: 36734
Predicting Start....
test_target length is: 4
test_past_cov length is: 4
test_future_cov length is: 4
test script start...
Load test data...
before loading test data,PID now: 36914
Load test data Finish...
Fit transform Finish...
Before loading model,PID now: 36914
PID now: 36914
Predicting Start....
test_target length is: 4
test_past_cov length is: 4
test_future_cov length is: 4
2024-06-13 10:59:55.449906: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-13 10:59:57.469600: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/pai/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/lib/x86_64-linux-gnu:/home/pai/jre/lib/amd64/server
2024-06-13 10:59:57.469701: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/pai/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/lib/x86_64-linux-gnu:/home/pai/jre/lib/amd64/server
2024-06-13 10:59:57.469711: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Predicting DataLoader 0: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 0.43it/s]
pred length is: 2
test_target length is: 4
predictions_flat length is: 2
predictions_flat is: [array([11.88638284, 18.42991849, 15.35395218, 9.7890236 , 9.40870716, 14.83239901, 17.23030781]), array([19.7808988 , 26.11059977, 25.23866697, 15.82095679, 16.52085847, 26.4049562 , 29.52338055])]
Getting the predicting result,PID now: 36914
pred length is: 2
test_target length is: 4
predictions_flat length is: 2
predictions_flat is: [array([1.14122187, 1.16840415, 1.16136717, 0.75003236, 0.883919 , 1.55813966, 1.71445884]), array([1.1921579 , 1.28492738, 1.45287655, 0.84124776, 0.91838262, 1.71547723, 1.79644351])]
Getting the predicting result,PID now: 36734

You can see that "test script start" appears twice, which means my script was run twice: I got two PIDs, and each PID produced its own separate prediction results.

Moreover, when using the GPU for predictions, it appears that an entirely new process is spawned, which then reruns my entire prediction script from start to end, yielding its own prediction results. It feels like this could be a potential bug in the framework, as I expect the GPU prediction to behave consistently with the CPU prediction, without initiating any additional processes or rerunning the script.
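
For context, this looks like the behaviour of spawn- or relaunch-based multi-process launchers in Python: each worker process re-imports (or re-executes) the main module, so any top-level code runs once per process. Below is a minimal, self-contained illustration of that general Python mechanism; it is only a sketch and is not taken from my actual script:

# Minimal illustration of the spawn/re-import behaviour, independent of Darts.
# With the "spawn" start method (one of the mechanisms multi-GPU launchers use),
# each child process re-imports this module, so the top-level print below fires
# once in the parent and once per child.
import os
import multiprocessing as mp

print(f"module imported in PID {os.getpid()}")

def worker(rank: int) -> None:
    print(f"worker {rank} running in PID {os.getpid()}")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    processes = [ctx.Process(target=worker, args=(r,)) for r in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

As far as I understand, Lightning's standard DDP strategy goes further and re-executes the whole script as __main__ in each worker, in which case even guarded code runs again; that would match the duplicated "test script start" lines above.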

Here are some steps to reproduce the issue:

import os

from darts import TimeSeries
from darts.models import TSMixerModel

# test_data, static_transformer, transform_dataset and args are defined earlier
# in the script and omitted here for brevity.
testing = TimeSeries.from_group_dataframe(
    test_data,
    time_col="time_idx",
    group_cols=["id1", "id2"],
    static_cols=[
        ...
    ],
    value_cols=[
        ...
    ],
    fill_missing_dates=True,
    freq=None,
    fillna_value=0,
)

testing_transformed = static_transformer.fit_transform(testing)
print('Fit transform Finish...')

pid = os.getpid()
print(f"Before loading model,PID now: {pid}")

test_target, test_past_cov, test_future_cov = transform_dataset(testing_transformed)

loaded_model = TSMixerModel.load_from_checkpoint(
    model_name='my_model',
    work_dir='my_dir',
    file_name="my_file",
    best=True,
    map_location=lambda storage, loc: storage.cuda(0),
)
# loaded_model.to_cpu()

pid = os.getpid()
print(f" PID now: {pid}")
print('Predicting Start....')
print('test_target length is:\n', len(test_target))
print('test_past_cov length is:\n', len(test_past_cov))
print('test_future_cov length is:\n', len(test_future_cov))

pred = loaded_model.predict(
    n=args.max_prediction_length,
    series=[ts[:28] for ts in test_target],
    past_covariates=[tpc[:28] for tpc in test_past_cov],
    future_covariates=test_future_cov,
    num_loader_workers=0,
    n_jobs=1,
)

Environment:

Darts version: 0.29.0
torch version: 1.13.1
CUDA version: 10.1.243
Operating System: Ubuntu 18.04.5

I've tried to ensure that all the code that should only run in the main process is correctly wrapped under an if __name__ == "__main__": guard, but the issue persists. I am not explicitly using any multiprocessing in my script.

This unexpected behavior occurs only when using the framework's GPU prediction, and not with CPU predictions. It would be great to get some insights into what might be causing this inconsistency and whether it's a known issue with a workaround.

Thank you for looking into this matter.

madtoinou commented 4 months ago

Hi @HoshinoHakumei,

Are you using several GPUs to run the inference? If so, maybe some of the workarounds mentioned in #2265 could solve the problem.

Since PyTorch Lightning handles the GPUs for Darts, I would investigate on that side.
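
If the duplication does come from Lightning picking up more than one GPU, one possible workaround (just a sketch, not verified against this exact setup) is to make a single device visible before torch/darts are imported, so that no multi-device strategy gets launched. The checkpoint arguments below simply mirror the ones from the report:

# Sketch of a single-GPU workaround, assuming the second process is spawned by a
# multi-GPU strategy. CUDA_VISIBLE_DEVICES must be set before torch/darts are
# imported so that PyTorch Lightning only ever sees one device.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from darts.models import TSMixerModel

loaded_model = TSMixerModel.load_from_checkpoint(
    model_name='my_model',    # same placeholder values as in the report above
    work_dir='my_dir',
    file_name="my_file",
    best=True,
    map_location="cuda:0",    # the single visible device
)

# Alternatively, if your darts version lets you pass a custom PyTorch Lightning
# trainer to predict(), restricting it to a single device should have the same effect:
# import pytorch_lightning as pl
# pred = loaded_model.predict(..., trainer=pl.Trainer(accelerator="gpu", devices=1))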