kaist-dmlab / DualTF


GPU Memory Overflow Issue on Larger GPU Device During Experiment #5

Closed · hswhan closed this issue 3 weeks ago

hswhan commented 1 month ago

Hello,

I am currently trying to replicate the experiments described in the paper using the provided scripts from the DualTF-main repository. The paper specifies that the experiments were conducted on an NVIDIA GeForce RTX 3090 24GB with CUDA Version 11. I am running the experiments on an NVIDIA RTX A6000 with 48GB of memory, using the script located at DualTF-main/shell script/PSM.sh. However, I encountered a GPU memory overflow issue. Here is the error message I received:

############ Arguments ############
{'gpu_id': '0', 'lr': 0.0001, 'num_epochs': 10, 'k': 5, 'seq_length': 720, 'nest_length': 360, 'input_c': 25, 'output_c': 25, 'step': 1, 'batch_size': 4, 'dataset': 'PSM', 'form': 'seasonal', 'model_save_path': 'checkpoints', 'anormly_ratio': 27.0, 'data_num': 0, 'data_loader': 'load_PSM'}
############ Print Option Items ############
anormly_ratio: 27.0
batch_size: 4
data_loader: load_PSM
data_num: 0
dataset: PSM
form: seasonal
gpu_id: 0
input_c: 25
k: 5
lr: 0.0001
model_save_path: checkpoints
nest_length: 360
num_epochs: 10
output_c: 25
seq_length: 720
step: 1
############################################
GPU_ID: 0
Traceback (most recent call last):
  File "main_freq.py", line 61, in <module>
    main(opts)
  File "main_freq.py", line 17, in main
    framework = FreqReconstructor(vars(opts))
  File "/home/src/sourcecode/DualTF-main/dualTF.py", line 443, in __init__
    self.build_model()
  File "/home/src/sourcecode/DualTF-main/dualTF.py", line 448, in build_model
    self.model = FrequencyTransformer(win_size=(self.seq_length-self.nest_length+1)*(self.nest_length//2), enc_in=self.input_c, c_out=self.output_c, e_layers=3)
  File "/home/src/sourcecode/DualTF-main/model/FrequencyTransformer.py", line 209, in __init__
    [
  File "/home/src/sourcecode/DualTF-main/model/FrequencyTransformer.py", line 212, in <listcomp>
    FrequencyAttention(win_size, False, attention_dropout=dropout, output_attention=output_attention), d_model, n_heads),
  File "/home/src/sourcecode/DualTF-main/model/FrequencyTransformer.py", line 79, in __init__
    self.distances = torch.zeros((window_size, window_size)).cuda()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.73 GiB. GPU 

From my review of other issues and the provided code, it appears that two parameters in main_freq.py are critical for GPU memory usage:

--seq_length 720\
--nest_length 360\
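Working backwards from the traceback, these two parameters fix the size of the distance matrix allocated in `FrequencyAttention`. A minimal back-of-the-envelope sketch (plain Python, no GPU needed; the `win_size` formula is copied from the `build_model` call in the traceback, and float32, i.e. 4 bytes per element, is assumed):

```python
# Rough memory estimate for the allocation that fails in
# FrequencyTransformer.py line 79: torch.zeros((window_size, window_size)).cuda()
# win_size formula copied from build_model in dualTF.py (see traceback above).

seq_length = 720
nest_length = 360
e_layers = 3  # from the FrequencyTransformer(...) call in the traceback

win_size = (seq_length - nest_length + 1) * (nest_length // 2)
print(win_size)  # 64980

gib = win_size ** 2 * 4 / 2**30  # float32 = 4 bytes per element
print(f"{gib:.2f} GiB per distance matrix")        # 15.73 GiB, matching the error
print(f"{e_layers * gib:.2f} GiB for all layers")  # ~47.19 GiB
```

So the failure does not look specific to the A6000: if each of the three encoder layers allocates its own matrix, as the list comprehension in the traceback suggests, the distance matrices alone would need roughly 47 GiB before any model weights or activations are counted.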

I attempted to remove these parameters to fall back to the default settings, but the results deviated significantly from those reported in the paper. Here is the output I received:

$ python3 evaluation.py    --data_num 0     --dataset PSM     --data_loader load_PSM     --seq_length 720
Dataset: PSM
Num: 0
Seq_length: 720
Nest_length: 10
Time Arrays Loading...
(7, 87841)
         0      1      2      3      4      5      6      7      8      9      10     11     12     13     14     15     ...  87825  87826  87827  87828  87829  87830  87831  87832  87833  87834  87835  87836  87837  87838  87839  87840
Normal     1.0    2.0    3.0    4.0    5.0    6.0    7.0    8.0    9.0   10.0   11.0   12.0   13.0   14.0   15.0   16.0  ...   16.0   15.0   14.0   13.0   12.0   11.0   10.0    9.0    8.0    7.0    6.0    5.0    4.0    3.0    2.0    1.0
Anomaly    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
#Seq       1.0    2.0    3.0    4.0    5.0    6.0    7.0    8.0    9.0   10.0   11.0   12.0   13.0   14.0   15.0   16.0  ...   16.0   15.0   14.0   13.0   12.0   11.0   10.0    9.0    8.0    7.0    6.0    5.0    4.0    3.0    2.0    1.0
Pred(%)    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
Pred       0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
GT         0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  ...    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    0.0    0.0    0.0    0.0    0.0    0.0
Avg(RE)    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0

[7 rows x 87841 columns]
Frequency Arrays Loading...
(5, 87841)
              0      1      2      3      4      5          6          7          8          9          10         11         12     ...  87828  87829  87830  87831  87832  87833  87834  87835  87836  87837  87838  87839  87840
#SubSeq         1.0    3.0    6.0   10.0   15.0   21.0  28.000000  36.000000  45.000000  55.000000  66.000000  78.000000  91.000000  ...   91.0   78.0   66.0   55.0   45.0   36.0   28.0   21.0   15.0   10.0    6.0    3.0    1.0
#GrandSeq       1.0    2.0    3.0    4.0    5.0    6.0   7.000000   8.000000   9.000000  10.000000  11.000000  12.000000  13.000000  ...   13.0   12.0   11.0   10.0    9.0    8.0    7.0    6.0    5.0    4.0    3.0    2.0    1.0
Avg(exp(RE))    1.0    1.0    1.0    1.0    1.0    1.0   1.000001   1.000001   1.000002   1.000002   1.000002   1.000002   1.000002  ...    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0
Pred            0.0    0.0    0.0    0.0    0.0    0.0   0.000000   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
GT              0.0    0.0    0.0    0.0    0.0    0.0   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000  ...    1.0    1.0    1.0    1.0    1.0    1.0    1.0    0.0    0.0    0.0    0.0    0.0    0.0

[5 rows x 87841 columns]
##### Point Adjusted Evaluation #####
Threshold Range: (0.0, 1.0) with Step Size: 0.001
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1001/1001 [35:47<00:00,  2.15s/it]
Threshold: 0.001
Precision : 0.6306, Recall : 0.9649, F-score : 0.7627 
PR-AUC : 0.8504, ROC-AUC : 0.7887
               f1  precision    recall    pr_auc   roc_auc
dataset                                                   
ECG1     0.762691   0.630561  0.964874  0.850356  0.788735
##### Point-Wise Evaluation #####
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1001/1001 [03:51<00:00,  4.32it/s]
Threshold: 0.001
Precision : 0.4792, Recall : 0.4382, F-score : 0.4578 
PR-AUC : 0.4099, ROC-AUC : 0.6271
               f1  precision    recall    pr_auc   roc_auc
dataset                                                   
ECG1     0.457761   0.479187  0.438169  0.409946  0.627077
##### Released Point-Wise Evaluation #####
Threshold Range: (0.0, 1.0) with Step Size: 0.001
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1001/1001 [11:14<00:00,  1.48it/s]
Threshold: 0.001
Precision : 0.5119, Recall : 0.5742, F-score : 0.5413 
PR-AUC : 0.4945, ROC-AUC : 0.6785
               f1  precision    recall    pr_auc   roc_auc
dataset                                                   
ECG1     0.541267    0.51189  0.574221  0.494495  0.678508
##### Released Point-Wise Evaluation V2 #####
Threshold Range: (0.0, 1.0) with Step Size: 0.001
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1001/1001 [01:07<00:00, 14.89it/s]
Threshold: 0.001
Precision : 0.5126, Recall : 0.5770, F-score : 0.5429 
PR-AUC : 0.4960, ROC-AUC : 0.6798
               f1  precision    recall    pr_auc   roc_auc
dataset                                                   
ECG1     0.542916   0.512647  0.576985  0.495951  0.679788
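(For context on why the point-adjusted numbers above are so much higher than the point-wise ones: as I understand it, the point-adjusted protocol counts an entire ground-truth anomaly segment as detected if any single point inside it is flagged. A minimal illustrative sketch of that rule, following the usual convention rather than the repository's exact code:)

```python
import numpy as np

def point_adjust(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Standard point-adjust rule: if any point inside a ground-truth
    anomaly segment is predicted anomalous, mark the whole segment."""
    adjusted = pred.copy()
    i = 0
    while i < len(gt):
        if gt[i] == 1:
            j = i
            while j < len(gt) and gt[j] == 1:  # scan to the segment's end
                j += 1
            if adjusted[i:j].any():            # any hit inside the segment
                adjusted[i:j] = 1              # credit the whole segment
            i = j
        else:
            i += 1
    return adjusted

# Toy example: one anomaly segment of length 4, only one point detected.
gt   = np.array([0, 0, 1, 1, 1, 1, 0])
pred = np.array([0, 0, 0, 1, 0, 0, 0])
print(point_adjust(pred, gt))  # [0 0 1 1 1 1 0]
```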

Tables 2 and 3 of the paper report 0.723, 0.7735, and 0.6304 for F1, AUC_ROC, and AUC_PR on the PSM dataset. I have a couple of questions:

1. I am hitting a memory overflow with the provided scripts on a GPU that has more memory than the one in the paper. Was the device used for the paper's experiments different from the one described in Table 5, or is this a parameter configuration issue?
2. How should I choose the parameters seq_length and nest_length to reproduce results close to those presented in the paper?

Any guidance on these issues would be greatly appreciated. Thank you for your time and assistance.

Best regards

young-eun-nam commented 1 month ago

We appreciate your interest in our work. The common answer to questions 1 and 2 is that this appears to be a parameter configuration issue, specifically the window size. In this work, nest_length is a very important parameter according to our theory (Appendix B). As shown in Table 1, you should follow the $w^{inner}$ size for each dataset. In principle, the best performance comes from our suggested window sizes, but if you cannot run with the provided parameter sizes, we recommend reducing the nested window size to match the collection frequency of each dataset and, following Figure 7, setting the outer window size to twice the nested window size. Try nested window sizes of 0.2, 0.4, or 0.5 times 360, etc.

That is, for nest_length values of 72, 144, 180, etc., the matching outer window lengths (seq_length) should be 144, 288, 360, etc.
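For a rough sense of the memory impact, the sketch below (reusing the win_size formula from build_model in dualTF.py, with float32 elements assumed) estimates the per-layer distance-matrix footprint for these pairs:

```python
# Distance-matrix footprint for candidate (seq_length, nest_length) pairs,
# using the win_size formula from build_model in dualTF.py (float32 assumed).

def distance_matrix_gib(seq_length: int, nest_length: int) -> float:
    win_size = (seq_length - nest_length + 1) * (nest_length // 2)
    return win_size ** 2 * 4 / 2**30  # 4 bytes per float32 element

for seq, nest in [(144, 72), (288, 144), (360, 180), (720, 360)]:
    print(f"seq_length={seq:3d}, nest_length={nest:3d} -> "
          f"{distance_matrix_gib(seq, nest):7.3f} GiB")

# seq_length=144, nest_length= 72 ->   0.026 GiB
# seq_length=288, nest_length=144 ->   0.406 GiB
# seq_length=360, nest_length=180 ->   0.989 GiB
# seq_length=720, nest_length=360 ->  15.729 GiB  (the setting that failed)
```

Assuming main_freq.py accepts the same --seq_length and --nest_length flags shown in the argument dump above, a setting such as --seq_length 360 --nest_length 180 keeps the per-layer matrix under 1 GiB and should fit comfortably on a 24 GB card.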