gzerveas / mvts_transformer

Multivariate Time Series Transformer, public version
MIT License

Solved: Minor Test Reproduction Issue #37

Closed · rishanb closed this 1 year ago

rishanb commented 1 year ago

Hi, I'm trying to reproduce the results from the paper and am running into a seemingly trivial error. I'd appreciate any help. The following two commands run with no issues:

python src/main.py --output_dir experiments --comment "pretraining through imputation" --name pretrained --records_file Imputation_records.xls --data_dir "datasets/Monash_UEA_UCR_Regression_Archive/AppliancesEnergy/" --data_class tsra --pattern TRAIN --val_ratio 0.2 --epochs 700 --lr 0.001 --optimizer RAdam  --pos_encoding learnable --num_layers 3  --num_heads 16 --d_model 128 --dim_feedforward 512 --batch_size 64

python src/main.py --output_dir experiments --comment "finetune for regression" --name finetuned --records_file Regression_records.xls --data_dir datasets/Monash_UEA_UCR_Regression_Archive/AppliancesEnergy/ --data_class tsra --pattern TRAIN --val_pattern TEST  --epochs 600 --lr 0.001 --optimizer RAdam --pos_encoding learnable  --load_model experiments/pretrained_2023-02-21_18-29-57_Ijb/checkpoints/model_best.pth --task regression --change_output --num_layers 3  --num_heads 16 --d_model 128 --dim_feedforward 512 --batch_size 64

When I try to run:

python src/main.py --output_dir experiments --comment "test" --name test  --data_dir datasets/Monash_UEA_UCR_Regression_Archive/AppliancesEnergy/ --data_class tsra  --load_model experiments/finetuned_2023-02-21_18-40-55_2J1/checkpoints/model_best.pth --pattern TEST --test_only testset --num_layers 3  --num_heads 16 --d_model 128 --dim_feedforward 512 --batch_size 64 --task regression

It seems as though total_samples = 0 somehow. Here is the full output (I added a print statement for total_samples):

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
2023-02-21 19:03:43,404 | INFO : Using device: cpu
2023-02-21 19:03:43,404 | INFO : Loading and preprocessing data ...
66it [00:00, 136.70it/s]
2023-02-21 19:03:43,998 | INFO : 33 samples may be used for training
2023-02-21 19:03:43,998 | INFO : 9 samples will be used for validation
2023-02-21 19:03:43,998 | INFO : 0 samples will be used for testing
2023-02-21 19:03:44,003 | INFO : Creating model ...
2023-02-21 19:03:44,006 | INFO : Model:
TSTransformerEncoderClassiregressor(
  (project_inp): Linear(in_features=24, out_features=128, bias=True)
  (pos_enc): FixedPositionalEncoding(
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerBatchNormEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=512, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=512, out_features=128, bias=True)
        (norm1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (norm2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
      (1): TransformerBatchNormEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=512, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=512, out_features=128, bias=True)
        (norm1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (norm2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
      (2): TransformerBatchNormEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=512, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=512, out_features=128, bias=True)
        (norm1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (norm2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (dropout1): Dropout(p=0.1, inplace=False)
  (output_layer): Linear(in_features=18432, out_features=1, bias=True)
)
2023-02-21 19:03:44,006 | INFO : Total number of parameters: 616449
2023-02-21 19:03:44,006 | INFO : Trainable parameters: 616449
Loaded model from experiments/finetuned_2023-02-21_18-40-55_2J1/checkpoints/model_best.pth. Epoch: 188
total_samples: 0
Traceback (most recent call last):
  File "src/main.py", line 307, in <module>
    main(config)
  File "src/main.py", line 196, in main
    aggr_metrics_test, per_batch_test = test_evaluator.evaluate(keep_all=True)
  File "/mvts_transformer/src/running.py", line 471, in evaluate
    epoch_loss = epoch_loss / total_samples  # average loss per element for whole epoch
ZeroDivisionError: division by zero
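
As an aside, the crash itself could fail more gracefully. A minimal sketch of a guard before the division in running.py's evaluate() (hypothetical, not the repository's actual code):

    if total_samples == 0:
        # An empty test split usually means the test-set arguments were not set
        # as expected, so report that instead of a bare ZeroDivisionError.
        raise ValueError("Test set is empty; check the --test_pattern and --test_only arguments")
    epoch_loss = epoch_loss / total_samples  # average loss per element for whole epoch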

FWIW, this is the path to the test file, and it is populated with data: datasets/Multivariate2018_ts/Multivariate_ts/SpokenArabicDigits/SpokenArabicDigits_TEST.ts

EDIT: Solved, silly typo: --pattern should be --test_pattern.
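
In hindsight the log also makes sense: with --pattern TEST, the 42 samples in the TEST file were loaded as training data and (presumably by the default validation ratio) split into 33 training and 9 validation samples, leaving 0 for testing. The corrected invocation, changing only that one flag, should then be:

python src/main.py --output_dir experiments --comment "test" --name test --data_dir datasets/Monash_UEA_UCR_Regression_Archive/AppliancesEnergy/ --data_class tsra --load_model experiments/finetuned_2023-02-21_18-40-55_2J1/checkpoints/model_best.pth --test_pattern TEST --test_only testset --num_layers 3 --num_heads 16 --d_model 128 --dim_feedforward 512 --batch_size 64 --task regression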