gzerveas / mvts_transformer

Multivariate Time Series Transformer, public version

Problem running Test_only mode #24

Open stasj145 opened 1 year ago

stasj145 commented 1 year ago

Hi George, really like the project! I have been trying it out for a couple of weeks now, training multiple models, including some with my own datasets. While training works without any problems, I have not been able to get the test_only mode running. I keep getting this error:

    per_batch['predictions'].append(predictions.cpu().numpy())
    RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
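For reference, the error can be reproduced outside the repo in a few lines (a minimal standalone sketch, not code from this project):

    import torch

    pred = torch.randn(3, requires_grad=True)
    pred.numpy()  # RuntimeError: Can't call numpy() on Tensor that requires grad.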

I have used the following commands:

Training:

    python src/main.py --output_dir .\experiments --comment "regression from Scratch" --name custom_regression --records_file Regression_records.xls --data_dir ..\Datasets\CUSTOM --data_class tsra --pattern TRAIN --val_pattern TEST --epochs 100 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task regression

Testing (not working):

    python src/main.py --output_dir .\experiments --comment "regression from Scratch" --name Custom_regression --records_file Regression_records.xls --data_dir ..\Datasets\CUSTOM --data_class tsra --pattern TRAIN --val_pattern TEST --epochs 100 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task regression --test_pattern TEST --test_only testset --load_model ./experiments/custom_regression_2022-10-20_17-05-04_MjH/checkpoints/model_best.pth

I have also tried the exact commands mentioned in this issue, which seem to work for the user who opened it, yet I still get the same error.

I have tested with both Python 3.7 and 3.8, with the normal requirements.txt as well as the failsafe_requirements.txt (using Anaconda).

At this point I am unsure what I am doing wrong and what else to try to get the test_only mode working.

gzerveas commented 1 year ago

Hi,

Thanks for discovering this bug! I am not sure why this was working before and not now (maybe a combination of the specific configuration you tried and how different torch versions handle things), but the solution is thankfully very simple. The problem with the existing code is that the output nodes are still part of the computational graph used for backpropagating loss gradients, which is not actually needed here: we don't want to update parameters, we only use the predictions for evaluation. There are two ways of fixing this. The best way is to use the context manager with torch.no_grad(): to wrap the whole for loop of the model evaluation, above line 331 and line 445, like this:

    with torch.no_grad():
        for i, batch in enumerate(self.dataloader):
            ...
            epoch_loss += batch_loss  # add total loss of batch

To keep it consistent with how validation is done, instead of changing the evaluate functions internally, you can even more simply wrap the call in main.py at line 196, like this:

    with torch.no_grad():
        aggr_metrics_test, per_batch_test = test_evaluator.evaluate(keep_all=True)

This should be enough, but if for whatever reason it doesn't work, you can use the second way: call .detach() before converting, to forcefully detach the output nodes from the computational graph, like this: per_batch['predictions'].append(predictions.detach().cpu().numpy()). (Keeping the .cpu() call matters when the model runs on a GPU, since .numpy() only works on CPU tensors.)
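For illustration, both approaches behave the same way on a toy tensor (a standalone sketch, independent of the repo's code):

    import torch

    pred = torch.randn(3, requires_grad=True)

    # Way 1: results computed inside torch.no_grad() are not tracked by autograd
    with torch.no_grad():
        out = pred * 2
    print(out.requires_grad)  # False, so out.cpu().numpy() succeeds

    # Way 2: explicitly cut the graph link before converting
    arr = pred.detach().cpu().numpy()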

I will push a fix sometime soon, but try it and let me know how it works for you.

stasj145 commented 1 year ago

Thanks for the quick reply! I have now tried out your recommended fixes. For whatever reason, your first idea of adding with torch.no_grad(): to the evaluate function didn't end up fixing the problem. That didn't surprise me much, as I had already tried something very similar on my own, but I don't really know why it didn't work, because adding with torch.no_grad(): to main.py at line 196, as per your second suggestion, did fix the problem.

I did end up running into another small issue after that fix, though. In line 199, print_str += '{}: {:8f} | '.format(k, v), v was None for the key epoch, leading to one of those format-None errors. I saw that you sometimes check for this with if v is not None:, like in line 177 of running.py, so I just added that.
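Concretely, the guard looks roughly like this (a sketch; I am assuming the loop iterates over the aggregated test metrics dict, so adjust the variable names to match main.py):

    for k, v in aggr_metrics_test.items():
        if v is not None:  # 'epoch' can be None in test_only mode
            print_str += '{}: {:8f} | '.format(k, v)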

With those changes the test_only mode now works flawlessly for me!

jingzbu commented 1 year ago

@stasj145 Thanks. I encountered the same issues with my own data and solved them with exactly the same fixes.

richarddli commented 10 months ago

I can confirm as well that this fixes the issue. I've pushed the recommended changes to my fork here: https://github.com/richarddli/mvts_transformer/tree/sktime0.22, which also has some minor patches to run on modern sktime etc. (see this draft: https://github.com/gzerveas/mvts_transformer/pull/56).