thuml / Time-Series-Library

A Library for Advanced Deep Time Series Models.
MIT License

Cannot reproduce the classification result of TimesNet #494

Closed ArmandXiao closed 2 months ago

ArmandXiao commented 2 months ago

> Hi everyone, I did a preliminary comparison between our experimental code and TSLib. The main difference lies in the learning rate strategy: while organizing the library, we added an extra learning rate decay step. However, this design reduces the fluctuation of model training, which is actually unfavorable for some datasets with little data, so we have removed it in a recent commit (https://github.com/thuml/Time-Series-Library/commit/1c7f843aee8ce75e758d45f0bd10c02516e36a92). With this change I can reproduce the results on my side; please try again.

As mentioned in that comment, I added the two lines from commit 1c7f843.

However, I am still unable to reproduce the results. Moreover, the average result dropped after applying the amendment from the commit.
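For context, the learning-rate decay under discussion is a simple per-epoch schedule applied during training. A minimal sketch (the function name and halving factor are illustrative, modeled on the `type1` option in TSLib's `utils/tools.py`; removing the decay, as in commit 1c7f843, amounts to keeping the base rate throughout):

```python
def adjust_learning_rate(base_lr: float, epoch: int) -> float:
    """Halve the learning rate every epoch (a 'type1'-style schedule).

    Epochs are 1-indexed, matching the training loops in the exp_* scripts.
    Without this decay, every epoch simply uses base_lr.
    """
    return base_lr * (0.5 ** (epoch - 1))

# Illustrative schedule over five epochs.
for epoch in range(1, 6):
    print(f"epoch {epoch}: lr = {adjust_learning_rate(1e-3, epoch):.2e}")
```

On small datasets, such aggressive decay freezes the model early, which matches the maintainer's observation that the decay can hurt data-limited tasks.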

Here are my results using TimesNet:

| Dataset | Table 17 | Reproduce | commit 1c7f843 |
| --- | --- | --- | --- |
| EthanolConcentration | 35.7 | 28.9 | 28.1 |
| FaceDetection | 68.6 | 66.3 | 68.0 |
| Handwriting | 32.1 | 31.8 | 17.4 |
| Heartbeat | 78.0 | 77.1 | 76.1 |
| JapaneseVowels | 98.4 | 97.3 | 93.0 |
| PEMS-SF | 89.6 | 86.7 | 75.7 |
| SelfRegulationSCP1 | 91.8 | 89.8 | 90.1 |
| SelfRegulationSCP2 | 57.2 | 51.1 | 52.2 |
| SpokenArabicDigits | 99.0 | 99.2 | 98.8 |
| UWaveGestureLibrary | 85.3 | 88.1 | 85.6 |
| Avg | 73.6 | 71.6 | 68.5 |

Thank you for your help.

wuhaixu2016 commented 2 months ago

Many thanks for your detailed reproduction and for pointing out the problem with the learning-rate scheduling strategy.

(1) As I stated in the previous issue, some UEA datasets suffer from severely limited data, so their performance can be unstable. For example, in my experimental environment (without the learning-rate scheduling strategy), the Handwriting accuracy is 0.33647058823529413. Here is the training log for this task.

Handwritting.log

(2) To clarify, I will publish the training checkpoints from my experiments within two weeks.

eiriksteen commented 2 months ago

I have the same problem; I am not able to reproduce the results. How can we get past this when training our own models? I have a model that surpasses the TimesNet results I was able to reproduce, but not the ones in the paper. How can I be sure my model is not being trained in a suboptimal way, leading to underestimated metrics?

In general, why aren't the metrics computed over multiple runs, with the mean and standard deviation being the final reported values?
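The multiple-run protocol asked about here is straightforward to implement; a minimal sketch using only the standard library (the run accuracies are illustrative numbers, not results from the paper):

```python
import statistics

def summarize_runs(accuracies: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) over repeated runs."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std

# e.g. five runs of one task with different random seeds (made-up values)
runs = [32.1, 31.8, 33.6, 30.9, 32.5]
mean, std = summarize_runs(runs)
print(f"accuracy = {mean:.2f} ± {std:.2f}")
```

Reporting mean ± std this way makes it clear whether a gap like 35.7 vs. 31.94 falls inside or outside the run-to-run noise.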

wuhaixu2016 commented 2 months ago

Many thanks for your question and valuable discussion. I have uploaded the checkpoint files and training log here: https://cloud.tsinghua.edu.cn/d/caefcdb63eee4adfad86/

Here is the summary of our experiments (classification.log):

| Dataset | Table 17 | Our Exp |
| --- | --- | --- |
| EthanolConcentration | 35.7 | 31.94 |
| FaceDetection | 68.6 | 67.45 |
| Handwriting | 32.1 | 32.47 |
| Heartbeat | 78.0 | 80.97 |
| JapaneseVowels | 98.4 | 97.84 |
| PEMS-SF | 89.6 | 88.44 |
| SelfRegulationSCP1 | 91.8 | 91.46 |
| SelfRegulationSCP2 | 57.2 | 60.00 |
| SpokenArabicDigits | 99.0 | 98.95 |
| UWaveGestureLibrary | 85.3 | 88.13 |
| Avg | 73.6 | 73.76 |

(1) The inconsistency between Table 17 and Our Exp

As stated in the previous issue https://github.com/thuml/Time-Series-Library/issues/321#issuecomment-2227175691 , our original experimental code was based on this repo: https://github.com/thuml/Flowformer . To make the open-sourced code easy to read, I spent two weeks reorganizing it and unifying five tasks in a shared code base, namely TSLib. During that reorganization I may have lost some details, such as the learning rate strategy, which is fixed in this commit https://github.com/thuml/Time-Series-Library/commit/1c7f843aee8ce75e758d45f0bd10c02516e36a92 (although I do remember verifying that all the results could be reproduced before I made this repo public).

In my current experiments, the averaged accuracy can be reproduced (slightly better than the original paper). The only task that fails is EthanolConcentration (35.7 vs. 31.94). I plan to go back to my original code base and compare the training in every detail. If I obtain new results, I will update them here, which may take some time.

(2) About the performance variance.

I have run multiple seeds and reported the std in our paper, which is around 0.1% for the average performance. The small subsets are each affected differently by the random seed, but these effects largely cancel out across tasks, so the final averaged performance is fairly stable.
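For seed-controlled repeated runs of this kind, all relevant random number generators need to be seeded before each run. A minimal sketch (TSLib's `run.py` does something similar; the torch seeding is guarded here so the snippet also runs where PyTorch is not installed):

```python
import random

import numpy as np

def fix_seed(seed: int = 2021) -> None:
    """Seed Python, NumPy, and (if available) PyTorch RNGs for one run."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # pure-NumPy experiments still get deterministic sampling

# Two runs with the same seed draw identical random numbers.
fix_seed(0)
a = random.random()
fix_seed(0)
b = random.random()
assert a == b
```

Note that full determinism on GPU additionally depends on backend settings (e.g. cuDNN), so small run-to-run differences can remain even with fixed seeds.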

To avoid the high-variance tasks, I would suggest omitting EthanolConcentration, Handwriting and UWaveGestureLibrary and trying some EEG datasets, which we experimented with in this paper: https://arxiv.org/abs/2402.02475 .

Sorry for the inconvenience. If you have any questions, please email me or open an issue in the repo.

eiriksteen commented 2 months ago

Thank you for the thorough response!