gzerveas / mvts_transformer

Multivariate Time Series Transformer, public version
MIT License

Testing procedure #11

Open Woutah opened 2 years ago

Woutah commented 2 years ago

First of all, thank you for providing such a complete public implementation of your work. In the paper you mention that "After fixing the hyperparameters, the entire training set was used to train the model again, which was finally evaluated on the official test set." Could you explain how this final training procedure (on the entire training set) was performed?

Was a predefined number of epochs used to train the model, after which it was evaluated on the test set? Or was the test set used as a validation set?

Thanks in advance.

gzerveas commented 2 years ago

Yes, you can consider the number of epochs a "hyperparameter". Once you find out what it should be for each dataset, based on the original validation split, you use this predesignated number to train the model on the entire training set. After training, you can use the --test_only option to evaluate on the test set. However, in practice it can be more convenient (i.e. it spares you a run) to define the test set as the validation set for this last training session (using e.g. --val_pattern TEST) and simply read out the evaluation performance for this "validation set". This can also be interesting if you want to look into robustness: even if you allow training to progress longer, you can check what the performance was at the predesignated number of epochs, and see whether a substantially better performance on the test set was actually recorded earlier or later during training. For most datasets, I think this probably wouldn't be the case.
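
For concreteness, here is a minimal sketch of the two ways to run this final step. Only --val_pattern TEST and --test_only are taken from this thread; the remaining option names, values, and paths are assumptions about the repository's command-line interface and should be checked against src/options.py:

# Option A (assumed flags besides --val_pattern): treat the official TEST split as the
# "validation" set during the final training run and read off its metrics at the
# predesignated epoch.
python src/main.py --data_dir path/to/Dataset --data_class tsra --task classification \
    --pattern TRAIN --val_pattern TEST --epochs 100  # 100 stands in for the predesignated epoch count

# Option B (assumed syntax besides --test_only): train without touching the test set,
# then evaluate a saved checkpoint on it in a separate run.
python src/main.py --data_dir path/to/Dataset --data_class tsra --task classification \
    --test_pattern TEST --test_only testset --load_model path/to/checkpoint.pth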

Woutah commented 2 years ago

Thank you for your quick response. Do the accuracies reported in the paper correspond to the maximum performance on the test set (in this case the validation set) during this last training session?

gzerveas commented 2 years ago

No, as I wrote above, they should correspond to the predesignated number of epochs - and the hope is that this would anyway be close to the maximum performance.

Woutah commented 2 years ago

I am having some trouble when training on the multivariate classification datasets, which is why I asked, just to be sure. Would it be possible to provide the hyperparameters that were used during training? In particular, the learning rate and batch sizes used would probably help me out a lot, as I am experiencing some instability during training.

gzerveas commented 2 years ago

Sure, these tables with hyperparameters are from the KDD paper:

[Images: hyperparameter tables from the KDD paper]

Regarding the learning rate, as far as I remember it was always set to 0.001 (the main reason for using RAdam was to make training insensitive to the learning rate). The batch size for most datasets was 128, and for some I believe 64 or 32. Are you interested in a particular dataset? I can try to find the configuration file, which contains the full set of hyperparameters.
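
In case it helps, a hypothetical training invocation reflecting those values might look like the following; only the learning rate of 0.001, RAdam, and the batch sizes come from this thread, while the flag names, paths, and epoch count are assumptions to be checked against src/options.py:

# Assumed training command for a multivariate classification dataset:
# lr 0.001 with RAdam (as stated above) and batch size 128 (64 or 32 for some datasets).
python src/main.py --data_dir path/to/SomeDataset --data_class tsra --task classification \
    --optimizer RAdam --lr 0.001 --batch_size 128 --epochs 400 \
    --pattern TRAIN --val_pattern TEST  # the epoch count is dataset-specific; 400 is a placeholder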

donghucey commented 2 years ago

I am trying to train on the AppliancesEnergy dataset. Could you share the configuration file for it?

Woutah commented 1 year ago

Sorry for my late response. I am still working on this project and am currently re-running some experiments; I think my earlier problem was caused by batch sizes that were too small.

Do you perhaps have a list of the (approximate) batch sizes and epoch counts used in the experiments for the supervised multivariate classification task? I'd like to reproduce all classification-dataset experiments as closely as possible to the paper.

Woutah commented 1 year ago

I am currently struggling with the configuration for SCP2. I tried batch sizes of 32, 64, and 96, but I am unable to get stable training that reaches the accuracy reported in the paper. Any help would be greatly appreciated.

gzerveas commented 1 year ago

On this dataset I got the best results (in the self-supervised-followed-by-fine-tuning case) when using a sub-sampling factor of 3 (via the --subsample_factor option), which could potentially make a big difference, and a batch size of 32. But I think you may be right that this dataset in particular shows instability during training; the evaluation on the validation set fluctuated around 0.6, and depending on the hyperparameter configuration, or even the individual run, this level was reached very early on or only after ~600 epochs. In the end I chose an intermediate value for the number of training epochs, like 100.
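
Putting that together, a sketch of an SCP2 run with those settings could look like this; --subsample_factor 3, a batch size of 32, and roughly 100 epochs are from this comment, and everything else (flag names, dataset path, data class) is an assumption about the command-line interface:

# Assumed SCP2 (SelfRegulationSCP2) training command using the settings described above.
python src/main.py --data_dir path/to/SelfRegulationSCP2 --data_class tsra \
    --task classification --subsample_factor 3 --batch_size 32 --epochs 100 \
    --pattern TRAIN --val_pattern TEST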

Woutah commented 1 year ago

Thanks for your help. I'll try that and report back. Do you recall using subsampling on any other classification datasets?

I had a couple of (multivariate classification) datasets for which I did not get the same performance with the default parameters. If you happen to have a list giving a general overview of which parameters were used for which datasets, that would probably make it a lot easier to reproduce the results.

One last question: how was Table 9 (with the standard deviations) constructed? Are these the test accuracies over some number of training runs using the same configuration?

Woutah commented 1 year ago

I attached an image here of several runs using the settings from above with the validation set equal to the test set (so the "optimal" situation), but I'm still not reaching an accuracy of ~0.6. Can you think of any other configuration changes?