thuml / Large-Time-Series-Model

Official code, datasets and checkpoints for "Timer: Generative Pre-trained Transformers Are Large Time Series Models" (ICML 2024)
https://arxiv.org/abs/2402.02368
MIT License

Questions about patch length and augmentation #20

Closed: dyhan316 closed this issue 2 weeks ago

dyhan316 commented 3 weeks ago

Thank you so much for the wonderful paper! We are trying to apply your model (Timer-XL, to be exact) to our large iEEG dataset (~100B timepoints) to build a foundation model, and we have a few questions if you don't mind. Your insights would be greatly helpful in applying your model to our data!

  1. Why did you choose a patch length of 96? A patch length of 96 is quite long; are there things we should consider when choosing the optimal patch length (e.g., should we choose a length long enough that next-token prediction is hard)?

  2. What do you think of patching with multiple patch lengths? Recent papers such as MTST, Pathformer, and Medformer argue that using multiple patch sizes lets the model learn dynamics at different frequencies/granularities. Given that Timer uses a single patch size, we were wondering what you think about this. (Modeling various granularities is critical for us, as neural data is highly multi-scale, with meaningful frequencies ranging from 0.5 Hz to 250 Hz.)

  3. Did you pre-train a separate imputation model? It seems the patch lengths for imputation and pre-training are different (24 vs. 96). Does this mean you pre-trained a model specifically for imputation?

  4. What do you think about augmentation? Do you think augmentation is unnecessary when training with as much data as Timer uses, or could domain-specific augmentations still help? Would masking several segments during pre-training make the SSL task more difficult, and hence better (similar to how the model was pre-trained for imputation)?

Thank you in advance for your and your group's amazing work! It's really helping us get into time-series modeling for brain data!

WenWeiTHU commented 2 weeks ago

Hello, nice to see you again!

  1. The token length is set to 96 mainly to match the paper's benchmarks. You can try a longer token length depending on your prediction length, which can significantly reduce error accumulation (fewer autoregressive steps are needed: for example, forecasting 720 points takes 8 steps with a token length of 96 but 30 steps with a token length of 24).

  2. Of course. Given your prior knowledge about the frequencies in brain data, we strongly recommend multiple patch lengths, implemented with multiple embedding layers (see the embedding sketch after this list).

  3. Yes, but both models are pre-trained on UTSD. Since the patch requirements for forecasting and imputation are quite different, you can also train a new embedding layer on top of the forecasting checkpoint for other tasks (see the checkpoint-loading sketch below).

  4. I'm a little unsure about that, because I am not familiar with domain-specific augmentations. If you use the pre-trained model for non-generative tasks (e.g., anomaly detection, classification), we think masked modeling is also worth trying (see the masked-modeling sketch below).
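
As a rough illustration of point 2 (not the actual Timer code; the module name, patch lengths, and model dimension below are placeholders), one embedding layer per patch length could look like this, with tokens from every scale projected to a shared dimension and concatenated along the sequence axis:

```python
import torch
import torch.nn as nn

class MultiPatchEmbedding(nn.Module):
    """Hypothetical multi-scale embedding: one linear layer per patch length.

    Each branch splits the series into non-overlapping patches of its own
    length and projects them to a shared d_model, so tokens from all scales
    can be concatenated along the sequence dimension.
    """

    def __init__(self, patch_lens=(24, 48, 96), d_model=512):
        super().__init__()
        self.patch_lens = patch_lens
        self.embeddings = nn.ModuleList([nn.Linear(p, d_model) for p in patch_lens])

    def forward(self, x):
        # x: [batch, seq_len]; seq_len should be divisible by every patch length
        tokens = []
        for p, embed in zip(self.patch_lens, self.embeddings):
            patches = x.unfold(-1, p, p)      # [batch, seq_len // p, p]
            tokens.append(embed(patches))     # [batch, seq_len // p, d_model]
        return torch.cat(tokens, dim=1)       # all scales in one token sequence

# A 960-point window yields 40 + 20 + 10 = 70 tokens
out = MultiPatchEmbedding()(torch.randn(8, 960))
print(out.shape)  # torch.Size([8, 70, 512])
```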
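
For point 3, a minimal sketch of reusing the pre-trained backbone with a new embedding layer might look as follows. The `TinyTimerLike` class is only a stand-in so the snippet runs; the real Timer architecture and checkpoint keys differ:

```python
import torch
import torch.nn as nn

class TinyTimerLike(nn.Module):
    """Stand-in for the real model, just to make the sketch runnable."""
    def __init__(self, patch_len, d_model=512):
        super().__init__()
        self.embedding = nn.Linear(patch_len, d_model)   # patch embedding
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(d_model, patch_len)        # output projection

# Pretend this state_dict came from the forecasting checkpoint (patch_len=96).
forecast_state = TinyTimerLike(patch_len=96).state_dict()

# The new task uses patch_len=24: drop the embedding/head weights whose shapes
# no longer match and load only the backbone; strict=False skips the rest.
model = TinyTimerLike(patch_len=24)
backbone_only = {k: v for k, v in forecast_state.items()
                 if not k.startswith(("embedding", "head"))}
model.load_state_dict(backbone_only, strict=False)

# Train only the new embedding and head; keep the pre-trained backbone frozen.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("embedding", "head"))
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```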
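
For point 4, masked modeling can be as simple as zeroing out a random subset of patches and reconstructing them, with the loss computed only on the masked positions. This is just a sketch under assumed shapes and names, not how Timer was pre-trained for imputation:

```python
import torch
import torch.nn.functional as F

def masked_patch_loss(model, x, patch_len=96, mask_ratio=0.3):
    """Mask random patches of x ([batch, seq_len]) and reconstruct them."""
    patches = x.unfold(-1, patch_len, patch_len)   # [batch, n_patches, patch_len]
    b, n_patches = patches.shape[:2]

    # Boolean mask: True marks the patches hidden from the model.
    mask = torch.rand(b, n_patches, device=x.device) < mask_ratio
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)

    recon = model(corrupted)                       # [batch, n_patches, patch_len]
    return F.mse_loss(recon[mask], patches[mask])  # score masked patches only

# Toy usage: any module mapping [batch, n_patches, patch_len] to the same shape works.
toy = torch.nn.Sequential(torch.nn.Linear(96, 256), torch.nn.GELU(), torch.nn.Linear(256, 96))
print(masked_patch_loss(toy, torch.randn(4, 960)))
```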

Hope you find these answers helpful :)

dyhan316 commented 2 weeks ago

Thank you for your responses!

  1. Oh I see! It makes sense to avoid rolling prediction.

  2. I see! Are you by any chance working on multi-patch-size methods? I know the appendix of your Timer-XL paper says your next work will be multi-resolution patches! (I feel like simply using multiple patch sizes and then applying attention across them would greatly increase memory usage?)

  3. Oh I see! Thank you

  4. Hmmm, I see. Most of the downstream tasks in EEG are classification, so I guess we should also try masked modeling! I hope OpenLTM is updated to include masked pre-training too! :)

Thank you!