thuml / Autoformer

Code release for "Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting" (NeurIPS 2021), https://arxiv.org/abs/2106.13008
MIT License

The training process does not even start #81

Closed Daniel-Jiang358 closed 2 years ago

Daniel-Jiang358 commented 2 years ago

Hello, I am running this code on my server, but the training process won't start even though GPU usage looks normal.

Daniel-Jiang358 commented 2 years ago

[two screenshots attached] It has been stuck in this state for a very long time.

Daniel-Jiang358 commented 2 years ago

Other code runs normally on the same machine, but Autoformer's training never even starts.
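
For anyone hitting a similar symptom (process alive, GPU memory allocated, but no training iterations), a generic first step is to get a stack trace of the stuck process. Below is a minimal sketch using only the Python standard library; it is not part of the Autoformer codebase, and the signal and timeout choices are arbitrary:

```python
# Generic hang diagnostic: place near the top of the training entry
# script (e.g. run.py). Standard library only; not Autoformer code.
import faulthandler
import signal

# After this, `kill -USR1 <pid>` prints a traceback for every thread to
# stderr, showing where the process is stuck (data loading, GPU setup, ...).
# SIGUSR1 is Unix-only.
faulthandler.register(signal.SIGUSR1)

# Fallback: automatically dump all thread stacks every 10 minutes.
faulthandler.dump_traceback_later(600, repeat=True)
```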

wuhaixu2016 commented 2 years ago

Are you using your own dataset? It would be very helpful for us to check the code if you could provide a subset of it.

Daniel-Jiang358 commented 2 years ago

I have found the reason. But would you please explain why training on one A40 (NVLink) is much faster than on multiple A40s? I tried other cards, like the A5000 and the 3090, with the same result. When I test on the ECL dataset with A5000s, each iteration costs about 0.04 s on one card but about 3 s on 8 cards. I consider this abnormal.
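
Timings like these are easy to get wrong on CUDA because kernels launch asynchronously, so the clock should only be read after torch.cuda.synchronize(). A minimal sketch of how such per-iteration numbers can be measured in plain PyTorch (the helper name, warm-up count, and loss are my own illustrative choices, not from this repo):

```python
import time

import torch
import torch.nn as nn

def mean_iter_seconds(model, batch, target, iters=100):
    """Average forward+backward time per iteration for a model on CUDA."""
    loss_fn = nn.MSELoss()
    # Warm-up excludes one-off costs (cuDNN autotuning, allocator growth).
    for _ in range(10):
        loss_fn(model(batch), target).backward()
    # Synchronize before reading the clock: CUDA kernels run asynchronously,
    # so time.time() alone would mostly measure kernel-launch latency.
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        loss_fn(model(batch), target).backward()
    torch.cuda.synchronize()
    return (time.time() - start) / iters
```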

wuhaixu2016 commented 2 years ago

I think time series forecasting is a light task, so each iteration involves little computation. Multi-GPU training then adds per-step communication costs, which may dominate the iteration time.
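
To make this concrete, the sketch below reuses the mean_iter_seconds helper from above to compare per-iteration time on one GPU against nn.DataParallel on all visible GPUs, using a deliberately small toy model (hypothetical, not the Autoformer training loop, and assuming the multi-GPU path here goes through DataParallel). DataParallel scatters the batch, replicates the model, gathers the outputs, and synchronizes gradients on every step, so for a light workload this fixed per-step communication can dwarf the compute it parallelizes, consistent with the gap reported above:

```python
import torch
import torch.nn as nn

# Deliberately small model: light workloads are where multi-GPU
# communication overhead shows most clearly.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 512)).cuda()
batch = torch.randn(256, 512, device="cuda")
target = torch.randn(256, 512, device="cuda")

print(f"1 GPU: {mean_iter_seconds(model, batch, target):.4f} s/iter")

if torch.cuda.device_count() > 1:
    # Per-step scatter/replicate/gather plus gradient synchronization is
    # the extra communication cost in question.
    dp_model = nn.DataParallel(model)
    n = torch.cuda.device_count()
    print(f"{n} GPUs (DataParallel): {mean_iter_seconds(dp_model, batch, target):.4f} s/iter")
```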