thuml / Autoformer

Code release for "Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting" (NeurIPS 2021), https://arxiv.org/abs/2106.13008
MIT License

The training process does not even start #81

Closed Daniel-Jiang358 closed 2 years ago

Daniel-Jiang358 commented 2 years ago

Hello, I am running this code on my server, but the training process won't start even though GPU usage looks normal.

Daniel-Jiang358 commented 2 years ago

[two screenshots attached] It has been stuck in this state for a very long time.

Daniel-Jiang358 commented 2 years ago

Other code runs normally on the same machine, but Autoformer's training never even starts.
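
For anyone hitting a similar symptom (process alive, GPU memory allocated, but no training iterations), a generic first step is to get a stack trace of the stuck process. Below is a minimal sketch using only the Python standard library; it is not part of the Autoformer codebase, and the signal and timeout choices are arbitrary:

```python
# Generic hang diagnostic: place near the top of the training entry
# script (e.g. run.py). Standard library only; not Autoformer code.
import faulthandler
import signal

# After this, `kill -USR1 <pid>` prints a traceback for every thread to
# stderr, showing where the process is stuck (data loading, GPU setup, ...).
# SIGUSR1 is Unix-only.
faulthandler.register(signal.SIGUSR1)

# Fallback: automatically dump all thread stacks every 10 minutes.
faulthandler.dump_traceback_later(600, repeat=True)
```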

wuhaixu2016 commented 2 years ago

Are you using your own dataset? It would be very helpful for us to check the code if you could provide a subset of it.

Daniel-Jiang358 commented 2 years ago

I have found the reason. But would you please explain why training on one A40 (NVLink) is much faster than on multiple A40s? I tried other cards, like the A5000 and the 3090, with the same result. When I test on the ECL dataset with A5000s, each iteration costs about 0.04 s on one card but about 3 s on 8 cards. I consider this abnormal.
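
Timings like these are easy to get wrong on CUDA because kernels launch asynchronously, so the clock should only be read after torch.cuda.synchronize(). A minimal sketch of how such per-iteration numbers can be measured in plain PyTorch (the helper name, warm-up count, and loss are my own illustrative choices, not from this repo):

```python
import time

import torch
import torch.nn as nn

def mean_iter_seconds(model, batch, target, iters=100):
    """Average forward+backward time per iteration for a model on CUDA."""
    loss_fn = nn.MSELoss()
    # Warm-up excludes one-off costs (cuDNN autotuning, allocator growth).
    for _ in range(10):
        loss_fn(model(batch), target).backward()
    # Synchronize before reading the clock: CUDA kernels run asynchronously,
    # so time.time() alone would mostly measure kernel-launch latency.
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        loss_fn(model(batch), target).backward()
    torch.cuda.synchronize()
    return (time.time() - start) / iters
```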

wuhaixu2016 commented 2 years ago

I think time series forecasting is a light task, so each iteration involves little computation. Multi-GPU training then adds per-step communication costs, which may dominate the iteration time.
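
To make this concrete, the sketch below reuses the mean_iter_seconds helper from above to compare per-iteration time on one GPU against nn.DataParallel on all visible GPUs, using a deliberately small toy model (hypothetical, not the Autoformer training loop, and assuming the multi-GPU path here goes through DataParallel). DataParallel scatters the batch, replicates the model, gathers the outputs, and synchronizes gradients on every step, so for a light workload this fixed per-step communication can dwarf the compute it parallelizes, consistent with the gap reported above:

```python
import torch
import torch.nn as nn

# Deliberately small model: light workloads are where multi-GPU
# communication overhead shows most clearly.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 512)).cuda()
batch = torch.randn(256, 512, device="cuda")
target = torch.randn(256, 512, device="cuda")

print(f"1 GPU: {mean_iter_seconds(model, batch, target):.4f} s/iter")

if torch.cuda.device_count() > 1:
    # Per-step scatter/replicate/gather plus gradient synchronization is
    # the extra communication cost in question.
    dp_model = nn.DataParallel(model)
    n = torch.cuda.device_count()
    print(f"{n} GPUs (DataParallel): {mean_iter_seconds(dp_model, batch, target):.4f} s/iter")
```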