google-research / smore

Evaluation get stuck #12

Open Juanhui28 opened 1 year ago

Juanhui28 commented 1 year ago

Hi,

It seems there is still a chance for the evaluation to get stuck. When we run train_shallow_wikikgv2.sh, training runs for 4799999 steps and then gets stuck in the evaluation. When we stop it with a keyboard interrupt, we get the following message:

[Screenshot 2022-10-13, 10:20:48 PM]

When we run train_concat_wikikgv2.sh, it gets stuck at the very first evaluation. When we stop it with a keyboard interrupt, it shows error messages similar to those from train_shallow_wikikgv2.sh.

[Screenshot 2022-10-13, 10:23:14 PM]

Could you please help to check? Any help is appreciated!

hyren commented 1 year ago

Hi, can you try running with a single GPU?

Juanhui28 commented 1 year ago

Hi,

We tried a single GPU with both train_shallow_wikikgv2.sh and train_concat_wikikgv2.sh; both got stuck in the evaluation. Thanks.

hyren commented 1 year ago

Just to make sure, have you pulled the latest change? What is the script you are running? We will look into this and reproduce.

Juanhui28 commented 1 year ago

Hi, yes we have already pulled the latest change. We are running train_shallow_wikikgv2.sh and train_concat_wikikgv2.sh in the training/vec_scripts folder. Thanks!

Hanjun-Dai commented 1 year ago

Hi there, I'm not sure if the GPU is compatible with the async op. Could you please try adding the --train_async_rw=False flag?

Juanhui28 commented 1 year ago

Hi, thank you for the follow-up. We added this flag to the script. We found there is still a chance for training to get stuck with multiple GPUs, but it runs fine with a single GPU. Thank you!

Hanjun-Dai commented 1 year ago

Really sorry for the back-and-forth! I guess it is mostly due to a compatibility issue with the customized kernel. Would you mind sharing the versions of your CUDA, PyTorch, and Python?
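
For anyone gathering the same details, here is a minimal sketch of one way to collect them from inside the training environment. The helper name `collect_versions` is hypothetical and not part of SMORE; it relies only on the standard `platform` module and on `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, and `torch.cuda.device_count()`. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the system toolkit used to compile a custom kernel.

```python
import platform


def collect_versions():
    """Return Python, PyTorch, and CUDA version details as a dict of strings.

    Hypothetical helper for illustration only; not part of the SMORE codebase.
    """
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # PyTorch is missing from this environment, so nothing more can be queried.
        info["pytorch"] = "not installed"
        info["cuda"] = "unknown"
        return info

    info["pytorch"] = torch.__version__
    # CUDA version PyTorch was built against; None for CPU-only builds.
    info["cuda"] = str(torch.version.cuda)
    info["cuda_available"] = str(torch.cuda.is_available())
    info["gpu_count"] = str(torch.cuda.device_count())
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Running this in the same environment that launches the training scripts avoids reporting versions from a different interpreter or virtual environment.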

Juanhui28 commented 1 year ago

Hi, the information is as follows:

CUDA: 11.6
PyTorch: 1.12.1
Python: 3.9