google-research / smore

Apache License 2.0

Training get stuck #10

Open AprLie opened 1 year ago

AprLie commented 1 year ago

Hi,

thanks for developing a useful tool for training on large-scale KGs. However, when I use smore to train models like ComplEx or TransE on wikikgv2, there is about a 50% chance that it gets stuck in the training step (i.e., after loading the data; this can happen before or after the checkpoint save steps). Have you encountered this issue?

BTW, I can only find training scripts for TransE and ComplEx, but there are 4 other KGE models. I wonder why they are not trained on wikikgv2. Is there anything I need to pay attention to when writing training scripts for them?

Many thanks and look forward to your reply.

hyren commented 1 year ago

Hi, thanks for your interest. As for getting stuck, do you mean getting stuck right after data loading and before training, or during training? Any pointers on lines that get stuck / might cause the problem will be extremely helpful for us to check.

We provide TransE and ComplEx as example baselines for wikikgv2. We will support RotatE and DistMult later as well.
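For anyone trying to report the lines where a run is stuck, one option is Python's standard `faulthandler` module. The following is a minimal sketch, not part of smore: the helper name `enable_hang_diagnostics` and its parameters are hypothetical, and it assumes you can call it once near the start of each training process.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s: float = 600.0, stream=sys.stderr) -> None:
    """Hypothetical helper: print Python tracebacks when a process looks stuck.

    `stream` must be a real file (it needs a file descriptor) and must stay
    open for as long as the diagnostics are active.
    """
    # Dump tracebacks of all threads on fatal signals such as SIGSEGV or SIGBUS.
    faulthandler.enable(file=stream, all_threads=True)

    # Dump tracebacks of all threads every `timeout_s` seconds until cancelled
    # with faulthandler.cancel_dump_traceback_later().
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)

    # On Unix, `kill -USR1 <pid>` then dumps the tracebacks on demand.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
```

The dump only shows Python frames, so a hang inside native code appears as the Python line that made the call. Pick a timeout longer than a normal validation pass, otherwise healthy runs will also print tracebacks.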

AprLie commented 1 year ago

Sorry for the confusion. The code gets stuck during training. In most cases, it happens during or after the validation steps (e.g., the tqdm bar stops before reaching the final number, or right after "100%").

image

While trying to collect some cases for you, I ran into another problem. It seems to appear when I train two models on one server (each model is trained on two GPUs, and the two models do not share a GPU).

image

Update: I restarted the model training (the other model is still training), and it soon raised the bus error.
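Bus errors in multi-process Python training jobs are often, though not always, a sign that shared memory (`/dev/shm` on Linux) has run out, for example when two jobs on one machine both keep large buffers there. Whether that is the cause here is not established in this thread. As a rough check under that assumption, the sketch below reports how much of a filesystem is still free; the function name `free_fraction` is hypothetical.

```python
import shutil


def free_fraction(path: str = "/dev/shm") -> float:
    """Return the fraction of the filesystem at `path` that is still free.

    The default path assumes a Linux machine where shared memory is mounted
    at /dev/shm; pass a different path on other systems.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total
```

Running this before starting the second job, and again while both are training, would show whether shared memory is close to full when the error appears.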

Finally, I would appreciate it if you could tell me what changes are needed to run RotatE and DistMult.

Thanks once more for the reply.

AprLie commented 1 year ago

image

One more case, where the hang occurs after validation has finished.

fxmeng commented 1 year ago

I have encountered all of the same problems.

hyren commented 1 year ago

Hi, sorry for the late reply. We just pushed a hotfix to the wikikgv2 branch for the hang during evaluation. Could you please pull the latest changes and try again?