Closed. jianyucai closed this issue 3 years ago.
Hi, one simple fix is to set self.cross_part = True on line 342 of sampler.py so that strict_rel_part is False. Either way, since you only train the model on one GPU with one process, you do not need to partition the relations or create the global_relation_emb. In the meantime, we are also doing some test runs for this fix and will push it once the tests finish. Thank you.
Hi, we have pushed the fix. Now you can run the code on a single GPU, and an example is shown below. Note that we do not need the async_update flag since we only have one process.
CUDA_VISIBLE_DEVICES=0 dglke_train \
--model_name ComplEx \
--hidden_dim 100 --gamma 3 --lr 0.1 --regularization_coef 1e-9 \
--valid --test -adv --mix_cpu_gpu --num_proc 1 --num_thread 1 \
--gpu 0 \
--no_save_emb \
--print_on_screen --encoder_model_name concat -de -dr --save_path ./save_path
Thanks for your kind response!
Thanks for your baseline code. It seems that the baseline code does not support single-GPU, single-process training on wikikg90m.
I tried to train the model on a single GPU with a single process by specifying --gpu 0 and --num_proc 1. However, the baseline code makes the following assertion, and args.strict_rel_part must be true if we specify --gpu 0 and --num_proc 1:
https://github.com/snap-stanford/ogb/blob/2229cb40781a8bb3505f0f379869d43e9a74beb9/examples/lsc/wikikg90m/dgl-ke-ogb-lsc/python/dglke/models/general_models.py#L262
The reasons are as follows.
The variable strict_rel_part is defined as follows.
https://github.com/snap-stanford/ogb/blob/2229cb40781a8bb3505f0f379869d43e9a74beb9/examples/lsc/wikikg90m/dgl-ke-ogb-lsc/python/dglke/train.py#L152
args.mix_cpu_gpu must be set to true in order to support the large number of entities.
cross_part is defined as follows.
https://github.com/snap-stanford/ogb/blob/2229cb40781a8bb3505f0f379869d43e9a74beb9/examples/lsc/wikikg90m/dgl-ke-ogb-lsc/python/dglke/dataloader/sampler.py#L333-L343
If we set --num_proc 1, which is the variable ranks above, then cross_part must be false.
My question is: did I miss something, or does the baseline code actually not support training with a single GPU and a single process?
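For readers following the chain of flags, here is a minimal Python sketch of the reasoning in this thread. It is not the code from the repository: the function names are hypothetical, the rule strict_rel_part = mix_cpu_gpu and not cross_part is reconstructed from the description in the posts, and the multi-process branch is a simplification made only for illustration.

```python
def cross_part_for(ranks: int, workaround: bool = False) -> bool:
    """Hypothetical stand-in for how the sampler picks cross_part."""
    if workaround:
        # The suggested quick fix: force cross_part to True.
        return True
    # Per the thread, a single process (ranks == 1) leaves cross_part False.
    # Treating ranks > 1 as True is an assumption made for this sketch.
    return ranks > 1


def strict_rel_part_for(mix_cpu_gpu: bool, cross_part: bool) -> bool:
    """Assumed rule, inferred from the thread, not copied from train.py."""
    return mix_cpu_gpu and not cross_part


# Original situation: --num_proc 1 together with --mix_cpu_gpu.
before = strict_rel_part_for(True, cross_part_for(ranks=1))

# With the workaround of forcing cross_part to True.
after = strict_rel_part_for(True, cross_part_for(ranks=1, workaround=True))

print("strict_rel_part before workaround:", before)
print("strict_rel_part after workaround:", after)
```

Under these assumptions, strict_rel_part is true for the single-process run, which is the state reported above as conflicting with the assertion, and it becomes false once cross_part is forced to True.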