snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Support for single GPU and single Process training on wikikg90m. #138

Closed: jianyucai closed this issue 3 years ago

jianyucai commented 3 years ago

Thanks for your baseline code. It seems that the baseline code does not support single-GPU, single-process training on wikikg90m.

I tried to use the following training script to train the model on a single GPU with a single process by specifying --gpu 0 and --num_proc 1.

CUDA_VISIBLE_DEVICES=0 dglke_train \
--model_name ComplEx \
--hidden_dim 100 --gamma 3 --lr 0.1 --regularization_coef 1e-9 \
--valid --test -adv --mix_cpu_gpu --num_proc 1 --num_thread 1 \
--gpu 0 \
--async_update --force_sync_interval 50000 --no_save_emb \
--print_on_screen --encoder_model_name concat -de -dr --save_path ./save_path

However, the baseline code makes the following assertion, which cannot pass in this setup because args.strict_rel_part is necessarily true when we specify --gpu 0 and --num_proc 1.

https://github.com/snap-stanford/ogb/blob/2229cb40781a8bb3505f0f379869d43e9a74beb9/examples/lsc/wikikg90m/dgl-ke-ogb-lsc/python/dglke/models/general_models.py#L262

The reasons are as follows.

The variable strict_rel_part is defined as follows.

https://github.com/snap-stanford/ogb/blob/2229cb40781a8bb3505f0f379869d43e9a74beb9/examples/lsc/wikikg90m/dgl-ke-ogb-lsc/python/dglke/train.py#L152

args.mix_cpu_gpu must be set to true in order to support the large number of entities.

cross_part is defined as follows.

https://github.com/snap-stanford/ogb/blob/2229cb40781a8bb3505f0f379869d43e9a74beb9/examples/lsc/wikikg90m/dgl-ke-ogb-lsc/python/dglke/dataloader/sampler.py#L333-L343

--num_proc sets the variable ranks in the code above, so if we set --num_proc 1, cross_part must be false.
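
The chain of reasoning above can be sketched in a few lines of Python. This is only an illustration of the logic as described in this issue, not the repository code: the helper name strict_rel_part_for and the boolean expression inside it are assumptions inferred from the linked lines, and the actual definitions in train.py and sampler.py may differ.

```python
def strict_rel_part_for(mix_cpu_gpu: bool, cross_part: bool) -> bool:
    """Hypothetical helper: assumed form of how strict_rel_part is derived.

    Assumption (inferred from this thread): the flag is true when
    mix_cpu_gpu is enabled and the training data has no cross partition.
    """
    return mix_cpu_gpu and not cross_part


# Settings from the training script in this issue.
mix_cpu_gpu = True   # --mix_cpu_gpu, needed for the large number of entities
cross_part = False   # forced when --num_proc 1, per the reasoning above

strict_rel_part = strict_rel_part_for(mix_cpu_gpu, cross_part)
print("strict_rel_part =", strict_rel_part)
```

Under these assumptions, strict_rel_part cannot be false for the single-GPU, single-process script unless either mix_cpu_gpu or cross_part changes.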

My question is: did I miss something, or does the baseline code actually not support training with a single GPU and a single process?

hyren commented 3 years ago

Hi, one simple fix is to set self.cross_part = True on line 342 of sampler.py so that strict_rel_part is False. Since you train the model on only one GPU and one process, you do not need to partition the relations or create the global_relation_emb either way. In the meantime, we are doing some test runs for this fix and will push it once the tests finish. Thank you.

hyren commented 3 years ago

Hi, we have pushed the fix. You can now run the code on a single GPU; an example is shown below. Note that the async_update flag is not needed since there is only one process.

CUDA_VISIBLE_DEVICES=0 dglke_train \
--model_name ComplEx \
--hidden_dim 100 --gamma 3 --lr 0.1 --regularization_coef 1e-9 \
--valid --test -adv --mix_cpu_gpu --num_proc 1 --num_thread 1 \
--gpu 0 \
--no_save_emb \
--print_on_screen --encoder_model_name concat -de -dr --save_path ./save_path

jianyucai commented 3 years ago

Thanks for your kind response!