snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Hyperparameters and runtimes of performance table for MAG240M #177

Closed jwkirchenbauer closed 3 years ago

jwkirchenbauer commented 3 years ago

Kudos to @rusty1s and @weihua916 for all the work building and maintaining such a solid codebase, especially the examples, which serve as a reference for the OGB-LSC.

Re: the challenge baseline, would it be possible to get a confirmation/tabulation of the hyperparameter settings used to produce the MAG240M performance table?

Second, do approximate runtime estimates exist for the training done on the GeForce RTX 2080 Ti (11GB GPU)? I'm attempting to reproduce on a different setup, troubleshooting disk/memory/batching latencies and possibly multi-GPU runs, and would love to have a reference. I noticed a '24 hour' figure given as an example on the challenge participant info page and wonder how close that is.

Thanks!

weihua916 commented 3 years ago

The default hyper-parameters should work.

For approximate runtimes, see https://github.com/snap-stanford/ogb/discussions/121#discussioncomment-506595. Note: the runtime depends heavily on the speed of your disk reads (not the GPU spec).
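To illustrate why (hypothetical file and variable names): the node features are accessed through a memory map, so each mini-batch gather turns into scattered reads from disk, and a slow disk, not the GPU, sets the pace of an epoch.

```python
import numpy as np

# Sketch with placeholder names: nothing is read when the memmap is opened;
# every batch gather triggers many small, scattered disk reads.
x = np.load('node_feat.npy', mmap_mode='r')          # opens the file lazily
n_id = np.random.randint(0, x.shape[0], size=1024)   # stand-in for sampled node ids
batch = x[n_id]                                      # random gather -> disk-bound
```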

jwkirchenbauer commented 3 years ago

Thanks!

As an update: with minor modifications to the baseline code to use Lightning's multi-GPU support, I can confirm that the baseline MAG240M code runs as expected on one V100 GPU, replicating the GraphSAGE result of 67.3 val acc, and also on multiple GPUs, cutting the 100-epoch training time down to ~3 hrs. (What's the ETA on the hidden test set server, out of curiosity?)
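Roughly, the multi-GPU change I mean (a sketch with the PyTorch Lightning 1.x Trainer arguments of the time; the GPU and epoch counts are placeholders, not my exact settings):

```python
import pytorch_lightning as pl

# Sketch (PL 1.x API): request several GPUs and the DDP accelerator.
# Placeholder values; model/datamodule come from the example script.
trainer = pl.Trainer(gpus=4, accelerator='ddp', max_epochs=100)
# trainer.fit(model, datamodule=datamodule)
```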

As noted in other threads such as #131, disk speed matters because the data is loaded via numpy.memmap. I hit an issue with a network file store that was quite slow on the batch reads, so I moved to local disk to solve that.

Loading into RAM with --in-memory also partially solves that problem (after the initial caching), but because of how PyTorch/Lightning DDP works, it ends up replicating the data (~200GB * n_gpus) as you scale to more GPUs, so I'll probably look into whether there's an elegant extension to the example script's dataloaders. This is a known pain point with the DataLoader(Dataset) interface when using blocked array formats like .npy or .hdf5 under multiprocessing/torch DDP, where you'd like to share cached data between loaders and across epochs. Apologies if I'm missing a thread where this was addressed in more detail.
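To make the replication concrete, a hypothetical sketch (names are mine, not the example script's) of what happens under DDP spawning:

```python
import numpy as np
import torch.multiprocessing as mp

# Each spawned rank runs the same loading code, so with --in-memory every
# GPU process ends up holding its own full copy of the feature matrix.
def train(rank, world_size):
    feat = np.load('full_feat.npy')  # executed once per rank -> ~200GB * n_gpus
    print(rank, feat.shape)

if __name__ == '__main__':
    mp.spawn(train, args=(4,), nprocs=4)
```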

Hopefully going to verify more of the models and try the other datasets soon.

rusty1s commented 3 years ago

Curious whether you have any ideas on how to solve the replicated data loading in multi-GPU training. I guess one option is to load full_feat.npy into shared memory in the main process and just pass it to the sub-processes.
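Something along these lines (just a sketch of the idea, not tested):

```python
import numpy as np
import torch
import torch.multiprocessing as mp

def train(rank, world_size, feat):
    # every rank indexes the same shared-memory storage, no per-rank copy
    print(rank, feat[0, :3])

if __name__ == '__main__':
    feat = torch.from_numpy(np.load('full_feat.npy'))  # one copy in the main process
    feat.share_memory_()                               # move its storage to shared memory
    mp.spawn(train, args=(4, feat), nprocs=4)          # passed by handle, not copied
```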

jwkirchenbauer commented 3 years ago

Initially, since the code uses the LightningDataModule interface with the key methods prepare_data and setup, I looked for a way to hack it in there, but couldn't. The docs say that prepare_data is called once per DDP group and setup on every GPU, and it's semi-explicitly stated that you shouldn't try to cache shared state in either one.
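For reference, the hook semantics I mean (a hypothetical skeleton, not the actual MAG240M data module):

```python
import pytorch_lightning as pl

class FeatureDataModule(pl.LightningDataModule):
    """Hypothetical skeleton, only to illustrate the hook semantics above."""

    def prepare_data(self):
        # Called once (rank 0): intended for download/preprocessing to disk.
        # State assigned here is not visible to the other ranks.
        pass

    def setup(self, stage=None):
        # Called in every GPU process: anything loaded here (e.g. an
        # in-memory feature matrix) exists once per rank, which is the
        # replication discussed earlier in the thread.
        pass
```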

So manually using shared memory might be the right way to do it, since it's not clear this can be achieved within the Lightning patterns. Because we don't have access to the actual process launching, it's harder for me to figure out whether this is doable.

But I'm going to give this a shot; I'll get back to you.

On the other hand, I was thinking about trying some sort of sharded-dataset solution, slicing based on the GPU rank of the process, but conceptually this doesn't make a lot of sense to me in the context of the graph and the NeighborSampler (I'm newer to graph problems, though). Do you think a sharded solution is possible?

rusty1s commented 3 years ago

I don't think a sharded dataset solution is viable here, as the neighbors of nodes might be located anywhere in node_feat.npy. @tchaton Any idea on how to integrate that into PyTorch Lightning?

jwkirchenbauer commented 3 years ago

> I don't think a sharded dataset solution is viable here, as the neighbors of nodes might be located anywhere in node_feat.npy. @tchaton Any idea on how to integrate that into PyTorch Lightning?

Yeah, I just took a closer look at PyG's NeighborSampler to remind myself what it does, and that makes sense.
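For anyone following along, a minimal sketch (toy data, the torch_geometric.data.NeighborSampler API of the time) of why per-rank feature shards don't work: the n_id returned for each mini-batch are global node ids that can point anywhere in the feature matrix.

```python
import torch
from torch_geometric.data import NeighborSampler

num_nodes = 1000
x = torch.randn(num_nodes, 16)                        # stand-in for node_feat.npy
edge_index = torch.randint(0, num_nodes, (2, 5000))   # stand-in graph
train_idx = torch.arange(100)                         # seed nodes for "this rank"

loader = NeighborSampler(edge_index, node_idx=train_idx, num_nodes=num_nodes,
                         sizes=[25, 15], batch_size=32, shuffle=True)
for batch_size, n_id, adjs in loader:
    x_batch = x[n_id]  # n_id spans the whole graph, not just this rank's seeds
```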

shangyihao commented 3 years ago

> Thanks!
>
> As an update: with minor modifications to the baseline code to use Lightning's multi-GPU support, I can confirm that the baseline MAG240M code runs as expected on one V100 GPU, replicating the GraphSAGE result of 67.3 val acc, and also on multiple GPUs, cutting the 100-epoch training time down to ~3 hrs. (What's the ETA on the hidden test set server, out of curiosity?)
>
> As noted in other threads such as #131, disk speed matters because the data is loaded via numpy.memmap. I hit an issue with a network file store that was quite slow on the batch reads, so I moved to local disk to solve that.
>
> Loading into RAM with --in-memory also partially solves that problem (after the initial caching), but because of how PyTorch/Lightning DDP works, it ends up replicating the data (~200GB * n_gpus) as you scale to more GPUs, so I'll probably look into whether there's an elegant extension to the example script's dataloaders. This is a known pain point with the DataLoader(Dataset) interface when using blocked array formats like .npy or .hdf5 under multiprocessing/torch DDP, where you'd like to share cached data between loaders and across epochs. Apologies if I'm missing a thread where this was addressed in more detail.
>
> Hopefully going to verify more of the models and try the other datasets soon.

Hi, I also attempted multi-GPU training based on the baseline code, but failed to get it working (after changing the Trainer's accelerator, I ran into problems modifying the batch size, among other issues). Could you share some details of the minor modifications you made?

Thanks!