GNN training on MAG240M hangs---slow loading of np.memmap

chenxuhao commented 3 years ago

Hello, I got this error when trying to run the example in /ogb/examples/lsc/mag240m:

$ python gnn.py --device=0 --model=graphsage
Namespace(batch_size=1024, device='0', dropout=0.5, epochs=100, evaluate=False, hidden_channels=1024, model='graphsage', sizes=[25, 15])
Global seed set to 42
#Params 4884633
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Reading dataset... Done! [23.71s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type       | Params
-------------------------------------
0 | convs | ModuleList | 3.7 M
1 | norms | ModuleList | 4.1 K
2 | skips | ModuleList | 0
3 | mlp   | Sequential | 1.2 M
4 | acc   | Accuracy   | 0
-------------------------------------
4.9 M     Trainable params
0         Non-trainable params
4.9 M     Total params
19.539    Total estimated model params size (MB)
Traceback (most recent call last):
  File "gnn.py", line 231, in <module>
    trainer.fit(model, datamodule=datamodule)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 607, in run_train
    self.run_sanity_check(self.lightning_module)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 846, in run_sanity_check
    self.reset_val_dataloader(ref_model)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 364, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._reset_eval_dataloader(model, 'val')
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 278, in _reset_eval_dataloader
    dataloaders = self.request_dataloader(getattr(model, loader_name))
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 398, in request_dataloader
    dataloader = dataloader_fx()
  File "gnn.py", line 100, in val_dataloader
    return NeighborSampler(self.adj_t, node_idx=self.val_idx,
  File "/h2/xchen/.local/lib/python3.8/site-packages/torch_geometric/data/sampler.py", line 139, in __init__
    super(NeighborSampler, self).__init__(
TypeError: __init__() got an unexpected keyword argument 'transform'

I have installed PyTorch 1.8.0 and pytorch_lightning-1.2.5, and I installed PyG from source with: pip install git+https://github.com/rusty1s/pytorch_geometric.git

What am I missing here?

Thank you!

Xuhao Chen http://people.csail.mit.edu/xchen/

rusty1s commented 3 years ago

My guess is that you have multiple PyG versions installed. Try to run:

pip uninstall torch-geometric
pip uninstall torch-geometric  # Until no further versions are found
pip install git+https://github.com/rusty1s/pytorch_geometric.git
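Before uninstalling, it can help to confirm which copies Python actually sees. The following is a minimal sketch using only the standard library (Python 3.8+); the helper name find_distributions is hypothetical and not part of PyG or pip. It lists every installed distribution whose name matches torch-geometric, along with its location, and then reports which file `import torch_geometric` would resolve to. In the traceback above, the package is loaded from a user-level ~/.local site-packages directory, which is one place a stale copy can hide.

```python
import importlib.metadata as md
import importlib.util


def find_distributions(name):
    """Return (version, location) for each installed distribution called `name`.

    Hypothetical helper for illustration; names are compared case-insensitively
    with '_' and '-' treated as equivalent.
    """
    wanted = name.lower().replace("_", "-")
    found = []
    for dist in md.distributions():
        dist_name = dist.metadata.get("Name", "") or ""
        if dist_name.lower().replace("_", "-") == wanted:
            found.append((dist.version, str(dist.locate_file(""))))
    return found


# More than one entry here suggests several copies are visible on sys.path.
for version, location in find_distributions("torch-geometric"):
    print(f"torch-geometric {version} at {location}")

# Shows which copy an import would pick up, without importing the package.
spec = importlib.util.find_spec("torch_geometric")
print("import resolves to:", spec.origin if spec else "not installed")
```

If the reported location is not the one you just installed into, the older copy is probably shadowing the new one.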

chenxuhao commented 3 years ago

Thank you! This works for me.

Now it hangs like this:

$ python gnn.py --epochs 1 --hidden_channels 16
Namespace(batch_size=1024, device='0', dropout=0.5, epochs=1, evaluate=False, hidden_channels=16, model='gat', sizes=[25, 15])
Global seed set to 42
#Params 28185
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Reading dataset... Done! [185.68s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Set SLURM handle signals.

  | Name  | Type       | Params
-------------------------------------
0 | convs | ModuleList | 12.6 K
1 | norms | ModuleList | 64
2 | skips | ModuleList | 12.6 K
3 | mlp   | Sequential | 2.9 K
4 | acc   | Accuracy   | 0
-------------------------------------
28.2 K    Trainable params
0         Non-trainable params
28.2 K    Total params
0.113     Total estimated model params size (MB)
/jet/home/xhchen/anaconda3/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 40 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]

I know it is supposed to be slow, but how long is this stage supposed to take?

Thanks,

Xuhao

rusty1s commented 3 years ago

Can you try replacing this line with self.x = self.all_paper_feat to see if that fixes the issue?

weihua916 commented 3 years ago

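You can also try running the following to see if numpy's memmap mode is fast enough in your environment. We have found that it is sometimes slow.

import time
import torch
from ogb.lsc import MAG240MDataset

# Placeholder: point this at the directory that holds the MAG240M data.
ROOT_DIR = 'dataset'
dataset = MAG240MDataset(ROOT_DIR)
x = dataset.paper_feat  # memory-mapped paper features, not loaded into RAM

# Two independent batches of 200 random row indices.
idx1 = torch.randint(0, x.shape[0], (200, )).numpy()
idx2 = torch.randint(0, x.shape[0], (200, )).numpy()

# Time a random-row gather; indexing with an index array reads the rows from disk.
t = time.perf_counter()
x[idx1]
print(time.perf_counter() - t)

# Repeat with a fresh set of indices.
t = time.perf_counter()
x[idx2]
print(time.perf_counter() - t)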
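To try the same measurement without the full dataset, here is a rough sketch on a small synthetic file. Everything in it is an assumption made for illustration: the file name feat.npy, the 2000 x 768 float16 shape, and the helper time_random_rows are not taken from OGB. It times a random-row gather from a memory-mapped array and from the same data held in RAM, so the two can be compared on a given disk.

```python
import os
import tempfile
import time

import numpy as np


def time_random_rows(array, num_rows=200, seed=0):
    """Gather `num_rows` random rows from `array`; return (rows, seconds)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, array.shape[0], size=num_rows)
    start = time.perf_counter()
    rows = np.asarray(array[idx])  # index-array lookup copies the rows into RAM
    return rows, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feat.npy")  # hypothetical file name
    # Synthetic stand-in for a feature matrix (shape and dtype are assumptions).
    data = np.random.default_rng(0).standard_normal((2000, 768)).astype(np.float16)
    np.save(path, data)

    mm = np.load(path, mmap_mode="r")  # memory-mapped, read-only view of the file
    rows_mm, t_mm = time_random_rows(mm)
    rows_ram, t_ram = time_random_rows(data)
    del mm  # release the mapping before the temporary directory is removed

print(f"memmap gather:    {t_mm:.6f} s")
print(f"in-memory gather: {t_ram:.6f} s")
```

On a file this small the operating system will usually serve the reads from its page cache, so the two timings are likely to be close. The gap only becomes meaningful on a file much larger than RAM, on slow or network storage, which is the situation the snippet above is meant to detect.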

chenxuhao commented 3 years ago

Got it! It runs now! Thanks!

snap-stanford / ogb

GNN training on MAG240M hangs---slow loading of np.memmap #131