snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License
1.89k stars 398 forks

How to use OnTheFlyPCQMDataset function during Training? #183

Closed xinformatics closed 3 years ago

xinformatics commented 3 years ago

Hi, I am trying to run the baseline models on Colab Pro. Even with 25 GB of RAM, the Colab server crashes. How do I use OnTheFlyPCQMDataset (originally used in test_inference_gnn.py) for training, so that SMILES strings are converted to graphs on the fly instead of being stored in memory?

weihua916 commented 3 years ago

Hi! Could you elaborate on how your Colab crashes? Does it show an out-of-memory error (since I think 25GB should be enough)?

xinformatics commented 3 years ago

I am trying to run the code on Colab Pro:

!python main_gnn.py --gnn gcn --log_dir /content/log_dir --checkpoint_dir /content/checkpoints --save_test_dir /content/save_test

and am getting the following output:

Namespace(batch_size=256, checkpoint_dir='/content/checkpoints', device=0, drop_ratio=0, emb_dim=600, epochs=100, gnn='gcn', graph_pooling='sum', log_dir='/content/log_dir', num_layers=5, num_workers=0, save_test_dir='/content/save_test', train_subset=False)
Downloading https://dgl-data.s3-accelerate.amazonaws.com/dataset/OGB-LSC/pcqm4m_kddcup2021.zip
Downloaded 0.06 GB: 100% 59/59 [00:01<00:00, 45.18it/s]
Extracting dataset/pcqm4m_kddcup2021.zip
Processing...
Converting SMILES strings into graphs...
 49% 1868713/3803453 [20:57<20:54, 1542.73it/s] [17:42:12] WARNING: could not find number of expected rings. Switching to an approximate ring finding algorithm.
 69% 2641432/3803453 [29:32<12:26, 1556.73it/s] [17:50:47] WARNING: not removing hydrogen atom without neighbors
(the "not removing hydrogen atom without neighbors" warning repeats many more times between 73% and 100%)
100% 3803453/3803453 [42:10<00:00, 1502.85it/s]
tcmalloc: large alloc 3874652160 bytes == 0x562bde428000 @ 0x7f91d693eb6b 0x7f91d695e379 ... (backtrace omitted)
tcmalloc: large alloc 1772797952 bytes == 0x562cc5cc4000 @ 0x7f91d693eb6b 0x7f91d695e379 ... (backtrace omitted)
tcmalloc: large alloc 2659196928 bytes == 0x562d2f770000 @ 0x7f91d693eb6b 0x7f91d695e379 ... (backtrace omitted)

After this, the runtime crashes with all of the RAM (about 25 GB) used up.

weihua916 commented 3 years ago

Interesting. The best solution is to get more RAM...

If you want to use OnTheFlyPCQMDataset during training, you can replace https://github.com/snap-stanford/ogb/blob/master/examples/lsc/pcqm4m/test_inference_gnn.py#L103-L105 as follows.

train_smiles_dataset = [smiles_dataset[i] for i in split_idx['train']]
onthefly_train_dataset = OnTheFlyPCQMDataset(train_smiles_dataset)
train_loader = DataLoader(onthefly_train_dataset, batch_size=256, shuffle=True, num_workers=args.num_workers)

Then, use this loader in the main training script (https://github.com/snap-stanford/ogb/blob/master/examples/lsc/pcqm4m/main_gnn.py#L128). You will probably need a fairly large num_workers to keep the data loading fast enough.
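For reference, the point of OnTheFlyPCQMDataset is that it stores only the SMILES strings and converts each one to a graph at the moment a data point is requested, so the millions of preprocessed graphs never have to sit in RAM at once. Below is a minimal sketch of such a dataset, not the exact class from test_inference_gnn.py: the class name OnTheFlySmilesDataset is hypothetical, and it assumes each item is a (smiles, target) pair and that ogb.utils.smiles2graph does the conversion.

import torch
from torch.utils.data import Dataset
from torch_geometric.data import Data
from ogb.utils import smiles2graph

class OnTheFlySmilesDataset(Dataset):
    # Sketch only: builds each PyG graph lazily in __getitem__ instead of
    # materializing all graphs in memory during dataset processing.
    def __init__(self, smiles_y_pairs):
        self.smiles_y_pairs = smiles_y_pairs  # list of (smiles, target) tuples

    def __len__(self):
        return len(self.smiles_y_pairs)

    def __getitem__(self, idx):
        # y is unused here, mirroring the inference-only class;
        # the fix discussed later in this thread attaches it.
        smiles, y = self.smiles_y_pairs[idx]
        graph = smiles2graph(smiles)  # dict with 'node_feat', 'edge_index', 'edge_feat'
        return Data(
            x=torch.from_numpy(graph['node_feat']).to(torch.int64),
            edge_index=torch.from_numpy(graph['edge_index']).to(torch.int64),
            edge_attr=torch.from_numpy(graph['edge_feat']).to(torch.int64),
        )

Because the conversion happens inside __getitem__, it runs in the DataLoader worker processes, which is why a larger num_workers helps hide the preprocessing cost behind the GPU computation.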

xinformatics commented 3 years ago

I tried what you suggested and changed the train, valid, and test loaders accordingly.

Ran

!python main_gnn.py --gnn gcn --num_workers 2 --log_dir /content/log_dir --checkpoint_dir /content/checkpoints --save_test_dir /content/save_test

It gives an error:

Namespace(batch_size=256, checkpoint_dir='/content/checkpoints', device=0, drop_ratio=0, emb_dim=600, epochs=100, gnn='gcn', graph_pooling='sum', log_dir='/content/log_dir', num_layers=5, num_workers=2, save_test_dir='/content/save_test', train_subset=False)

Params: 1955401

=====Epoch 1
Training...
Iteration: 0% 0/11896 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "main_gnn.py", line 228, in <module>
    main()
  File "main_gnn.py", line 195, in main
    train_mae = train(model, device, train_loader, optimizer)
  File "main_gnn.py", line 34, in train
    loss = reg_criterion(pred, batch.y)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 96, in forward
    return F.l1_loss(input, target, reduction=self.reduction)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/functional.py", line 2887, in l1_loss
    if not (target.size() == input.size()):
AttributeError: 'NoneType' object has no attribute 'size'

weihua916 commented 3 years ago

I see.

You will need to insert the following line before https://github.com/snap-stanford/ogb/blob/master/examples/lsc/pcqm4m/test_inference_gnn.py#L55

data.y = torch.FloatTensor([y])

This attaches the target value to the PyG Data object, which is needed for training (but was not necessary for test inference).
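To make the failure concrete: main_gnn.py computes loss = reg_criterion(pred, batch.y) with an L1 loss, and l1_loss starts by calling target.size(). If data.y was never set, batch.y is None, which produces exactly the AttributeError in the traceback. A minimal illustration (with the PyTorch version shown in the traceback):

import torch
import torch.nn.functional as F

pred = torch.zeros(4, 1)  # stand-in for the model output
target = None             # what batch.y looks like when data.y was never set
F.l1_loss(pred, target)   # AttributeError: 'NoneType' object has no attribute 'size'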

xinformatics commented 3 years ago

Thanks so much, it worked! It reduced the RAM usage heavily (to about 6 GB of the 25 GB available), and at least I am able to check the baselines now. With num_workers set to 4, training takes about 20 minutes per epoch, validation takes 2.5 minutes, and generating the test-set predictions takes 2 minutes.