Hi, I am trying to run baseline models on Colab Pro. Even with 25 GB of RAM, the Colab runtime crashes. How do I use `OnTheFlyPCQMDataset` (originally used in `test_inference_gnn.py`) for training, so that SMILES strings are converted directly to graphs on the fly instead of all the graphs being stored in memory?
Hi! Could you elaborate on how your Colab crashes? Does it show an out-of-memory error? (I would think 25 GB is enough.)
I am trying to run the code on Colab Pro:

```
!python main_gnn.py --gnn gcn --log_dir /content/log_dir --checkpoint_dir /content/checkpoints --save_test_dir /content/save_test
```

and I get the following output:
```
Namespace(batch_size=256, checkpoint_dir='/content/checkpoints', device=0, drop_ratio=0, emb_dim=600, epochs=100, gnn='gcn', graph_pooling='sum', log_dir='/content/log_dir', num_layers=5, num_workers=0, save_test_dir='/content/save_test', train_subset=False)
Downloading https://dgl-data.s3-accelerate.amazonaws.com/dataset/OGB-LSC/pcqm4m_kddcup2021.zip
Downloaded 0.06 GB: 100% 59/59 [00:01<00:00, 45.18it/s]
Extracting dataset/pcqm4m_kddcup2021.zip
Processing...
Converting SMILES strings into graphs...
 49% 1868713/3803453 [20:57<20:54, 1542.73it/s][17:42:12] WARNING: could not find number of expected rings. Switching to an approximate ring finding algorithm.
 69% 2641432/3803453 [29:32<12:26, 1556.73it/s][17:50:47] WARNING: not removing hydrogen atom without neighbors
[... the same "not removing hydrogen atom without neighbors" warning repeated many times ...]
100% 3803453/3803453 [42:10<00:00, 1502.85it/s]
tcmalloc: large alloc 3874652160 bytes == 0x562bde428000 @ 0x7f91d693eb6b ...
tcmalloc: large alloc 1772797952 bytes == 0x562cc5cc4000 @ 0x7f91d693eb6b ...
tcmalloc: large alloc 2659196928 bytes == 0x562d2f770000 @ 0x7f91d693eb6b ...
```
After this, the runtime crashes, with all of the RAM (about 25 GB) used up.
Interesting. The best solution is to get more RAM...
If you want to use `OnTheFlyPCQMDataset` during training, you can replace https://github.com/snap-stanford/ogb/blob/master/examples/lsc/pcqm4m/test_inference_gnn.py#L103-L105 as follows:

```python
# Build the on-the-fly dataset over the training split only,
# so graphs are constructed per batch instead of held in memory.
train_smiles_dataset = [smiles_dataset[i] for i in split_idx['train']]
onthefly_train_dataset = OnTheFlyPCQMDataset(train_smiles_dataset)
train_loader = DataLoader(onthefly_train_dataset, batch_size=256, shuffle=True, num_workers=args.num_workers)
```
Then, use this loader in the main training script: https://github.com/snap-stanford/ogb/blob/master/examples/lsc/pcqm4m/main_gnn.py#L128. You will probably need a fairly large `num_workers` to make sure data loading is fast enough to keep the GPU busy.
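For reference, here is a rough sketch of what `OnTheFlyPCQMDataset` in `test_inference_gnn.py` does (the actual class may differ in detail): each element of `smiles_dataset` is a `(smiles, target)` pair, and the conversion to a PyG graph happens inside `__getitem__`, so only the raw SMILES strings are held in RAM instead of all ~3.8M preprocessed graphs.

```python
import torch
from ogb.utils import smiles2graph  # SMILES string -> graph dict
from torch_geometric.data import Data

class OnTheFlyPCQMDataset(torch.utils.data.Dataset):
    def __init__(self, smiles_list):
        # smiles_list holds (smiles, target) pairs; only the strings live in RAM
        self.smiles_list = smiles_list

    def __len__(self):
        return len(self.smiles_list)

    def __getitem__(self, idx):
        smiles, y = self.smiles_list[idx]
        graph = smiles2graph(smiles)  # convert one molecule on demand
        data = Data()
        data.num_nodes = int(graph['num_nodes'])
        data.edge_index = torch.from_numpy(graph['edge_index']).to(torch.int64)
        data.edge_attr = torch.from_numpy(graph['edge_feat']).to(torch.int64)
        data.x = torch.from_numpy(graph['node_feat']).to(torch.int64)
        # note: the target y is not attached here (not needed for inference)
        return data
```

Because the RDKit conversion now runs inside the `DataLoader` workers, a larger `num_workers` parallelizes it across CPU cores.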
I tried what you suggested and changed the train, valid, and test loaders accordingly (the validation loader change is sketched below).
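For example, a sketch of the analogous validation loader, assuming the same `split_idx` keys used in `main_gnn.py`:

```python
# Validation split: same pattern as the train loader, without shuffling
valid_smiles_dataset = [smiles_dataset[i] for i in split_idx['valid']]
onthefly_valid_dataset = OnTheFlyPCQMDataset(valid_smiles_dataset)
valid_loader = DataLoader(onthefly_valid_dataset, batch_size=256, shuffle=False, num_workers=args.num_workers)
```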
Then I ran:

```
!python main_gnn.py --gnn gcn --num_workers 2 --log_dir /content/log_dir --checkpoint_dir /content/checkpoints --save_test_dir /content/save_test
```
It gives an error:

```
Namespace(batch_size=256, checkpoint_dir='/content/checkpoints', device=0, drop_ratio=0, emb_dim=600, epochs=100, gnn='gcn', graph_pooling='sum', log_dir='/content/log_dir', num_layers=5, num_workers=2, save_test_dir='/content/save_test', train_subset=False)
=====Epoch 1
Training...
Iteration: 0% 0/11896 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "main_gnn.py", line 228, in <module>
```
I see.
You will need to insert the following line before https://github.com/snap-stanford/ogb/blob/master/examples/lsc/pcqm4m/test_inference_gnn.py#L55:

```python
data.y = torch.FloatTensor([y])
```

This attaches the target value to the PyG `Data` object, which is needed for training (but was not necessary for test inference).
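In terms of the sketch above, the tail of `__getitem__` would then look like:

```python
        data.x = torch.from_numpy(graph['node_feat']).to(torch.int64)
        data.y = torch.FloatTensor([y])  # attach the regression target for the training loss
        return data
```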
Thanks, it worked, and it reduced RAM usage substantially (about 6 GB of the 25 GB). Thanks so much; at least I am now able to check the baselines. With `num_workers` set to 4, training takes about 20 minutes per epoch, validation takes about 2.5 minutes, and generating the test-set predictions takes about 2 minutes.