snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Stuck at Validation sanity check #347

Closed. MIracleyin closed this issue 1 year ago

MIracleyin commented 1 year ago

Hi, dear ogb team. I want to take part in OGB-LSC, but I am stuck on a problem and can't find any relevant issues in this repo.

I ran example/lsc/mag240m/gnn.py and got the following output:

Namespace(hidden_channels=1024, batch_size=1024, dropout=0.5, epochs=100, model='gat', sizes=[25, 15], in_memory=False, device='0', evaluate=False)
Global seed set to 42

Params 4890777
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Converting adjacency matrix... Done! [627.29s]
Reading dataset... Done! [130.84s]

  | Name      | Type       | Params
0 | convs     | ModuleList | 1.8 M
1 | norms     | ModuleList | 4.1 K
2 | skips     | ModuleList | 1.8 M
3 | mlp       | Sequential | 1.2 M
4 | train_acc | Accuracy   | 0
5 | val_acc   | Accuracy   | 0
6 | test_acc  | Accuracy   | 0

4.9 M     Trainable params
0         Non-trainable params
4.9 M     Total params
19.563    Total estimated model params size (MB)
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]

Is this normal? Thanks for your replies.

MIracleyin commented 1 year ago

I thought this issue was caused by the pytorch_lightning validation sanity check, so I set num_sanity_val_steps=0:

trainer = Trainer(gpus=args.device, max_epochs=args.epochs,
                  callbacks=[checkpoint_callback],
                  default_root_dir=f'logs/{args.model}',
                  num_sanity_val_steps=0)

But I am still stuck, now at the first training step of the Trainer:

Namespace(hidden_channels=1024, batch_size=1024, dropout=0.5, epochs=100, model='gat', sizes=[25, 15], in_memory=False, device='0', evaluate=False)
Global seed set to 42
Params 4890777
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Reading dataset... Done! [147.73s]

  | Name      | Type       | Params
0 | convs     | ModuleList | 1.8 M
1 | norms     | ModuleList | 4.1 K
2 | skips     | ModuleList | 1.8 M
3 | mlp       | Sequential | 1.2 M
4 | train_acc | Accuracy   | 0
5 | val_acc   | Accuracy   | 0
6 | test_acc  | Accuracy   | 0

4.9 M     Trainable params
0         Non-trainable params
4.9 M     Total params
19.563    Total estimated model params size (MB)
Epoch 0: 0%| | 0/1223 [00:00<?, ?it/s]
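A progress bar that sits at 0% only shows that the first batch has not arrived yet. It does not tell a real hang apart from a very slow data loader. One way to check is to time how long the first few batches take to come out of the loader, outside the Trainer. The sketch below is not part of gnn.py. The helper name time_first_batches and the toy slow_loader are hypothetical, and it assumes only that the loader is an ordinary Python iterable.

```python
import time


def time_first_batches(loader, num_batches=3):
    """Return the seconds spent waiting for each of the first few batches."""
    waits = []
    it = iter(loader)  # iterator creation itself is not timed here
    for _ in range(num_batches):
        start = time.perf_counter()
        try:
            next(it)  # blocks until the loader produces a batch
        except StopIteration:
            break  # loader had fewer batches than requested
        waits.append(time.perf_counter() - start)
    return waits


if __name__ == "__main__":
    def slow_loader():
        # Hypothetical stand-in for a loader that is slow on every batch.
        for i in range(5):
            time.sleep(0.05)
            yield i

    for i, wait in enumerate(time_first_batches(slow_loader())):
        print(f"batch {i}: waited {wait:.3f}s")
```

If every batch takes a long but finite time, the run is slow rather than hung, and the bottleneck is probably in data loading rather than in the model.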

MIracleyin commented 1 year ago

I found that the hang was due to slow disk speed, so I will close this issue. Thanks, ogb team.
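For anyone who wants to check whether disk speed is the bottleneck in a similar run, here is a minimal sketch that times random row-sized reads from a file. It assumes that with in_memory=False the features are read from disk at scattered offsets, which is the access pattern neighbor sampling tends to produce. The function name time_random_reads, the toy file, and the row size are all hypothetical and are not taken from the ogb code.

```python
import os
import random
import tempfile
import time


def time_random_reads(path, row_bytes, batch_size=1024, num_batches=5, seed=0):
    """Return (mean seconds per batch, bytes read in the last batch)."""
    num_rows = os.path.getsize(path) // row_bytes
    rng = random.Random(seed)
    timings = []
    bytes_read = 0
    with open(path, "rb") as f:
        for _ in range(num_batches):
            # Pick random rows, as a sampler would pick random nodes.
            rows = sorted(rng.randrange(num_rows) for _ in range(batch_size))
            start = time.perf_counter()
            bytes_read = 0
            for r in rows:
                f.seek(r * row_bytes)
                bytes_read += len(f.read(row_bytes))
            timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings), bytes_read


if __name__ == "__main__":
    row_bytes = 768 * 2  # assumed row size for illustration only
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy_feat.bin")  # toy file, not the real dataset
        with open(path, "wb") as f:
            f.write(b"\0" * (row_bytes * 10_000))
        secs, nbytes = time_random_reads(path, row_bytes)
        print(f"{secs:.4f}s per batch, {nbytes} bytes per batch")
```

Point path at a large file on the disk in question to get a meaningful number. The operating system caches recently read pages, so repeated runs on the same small file will look much faster than a cold read of a dataset larger than RAM.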