Closed: yinzhangyue closed this issue 2 years ago.
Could you share the full traceback with me? Also, you should divide the batch size by the number of GPUs you have (You should only use distributed training when you have multiple GPUs).
On Wed, Mar 30, 2022 at 12:11 PM Zhangyue Yin wrote:
Hi~ This error occurs when I use PyTorch DDP.
AttributeError: 'PCQM4MV2_Training' object has no attribute '_train_dataset'.
This is my config.
scheme: pcqm4mv2
model_name: egt_110m
# distributed: false # Set = true for multi-gpu
distributed: true # Set = true for multi-gpu
batch_size: 512 # For 8 GPUs: 512//8=64
model_height: 30
node_width: 768
edge_width: 64
num_heads: 32
num_epochs: 1000
max_lr: 8.0e-05
attn_dropout: 0.3
lr_warmup_steps: 240000
lr_total_steps: 1000000
node_ffn_multiplier: 1.0
edge_ffn_multiplier: 1.0
upto_hop: 16
dataloader_workers: 1 # For multi-process data fetch
scale_degree: true
num_virtual_nodes: 4
svd_random_neg: true
My machine has 8 Titan Xp GPUs and 256 GB of RAM. The system is Ubuntu 18.04 with PyTorch 1.10.2+cu113. I didn't make any changes to the code except for the configuration file. These are my running instructions:
python run_training.py configs/pcqm4mv2/egt_110m.yaml
This is my config.
scheme: pcqm4mv2
model_name: egt_110m
# distributed: false # Set = true for multi-gpu
distributed: true # Set = true for multi-gpu
batch_size: 64 # For 8 GPUs: 512//8=64
model_height: 30
node_width: 768
edge_width: 64
num_heads: 32
num_epochs: 1000
max_lr: 8.0e-05
attn_dropout: 0.3
lr_warmup_steps: 240000
lr_total_steps: 1000000
node_ffn_multiplier: 1.0
edge_ffn_multiplier: 1.0
upto_hop: 16
dataloader_workers: 1 # For multi-process data fetch
scale_degree: true
num_virtual_nodes: 4
svd_random_neg: true
The following is the error:
Initiated rank: 0
Initiated rank: 3
Initiated rank: 7
Initiated rank: 5
Initiated rank: 2
Initiated rank: 6
Initiated rank: 4
Initiated rank: 1
Traceback (most recent call last):
  File "run_training.py", line 6, in <module>
    execute('train', config)
  File "/remote-home/zyyin/Experiment/OGB/egt_pytorch/lib/training/execute.py", line 87, in execute
    torch.multiprocessing.spawn(fn = run_worker,
  File "/remote-home/zyyin/anaconda3/envs/grad/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/remote-home/zyyin/anaconda3/envs/grad/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/remote-home/zyyin/anaconda3/envs/grad/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 6 terminated with signal SIGKILL
I think it may be caused by the RAM limitation. How can I solve it?
Without DDP, I can run it correctly.
Does it run for the smaller datasets, i.e. MolPCBA and MolHIV (remove the config for pretrained weights)? If so, that might confirm that it's occurring because of memory limitations.
My script actually loads the whole dataset into memory for each process that is spawned (total memory consumed = dataset size × number of workers), which is inefficient. I plan to improve it in the future with shared memory among the processes. Or you could try to implement that yourself.
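Below is a minimal sketch of that shared-memory idea, assuming the dataset can be collated into plain CPU tensors before spawning; load_dataset_tensors and run_worker are placeholder names here, not the actual functions or classes in this repository.

import torch
import torch.multiprocessing as mp

def load_dataset_tensors():
    # Placeholder loader: in practice this would be the full PCQM4Mv2 data,
    # which is what consumes ~50 GB per process when every worker loads it itself.
    features = torch.randn(100_000, 64)
    targets = torch.randn(100_000)
    return features, targets

def run_worker(rank, world_size, features, targets):
    # Each spawned process receives a handle to the same shared storage,
    # so the dataset is held in RAM once instead of world_size times.
    print(f"rank {rank}: got tensors of shape {tuple(features.shape)}")
    # ... init_process_group, build the DataLoader / DDP model, train as usual ...

if __name__ == "__main__":
    world_size = 8
    # The 'file_system' strategy can avoid "too many open files" errors
    # when sharing large amounts of data across processes.
    mp.set_sharing_strategy("file_system")
    features, targets = load_dataset_tensors()
    # Move the storage into shared memory before spawning; torch.multiprocessing's
    # reducers then pass shared-memory handles to the children instead of copies.
    features.share_memory_()
    targets.share_memory_()
    mp.spawn(run_worker, args=(world_size, features, targets), nprocs=world_size)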
Another option: you may want to take the model and train it with a different framework such as PyTorch Lightning. Note that in my code the model, the dataset, and the training logic are separate (in different directories), so you could easily replace the training logic with your own.
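For reference, a minimal sketch of what such a wrapper could look like in PyTorch Lightning; the model constructor, the batch layout, and the L1 loss below are placeholder assumptions, not this repository's actual interfaces.

import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class EGTLightningModule(pl.LightningModule):
    def __init__(self, model, max_lr=8.0e-05):
        super().__init__()
        self.model = model      # e.g. an EGT model built from the model directory
        self.max_lr = max_lr

    def training_step(self, batch, batch_idx):
        # Placeholder loss: PCQM4Mv2 is a regression task on the HOMO-LUMO gap,
        # and the "target" key is an assumption about the batch layout.
        preds = self.model(batch)
        loss = F.l1_loss(preds, batch["target"])
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.max_lr)

# Lightning then handles spawning, DDP setup, and gradient sync:
# model = EGTLightningModule(build_egt_model(...))            # placeholder constructor
# trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp", max_epochs=1000)
# trainer.fit(model, train_dataloaders=make_dataloader(...))  # placeholder dataloader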
OK, you are right! The RAM consumption is so large because each process needs 50 GB of RAM, which is unacceptable on my machines. I will try to improve it. Thx~