Closed: yinzhangyue closed this issue 2 years ago.
Could you share the full traceback with me? Also, you should divide the batch size by the number of GPUs you have (You should only use distributed training when you have multiple GPUs).
On Wed, Mar 30, 2022 at 12:11 PM Zhangyue Yin wrote:
Hi~ This error occurs when I use PyTorch DDP.
AttributeError: 'PCQM4MV2_Training' object has no attribute '_train_dataset'.
This is my config.
scheme: pcqm4mv2
model_name: egt_110m
# distributed: false # Set = true for multi-gpu
distributed: true # Set = true for multi-gpu
batch_size: 512 # For 8 GPUs: 512//8=64
model_height: 30
node_width: 768
edge_width: 64
num_heads: 32
num_epochs: 1000
max_lr: 8.0e-05
attn_dropout: 0.3
lr_warmup_steps: 240000
lr_total_steps: 1000000
node_ffn_multiplier: 1.0
edge_ffn_multiplier: 1.0
upto_hop: 16
dataloader_workers: 1 # For multi-process data fetch
scale_degree: true
num_virtual_nodes: 4
svd_random_neg: true
My machine has 8 Titan Xp GPUs and 256 GB of RAM. The system is Ubuntu 18.04 with PyTorch 1.10.2+cu113. I didn't make any changes to the code except for the configuration file. These are my running instructions:
python run_training.py configs/pcqm4mv2/egt_110m.yaml
This is my config.
scheme: pcqm4mv2
model_name: egt_110m
# distributed: false # Set = true for multi-gpu
distributed: true # Set = true for multi-gpu
batch_size: 64 # For 8 GPUs: 512//8=64
model_height: 30
node_width: 768
edge_width: 64
num_heads: 32
num_epochs: 1000
max_lr: 8.0e-05
attn_dropout: 0.3
lr_warmup_steps: 240000
lr_total_steps: 1000000
node_ffn_multiplier: 1.0
edge_ffn_multiplier: 1.0
upto_hop: 16
dataloader_workers: 1 # For multi-process data fetch
scale_degree: true
num_virtual_nodes: 4
svd_random_neg: true
The following is the error:
Initiated rank: 0
Initiated rank: 3
Initiated rank: 7
Initiated rank: 5
Initiated rank: 2
Initiated rank: 6
Initiated rank: 4
Initiated rank: 1
Traceback (most recent call last):
  File "run_training.py", line 6, in <module>
    execute('train', config)
  File "/remote-home/zyyin/Experiment/OGB/egt_pytorch/lib/training/execute.py", line 87, in execute
    torch.multiprocessing.spawn(fn = run_worker,
  File "/remote-home/zyyin/anaconda3/envs/grad/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/remote-home/zyyin/anaconda3/envs/grad/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/remote-home/zyyin/anaconda3/envs/grad/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 6 terminated with signal SIGKILL
I think it may be caused by the RAM limitation. How can I solve it?
Without DDP, I can run it correctly.
Does it run for the smaller datasets, i.e. MolPCBA and MolHIV (remove the config for pretrained weights)? If so, that might confirm that it's occurring because of memory limitations.
My script actually loads the whole dataset into memory for each process that is spawned (total memory consumed = dataset size × number of workers), which is inefficient. I plan to improve it in the future with shared memory among the processes. Or you could try to implement that yourself.
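Below is a minimal sketch of that shared-memory idea, assuming the dataset can be collated into plain CPU tensors before spawning; load_dataset_tensors and run_worker are placeholder names here, not the actual functions or classes in this repository.

import torch
import torch.multiprocessing as mp

def load_dataset_tensors():
    # Placeholder loader: in practice this would be the full PCQM4Mv2 data,
    # which is what consumes ~50 GB per process when every worker loads it itself.
    features = torch.randn(100_000, 64)
    targets = torch.randn(100_000)
    return features, targets

def run_worker(rank, world_size, features, targets):
    # Each spawned process receives a handle to the same shared storage,
    # so the dataset is held in RAM once instead of world_size times.
    print(f"rank {rank}: got tensors of shape {tuple(features.shape)}")
    # ... init_process_group, build the DataLoader / DDP model, train as usual ...

if __name__ == "__main__":
    world_size = 8
    # The 'file_system' strategy can avoid "too many open files" errors
    # when sharing large amounts of data across processes.
    mp.set_sharing_strategy("file_system")
    features, targets = load_dataset_tensors()
    # Move the storage into shared memory before spawning; torch.multiprocessing's
    # reducers then pass shared-memory handles to the children instead of copies.
    features.share_memory_()
    targets.share_memory_()
    mp.spawn(run_worker, args=(world_size, features, targets), nprocs=world_size)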
Another option: you may want to take the model and train it with a different framework such as PyTorch Lightning. Note that in my code the model, the dataset, and the training logic are separate (in different directories), so you could easily replace the training logic with your own.
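For reference, a minimal sketch of what such a wrapper could look like in PyTorch Lightning; the model constructor, the batch layout, and the L1 loss below are placeholder assumptions, not this repository's actual interfaces.

import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class EGTLightningModule(pl.LightningModule):
    def __init__(self, model, max_lr=8.0e-05):
        super().__init__()
        self.model = model      # e.g. an EGT model built from the model directory
        self.max_lr = max_lr

    def training_step(self, batch, batch_idx):
        # Placeholder loss: PCQM4Mv2 is a regression task on the HOMO-LUMO gap,
        # and the "target" key is an assumption about the batch layout.
        preds = self.model(batch)
        loss = F.l1_loss(preds, batch["target"])
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.max_lr)

# Lightning then handles spawning, DDP setup, and gradient sync:
# model = EGTLightningModule(build_egt_model(...))            # placeholder constructor
# trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp", max_epochs=1000)
# trainer.fit(model, train_dataloaders=make_dataloader(...))  # placeholder dataloader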
OK, you are right! The RAM consumption is so large because each process needs 50 GB of RAM, which is unacceptable on my machines. I will try to improve it. Thx~