graph4ai / graph4nlp

Graph4NLP is a library for the easy use of Graph Neural Networks for NLP. Visit our DLG4NLP website (https://dlg4nlp.github.io/index.html) for various learning resources!
Apache License 2.0

Running machine translation using different GNNs #536

Closed: smith-co closed this issue 2 years ago

smith-co commented 2 years ago

❓ Questions and Help

I am running the NMT example on the same dataset with different GNN variants (GCN, GAT, GGNN, GraphSAGE).

While execution succeeds with GCN and GAT, I get an Out-of-Memory (OOM) error for GGNN and GraphSAGE. Can anyone help me with this?

AlanSwift commented 2 years ago

Please try a smaller batch_size or try another GPU with larger memory.
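One common way to keep the same effective batch size while using less memory is gradient accumulation: run several smaller micro-batches and average their gradients before each optimizer step. A minimal, framework-free sketch of the bookkeeping (the numbers are illustrative only, not taken from the NMT example):

```python
def accumulate(micro_batch_grads, accum_steps):
    """Average gradients over accum_steps micro-batches per optimizer update."""
    updates, buf, n = [], 0.0, 0
    for g in micro_batch_grads:
        buf += g / accum_steps  # scale each micro-batch gradient
        n += 1
        if n == accum_steps:    # one optimizer step per accum_steps micro-batches
            updates.append(buf)
            buf, n = 0.0, 0
    return updates

# 8 micro-batch gradients accumulated 4 at a time -> 2 optimizer updates,
# matching the average gradient of 2 batches that are 4x larger.
print(accumulate([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 4))  # [2.5, 6.5]
```

In PyTorch this corresponds to dividing the loss by `accum_steps`, calling `loss.backward()` every micro-batch, and calling `optimizer.step()` / `optimizer.zero_grad()` only every `accum_steps` iterations.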

smith-co commented 2 years ago

@AlanSwift I already tried a smaller batch size. What I find surprising is:

It's the same dataset, but GGNN and GraphSAGE fail to run while GCN and GAT work.

So do GGNN/GraphSAGE need more resources for some reason? I'd be super interested to know why.

AlanSwift commented 2 years ago

We haven't investigated the memory efficiency of DGL :). It seems that GGNN and GraphSAGE need more GPU memory.
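A back-of-envelope estimate suggests where the gap may come from: the GGNN layer applies a linear map to the source feature of every edge (per edge type, per propagation step), which materializes an |E| x d tensor, whereas a GCN-style fused aggregation only keeps |V| x d node features. A rough calculator (the graph sizes below are hypothetical, not measured from the example):

```python
def feature_bytes(num_rows, hidden_dim, bytes_per_float=4):
    """Bytes for a dense float32 feature matrix of shape (num_rows, hidden_dim)."""
    return num_rows * hidden_dim * bytes_per_float

# Hypothetical batched NLP graph: 20k nodes, 600k edges, hidden size 512.
nodes, edges, d = 20_000, 600_000, 512
node_feats = feature_bytes(nodes, d)  # what node-level aggregation keeps around
edge_feats = feature_bytes(edges, d)  # what a per-edge transform materializes

print(node_feats // 2**20, "MiB per node-feature tensor")  # 39 MiB
print(edge_feats // 2**20, "MiB per edge-feature tensor")  # 1171 MiB
```

Since NLP graphs are often dense (|E| much larger than |V|), one such edge tensor per edge type and per GGNN step can plausibly exhaust a 16 GiB GPU even at a small batch size.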

smith-co commented 2 years ago

@AlanSwift I get this OOM error at runtime for GGNN:

  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/models/graph2seq.py", line 226, in forward
    return self.encoder_decoder(batch_graph=batch_graph, oov_dict=oov_dict, tgt_seq=tgt_seq)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/models/graph2seq.py", line 173, in encoder_decoder
    batch_graph = self.gnn_encoder(batch_graph)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/modules/graph_embedding/ggnn.py", line 557, in forward
    h = self.models(dgl_graph, (feat_in, feat_out), etypes, edge_weight)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/modules/graph_embedding/ggnn.py", line 442, in forward
    return self.model(graph, node_feats, etypes, edge_weight)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/modules/graph_embedding/ggnn.py", line 210, in forward
    graph_in.apply_edges(
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/dgl_cu111-0.7a210520-py3.9-linux-x86_64.egg/dgl/heterograph.py", line 4300, in apply_edges
    edata = core.invoke_edge_udf(g, eid, etype, func)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/dgl_cu111-0.7a210520-py3.9-linux-x86_64.egg/dgl/core.py", line 85, in invoke_edge_udf
    return func(ebatch)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/modules/graph_embedding/ggnn.py", line 212, in <lambda>
    "W_e*h": self.linears_in[i](edges.src["h"])
RuntimeError: CUDA out of memory. Tried to allocate 1.12 GiB (GPU 3; 14.76 GiB total capacity; 11.83 GiB already allocated; 447.75 MiB free; 12.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any idea?
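The error message itself points at one mitigation: if reserved memory is much larger than allocated memory, fragmentation may be the problem, and `max_split_size_mb` can be set through the `PYTORCH_CUDA_ALLOC_CONF` environment variable (documented in PyTorch's CUDA memory-management notes). It must be set before the first CUDA allocation; the value 128 below is just an arbitrary starting point to tune:

```python
import os

# Must be set before the first CUDA allocation (ideally before importing torch).
# 128 MiB is an arbitrary starting point; smaller values reduce fragmentation
# at some cost in allocator efficiency.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Note this only addresses allocator fragmentation; it cannot shrink the large per-edge tensors that the GGNN forward pass actually allocates.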

smith-co commented 2 years ago

@AlanSwift I came across this discussion on the DGL forum: Memory consumption of the GGNN module

AlanSwift commented 2 years ago

It seems that DGL sacrifices memory efficiency for time efficiency. We will pay attention to this problem. Thank you for letting us know!
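For what it's worth, when all edges of one type share a single weight matrix (as with GGNN's `W_e`), the transformation can in principle be applied once per node and then gathered per edge, rather than once per edge; the results are identical, but the transformed intermediate is |V| x d instead of |E| x d. A toy, framework-free sketch of that equivalence (the helper names are hypothetical, not graph4nlp or DGL APIs):

```python
def transform_per_edge(src_of_edge, h, w):
    """Transform the source feature of every edge (O(|E|*d) intermediate)."""
    return [[w * x for x in h[s]] for s in src_of_edge]

def transform_per_node(src_of_edge, h, w):
    """Transform each node once, then gather per edge (O(|V|*d) intermediate)."""
    wh = [[w * x for x in v] for v in h]
    return [wh[s] for s in src_of_edge]

h = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 node feature vectors
src = [0, 0, 1, 2, 2, 2]                  # source node of each of 6 edges

# Both orderings produce identical messages; only peak memory differs.
assert transform_per_edge(src, h, 0.5) == transform_per_node(src, h, 0.5)
```

Whether DGL's fused message-passing kernels can exploit this rewrite inside the GGNN gated update is a separate question; the sketch only shows where the memory gap originates.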

smith-co commented 2 years ago

@AlanSwift can you please provide a fix or a suggested workaround? 🙏

nashid commented 2 years ago

@AlanSwift, this is interesting. I faced the same problem. Do you have any solution to this?

nashid commented 2 years ago

@AlanSwift do you have a plan to address this limitation of the GGNN implementation?

AlanSwift commented 2 years ago

Currently, this is not in my plan, since the issue lies upstream in DGL.