Closed: nashid closed this issue 2 years ago
Hi. Currently there's no direct interface for adding an edge with attributes. You may add the edges first and then modify `edge_attributes`. For example:

```python
g = GraphData()
g.add_nodes(4)  # node indices 0-3, so the edges (1, 2) and (2, 3) below are valid
g.add_edges([1, 2], [2, 3])
g.edge_attributes[0] = {'type': 'r1'}
g.edge_attributes[1] = {'type': 'r2'}
```

`edge_attributes` is essentially a list of dictionaries.
@SaizhuoWang thanks for the pointers. I explored further and here are my findings:

- `g.node_attributes[i]['token']` is used.
- `g.edge_attributes[i]['token']` is used (but currently it is not actually used, since heterogeneous graphs are not supported yet).
- The rest of the attributes are dropped by the pipeline during training.

Is this understanding correct?
Yes, your finding is mostly right. For both nodes and edges, we have two parallel ways of storing the related info: features and attributes. Features are for numerical info, namely `torch.Tensor`s; they form a dictionary whose keys are feature names and whose values are the corresponding tensors. Attributes are lists of dictionaries, where each dictionary corresponds to one node/edge and is independent of the others. You may add literally anything to the attribute dictionary as long as it can be represented as a KV pair.

When performing computation, DGL is involved and there's a conversion from `graph4nlp.GraphData` to `dgl.DGLGraph`. Through this conversion only `features` are kept; `attributes` are not passed to the corresponding `DGLGraph`, since `dgl` only supports tensor-ized data.

As for the name `token`, it's just a naming convention, since the most frequently used piece of info attached to a node/edge in an NLP context is its token.
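To make the two stores concrete, here is a minimal runnable sketch using plain Python objects; the names mirror `GraphData`'s `node_features`/`node_attributes` fields, but this is not the actual graph4nlp API:

```python
import torch

# Sketch of the two parallel stores: features (tensors) vs. attributes (dicts).
num_nodes = 3
node_features = {}                                # feature name -> torch.Tensor
node_attributes = [{} for _ in range(num_nodes)]  # one dict per node

# Numerical info goes into features as tensors (these survive the DGL conversion):
node_features['node_feat'] = torch.randn(num_nodes, 16)

# Arbitrary KV pairs go into attributes (these stay on the GraphData side):
node_attributes[0]['token'] = 'hello'
node_attributes[1]['token'] = 'world'
node_attributes[2]['pos_tag'] = 'NOUN'  # any key works, not just 'token'
```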
@SaizhuoWang thanks for your response. I am trying to understand the flow:

Firstly, if all `attributes` are dropped, then what's the point of building the attribute dictionary in the first place?

During debugging, I saw that `build_vocab()` in `dataset.py` reads from `node_attributes` (`g.node_attributes[i]["token"]`) to build the vocab model. So is it that the embedding step learns the embeddings for the words, and these embedding values are then set as `features` on the `graph4nlp.GraphData`? Following that, the `graph4nlp.GraphData` is converted to the DGL `dgl.DGLGraph`.

Would really appreciate it if you could share your insight.
Thanks for your interest. To clarify this, let's review the general workflow of `graph4nlp`, which can be roughly presented as:

Graph Construction -> Graph Encoding (that's where the GNN is involved) -> Graph Decoding

`dgl`, as a GNN library, is only involved in the Graph Encoding (GNN) stage, so the former and latter stages still use `GraphData`. `attributes` are dropped when converting to `DGLGraph`, but the original `GraphData` still holds this information, and both Graph Construction and Graph Decoding need `attributes`. For example, in the graph construction stage the embedding part involves taking the original tokens to build the vocab and learn embeddings. In the decoding stage, for example in natural language generation tasks, the original tokens may still be needed.

If you have any further problems, please LMK. Thanks.
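A hedged sketch of that conversion boundary: only tensor features cross into the backend graph, while attributes stay on the original object. The class and function names here are hypothetical, not graph4nlp's actual API:

```python
import torch

class ToyGraphData:
    """Stand-in for GraphData with the two parallel stores."""
    def __init__(self, num_nodes):
        self.node_features = {}                                # name -> torch.Tensor
        self.node_attributes = [{} for _ in range(num_nodes)]  # per-node dicts

def to_backend_graph(g):
    """Mimics the GraphData -> DGLGraph step: only tensor features cross over."""
    return dict(g.node_features)

g = ToyGraphData(2)
g.node_features['node_feat'] = torch.zeros(2, 4)
g.node_attributes[0]['token'] = 'hi'

backend = to_backend_graph(g)
# `backend` carries 'node_feat' but knows nothing about 'token';
# g.node_attributes still holds the tokens for the decoding stage.
```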
@SaizhuoWang thanks for the clarification. I am using `graph4nlp` for a Neural Machine Translation (NMT) task.

I used `single_token_item` for each node, and `sequential_link` is set to true, so token N-1 and token N are connected by an edge:

token1 - token2 - … - token(N-1) - tokenN
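The chain above can be sketched as index pairs (plain Python, no graph library needed):

```python
# Sequential links: node i-1 -> node i for each adjacent token pair.
tokens = ['token1', 'token2', 'token3', 'token4']
src = list(range(len(tokens) - 1))  # [0, 1, 2]
tgt = list(range(1, len(tokens)))   # [1, 2, 3]
edges = list(zip(src, tgt))         # [(0, 1), (1, 2), (2, 3)]
```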
Based on my understanding, during Graph Construction the embedding generation step learns the initial embedding (let's say `word2vec`) for the nodes. Following that, let's say a `GCN` would start from this initial `word2vec` embedding.

But I am confused here, since you said information is dropped when the `DGLGraph` is created. The initially learned embedding must be passed to the underlying DGL in some way, mustn't it? How is the initial graph embedding used?
In your example, the original token is stored in `node_attributes`, and the token embedding, which is a `torch.Tensor`, is stored in `node_features`. At the interface of Graph Construction -> `GCN`, the `node_features` are passed from `GraphData` to `DGLGraph`. So the token embeddings are passed, but the original tokens are not passed to the `DGLGraph`.
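A minimal sketch of the token-to-tensor step at that interface, assuming a toy vocabulary; the names here are illustrative stand-ins, not graph4nlp's embedding module:

```python
import torch
import torch.nn as nn

# Toy vocab and embedding table; in graph4nlp these would come from the
# vocab model and the embedding layer learned during training.
vocab = {'<unk>': 0, 'hello': 1, 'world': 2}
embedding = nn.Embedding(len(vocab), 8)

tokens = ['hello', 'world']  # the node_attributes side (strings)
ids = torch.tensor([vocab.get(t, vocab['<unk>']) for t in tokens])
node_feat = embedding(ids)   # the node_features side (a [2, 8] torch.Tensor)
# node_feat is what gets passed into the DGLGraph;
# the strings in `tokens` remain only on the GraphData.
```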
@SaizhuoWang thanks for the clarification.

- `build_vocab()` in `dataset.py` reads from `node_attributes` (`g.node_attributes[i]["token"]`) and builds the vocab model.
- At the Graph Construction -> `GCN` interface, `node_features` from `GraphData` are passed to the `DGLGraph`.

But I see there are attributes like `'type'` which are used when `edge_strategy == "as_node"`. Is it mandatory to set `type` when I am building a Levi graph? Or, as long as I set `node_attributes[node_idx]['token'] = edge_type`, does it not matter?

Thanks for the clarification.
Thanks for your reply. There are no built-in mandatory requirements on any of the `node_attributes` or `edge_attributes`. There do exist some default names, as you may find in https://github.com/graph4ai/graph4nlp/blob/d980e897131f1b9d3766750c06316d94749904fa/graph4nlp/pytorch/data/data.py#L40, but the naming conventions are specific to different examples and use cases. Please work out the required attribute names or data based on your own use case.
@SaizhuoWang still trying to understand 😓

From what I see in the code, `token` is mandatory. For example:

https://github.com/graph4ai/graph4nlp/blob/master/graph4nlp/pytorch/data/dataset.py#L109
```python
def extract(self):
    """
    Returns
    -------
    Input tokens and output tokens
    """
    g: GraphData = self.graph
    input_tokens = []
    for i in range(g.get_node_num()):
        if self.tokenizer is None:
            tokenized_token = g.node_attributes[i]["token"].strip().split(" ")
        else:
            tokenized_token = self.tokenizer(g.node_attributes[i]["token"])
        input_tokens.extend(tokenized_token)
    if self.tokenizer is None:
        output_tokens = self.output_text.strip().split(" ")
    else:
        output_tokens = self.tokenizer(self.output_text)
```
The code always expects the text inside a node of the `GraphData` to be stored under the `token` key of `node_attributes`.
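A toy illustration (plain Python, not graph4nlp code) of why a missing `token` key breaks this kind of read:

```python
# Reading node text the way extract() does fails for any node whose
# attribute dict lacks the 'token' key.
node_attributes = [{'token': 'hello world'}, {'type': 'r1'}]  # second node has no 'token'

tokens = []
for attrs in node_attributes:
    try:
        tokens.extend(attrs['token'].strip().split(' '))
    except KeyError:
        tokens.append('<missing>')  # extract() itself would just raise here
# tokens -> ['hello', 'world', '<missing>']
```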
❓ Questions and Help

How to add an edge with an attribute value?

Currently, I can only add an edge between two nodes using their node indices. But how do I set the `edge_type`? I can't find any API like `add_edge(self, src: int, tgt: int, edge_type: str)`. Can anyone please provide a pointer or code snippet?