Closed: nashid closed this issue 2 years ago
Hi. Currently there's no direct interface for adding an edge with attributes. You may add the edges first and then modify `edge_attributes`. For example:

```python
g = GraphData()
g.add_nodes(4)  # node indices 0-3, so the edges (1, 2) and (2, 3) below are valid
g.add_edges([1, 2], [2, 3])
g.edge_attributes[0] = {'type': 'r1'}
g.edge_attributes[1] = {'type': 'r2'}
```

`edge_attributes` is essentially a list of dictionaries.
@SaizhuoWang thanks for the pointers. I explored further and here are my findings:

- `g.node_attributes[i]['token']` is used.
- `g.edge_attributes[i]['token']` is used (but currently it is not actually used, since heterogeneous graphs are not supported yet).
- The rest of the attributes are dropped by the pipeline during training.

Is this understanding correct?
Yes, your finding is mostly right. For both nodes and edges, we have two parallel ways of storing the related info: features and attributes. Features are for numerical info, namely `torch.Tensor`s; they form a dictionary whose keys are feature names and whose values are the corresponding tensors. Attributes are lists of dictionaries, where each dictionary corresponds to one node/edge and is independent of the others. You may add literally anything to the attribute dictionary as long as it can be represented as a KV pair.

When performing computation, DGL is involved and there's a conversion from `graph4nlp.GraphData` to `dgl.DGLGraph`. Through this conversion only `features` are kept; `attributes` are not passed to the corresponding `DGLGraph`, since `dgl` only supports tensor-ized data.

As for the name `token`, it's just a naming convention, since the most frequently used piece of info attached to a node/edge in an NLP context is its token.
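To make the two stores concrete, here is a minimal runnable sketch using plain Python objects; the names mirror `GraphData`'s `node_features`/`node_attributes` fields, but this is not the actual graph4nlp API:

```python
import torch

# Sketch of the two parallel stores: features (tensors) vs. attributes (dicts).
num_nodes = 3
node_features = {}                                # feature name -> torch.Tensor
node_attributes = [{} for _ in range(num_nodes)]  # one dict per node

# Numerical info goes into features as tensors (these survive the DGL conversion):
node_features['node_feat'] = torch.randn(num_nodes, 16)

# Arbitrary KV pairs go into attributes (these stay on the GraphData side):
node_attributes[0]['token'] = 'hello'
node_attributes[1]['token'] = 'world'
node_attributes[2]['pos_tag'] = 'NOUN'  # any key works, not just 'token'
```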
@SaizhuoWang thanks for your response. I am trying to understand the flow:

Firstly, if all `attributes` are dropped, then what's the point of building the attribute dictionary in the first place?

During debugging, I saw that `build_vocab()` in `dataset.py` reads from `node_attributes` (`g.node_attributes[i]["token"]`) to build the vocab model. So is it that the embedding step learns the embeddings for the words, and these embedding values are then set as `features` on the `graph4nlp.GraphData`? Following that, the `graph4nlp.GraphData` is converted to the DGL `dgl.DGLGraph`.

Would really appreciate it if you could share your insight.
Thanks for your interest. To clarify this, let's review the general workflow of `graph4nlp`, which can be roughly presented as:

Graph Construction -> Graph Encoding (that's where the GNN is involved) -> Graph Decoding

`dgl`, as a GNN library, is only involved in the Graph Encoding (GNN) stage, so the former and latter stages still use `GraphData`. `attributes` are dropped when converting to `DGLGraph`, but the original `GraphData` still holds this information, and both Graph Construction and Graph Decoding need `attributes`. For example, in the graph construction stage the embedding part involves taking the original tokens to build the vocab and learn embeddings. In the decoding stage, for example in natural language generation tasks, the original tokens may still be needed.

If you have any further problems, please LMK. Thanks.
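A hedged sketch of that conversion boundary: only tensor features cross into the backend graph, while attributes stay on the original object. The class and function names here are hypothetical, not graph4nlp's actual API:

```python
import torch

class ToyGraphData:
    """Stand-in for GraphData with the two parallel stores."""
    def __init__(self, num_nodes):
        self.node_features = {}                                # name -> torch.Tensor
        self.node_attributes = [{} for _ in range(num_nodes)]  # per-node dicts

def to_backend_graph(g):
    """Mimics the GraphData -> DGLGraph step: only tensor features cross over."""
    return dict(g.node_features)

g = ToyGraphData(2)
g.node_features['node_feat'] = torch.zeros(2, 4)
g.node_attributes[0]['token'] = 'hi'

backend = to_backend_graph(g)
# `backend` carries 'node_feat' but knows nothing about 'token';
# g.node_attributes still holds the tokens for the decoding stage.
```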
@SaizhuoWang thanks for the clarification. I am using `graph4nlp` for a Neural Machine Translation (NMT) task.

I used `single_token_item` for each node, and `sequential_link` is set to true, so token N-1 and token N are connected by an edge:

token1 - token2 - … - token(N-1) - tokenN
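The chain above can be sketched as index pairs (plain Python, no graph library needed):

```python
# Sequential links: node i-1 -> node i for each adjacent token pair.
tokens = ['token1', 'token2', 'token3', 'token4']
src = list(range(len(tokens) - 1))  # [0, 1, 2]
tgt = list(range(1, len(tokens)))   # [1, 2, 3]
edges = list(zip(src, tgt))         # [(0, 1), (1, 2), (2, 3)]
```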
Based on my understanding, during Graph Construction the embedding generation step learns the initial embedding (let's say `word2vec`) for the nodes. Following that, let's say a `GCN` would start from this initial `word2vec` embedding.

But I am confused here, since you said information is dropped when the `DGLGraph` is created. The initially learned embedding must be passed to the underlying DGL in some way, mustn't it? How is the initial graph embedding used?
In your example, the original token is stored in `node_attributes`, and the token embedding, which is a `torch.Tensor`, is stored in `node_features`. At the interface of Graph Construction -> `GCN`, the `node_features` are passed from `GraphData` to `DGLGraph`. So the token embeddings are passed, but the original tokens are not passed to the `DGLGraph`.
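A minimal sketch of the token-to-tensor step at that interface, assuming a toy vocabulary; the names here are illustrative stand-ins, not graph4nlp's embedding module:

```python
import torch
import torch.nn as nn

# Toy vocab and embedding table; in graph4nlp these would come from the
# vocab model and the embedding layer learned during training.
vocab = {'<unk>': 0, 'hello': 1, 'world': 2}
embedding = nn.Embedding(len(vocab), 8)

tokens = ['hello', 'world']  # the node_attributes side (strings)
ids = torch.tensor([vocab.get(t, vocab['<unk>']) for t in tokens])
node_feat = embedding(ids)   # the node_features side (a [2, 8] torch.Tensor)
# node_feat is what gets passed into the DGLGraph;
# the strings in `tokens` remain only on the GraphData.
```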
@SaizhuoWang thanks for the clarification.

- `build_vocab()` in `dataset.py` reads from `node_attributes` (`g.node_attributes[i]["token"]`) and builds the vocab model.
- At the Graph Construction -> `GCN` interface, `node_features` from `GraphData` are passed to the `DGLGraph`.

But I see there are attributes like `'type'` which are used when `edge_strategy == "as_node"`. Is it mandatory to set `type` when I am building a Levi graph? Or, as long as I set `node_attributes[node_idx]['token'] = edge_type`, does it not matter?

Thanks for the clarification.
Thanks for your reply. There are no built-in mandatory requirements on any of the `node_attributes` or `edge_attributes`. There do exist some default names, as you may find in https://github.com/graph4ai/graph4nlp/blob/d980e897131f1b9d3766750c06316d94749904fa/graph4nlp/pytorch/data/data.py#L40, but the naming conventions are specific to different examples and use cases. Please work out the required attribute names or data based on your own use case.
@SaizhuoWang still trying to understand 😓

From what I see in the code, `token` is mandatory. For example:

https://github.com/graph4ai/graph4nlp/blob/master/graph4nlp/pytorch/data/dataset.py#L109
```python
def extract(self):
    """
    Returns
    -------
    Input tokens and output tokens
    """
    g: GraphData = self.graph
    input_tokens = []
    for i in range(g.get_node_num()):
        if self.tokenizer is None:
            tokenized_token = g.node_attributes[i]["token"].strip().split(" ")
        else:
            tokenized_token = self.tokenizer(g.node_attributes[i]["token"])
        input_tokens.extend(tokenized_token)
    if self.tokenizer is None:
        output_tokens = self.output_text.strip().split(" ")
    else:
        output_tokens = self.tokenizer(self.output_text)
```
The code always expects the text inside a node of the `GraphData` to be stored under the `token` key of `node_attributes`.
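A toy illustration (plain Python, not graph4nlp code) of why a missing `token` key breaks this kind of read:

```python
# Reading node text the way extract() does fails for any node whose
# attribute dict lacks the 'token' key.
node_attributes = [{'token': 'hello world'}, {'type': 'r1'}]  # second node has no 'token'

tokens = []
for attrs in node_attributes:
    try:
        tokens.extend(attrs['token'].strip().split(' '))
    except KeyError:
        tokens.append('<missing>')  # extract() itself would just raise here
# tokens -> ['hello', 'world', '<missing>']
```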
❓ Questions and Help

How to add an edge with an attribute value?

Currently, I can only add an edge between two nodes using their node indices. But how do I set the `edge_type`? I can't find any API like `add_edge(self, src: int, tgt: int, edge_type: str)`. Can anyone please provide a pointer or code snippet?