graph4ai / graph4nlp

Graph4nlp is a library for the easy use of Graph Neural Networks for NLP. Welcome to visit our DLG4NLP website (https://dlg4nlp.github.io/index.html) for various learning resources!
Apache License 2.0

How to add an edge with attribute value? #528

Closed nashid closed 2 years ago

nashid commented 2 years ago

❓ Questions and Help

How to add an edge with attribute value?

    def add_edge(self, src: int, tgt: int):
        """
        Add one edge to the graph.

        Parameters
        ----------
        src : int
            Source node index
        tgt : int
            Target node index
        """

Currently, I can only add an edge between two nodes using their node indices. But how do I set the edge type? I can't find any API like add_edge(self, src: int, tgt: int, edge_type: str).

Can anyone please provide a pointer or code snippet?

SaizhuoWang commented 2 years ago

Hi. Currently there's no direct interface for adding an edge with attributes. You may add the edges first and then modify edge_attributes. For example:

g = GraphData()
g.add_nodes(4)  # nodes are 0-indexed, so 4 nodes gives valid indices 0-3
g.add_edges([1, 2], [2, 3])
g.edge_attributes[0] = {'type': 'r1'}
g.edge_attributes[1] = {'type': 'r2'}

edge_attributes is essentially a list of dictionaries, one per edge.
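The two-step pattern above (add the edge, then write its attribute dict) can be wrapped in a small helper. The sketch below uses a minimal stand-in class rather than the real GraphData, so the class and method names here are illustrative only:

```python
class TinyGraph:
    """Minimal stand-in mimicking GraphData's edge-attribute layout."""
    def __init__(self):
        self.edges = []            # list of (src, tgt) pairs
        self.edge_attributes = []  # one dict per edge, same order

    def add_edge(self, src, tgt):
        self.edges.append((src, tgt))
        self.edge_attributes.append({})


def add_typed_edge(g, src, tgt, edge_type):
    """Add an edge, then set its 'type' attribute, in one call."""
    g.add_edge(src, tgt)
    g.edge_attributes[len(g.edges) - 1]["type"] = edge_type


g = TinyGraph()
add_typed_edge(g, 0, 1, "r1")
add_typed_edge(g, 1, 2, "r2")
print(g.edge_attributes)  # [{'type': 'r1'}, {'type': 'r2'}]
```

The same wrapper idea should port to a real GraphData instance, since its edge_attributes field is also indexable by edge insertion order.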

nashid commented 2 years ago

@SaizhuoWang thanks for the pointers. I explored further; here are my findings:

The rest of the attributes are dropped by the pipeline during training. Is this understanding correct?

SaizhuoWang commented 2 years ago

Yes, your finding is mostly right. For both nodes and edges, we have two parallel ways of storing the related info: features and attributes. Features are for numerical info, namely torch.Tensors: a dictionary whose keys are feature names and whose values are the corresponding tensors. Attributes are lists of dictionaries, where each dictionary corresponds to one node/edge and is independent of the others. You may add literally anything to the attribute dictionary as long as it can be represented as a key-value pair.

When computation is performed, DGL is involved and graph4nlp.GraphData is converted to dgl.DGLGraph. Through this conversion only features are kept; attributes are not passed to the resulting DGLGraph, since DGL only supports tensor-ized data.
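The features/attributes split can be sketched with plain Python containers (stand-ins for the real GraphData fields; in the library, feature values are torch.Tensors rather than nested lists):

```python
# Features: name -> tensor-like numeric data, one row per node.
node_features = {
    "node_feat": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # stand-in for a torch.Tensor
}

# Attributes: one free-form dict per node; anything key-value shaped is allowed.
node_attributes = [
    {"token": "the", "pos": "DET"},
    {"token": "cat", "pos": "NOUN"},
    {"token": "sat", "pos": "VERB"},
]


def to_backend_graph(features, attributes):
    """Mimics the GraphData -> DGLGraph conversion: only numeric
    features cross the boundary; attributes are left behind."""
    return {"ndata": dict(features)}  # attributes intentionally dropped


dgl_like = to_backend_graph(node_features, node_attributes)
print(dgl_like.keys())  # only 'ndata' survives; no tokens or POS tags
```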

SaizhuoWang commented 2 years ago

As for the name token: it's just a naming convention, since the most frequently used piece of info attached to a node/edge in an NLP context is its token.

nashid commented 2 years ago

@SaizhuoWang thanks for your response. I am trying to understand the flow:

Would really appreciate it if you could share your insight.

SaizhuoWang commented 2 years ago

Thanks for your interest. To clarify, let's review the general workflow of graph4nlp, which can be roughly presented as: Graph Construction -> Graph Encoding (where the GNN is involved) -> Graph Decoding.

dgl, as a GNN library, is only involved in the Graph Encoding (GNN) stage, so the stages before and after it still use GraphData. Attributes are dropped when converting to DGLGraph, but the original GraphData still holds this information, and both Graph Construction and Graph Decoding need attributes. For example, in the graph construction stage, the embedding part takes the original tokens to build the vocab and learn embeddings. In the decoding stage, for example in some natural language generation tasks, the original tokens may still be needed.
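That flow can be mocked end to end in a few lines. These are pure-Python stand-ins with hypothetical names (construct/encode/decode), not the library's API; the point is only that the GNN stage sees features alone, while decoding can still read attributes from the original graph:

```python
def construct(sentence):
    """Graph Construction: build attributes (tokens) and numeric features."""
    tokens = sentence.split()
    attributes = [{"token": t} for t in tokens]
    features = {"emb": [[float(len(t))] for t in tokens]}  # toy "embedding"
    return features, attributes


def encode(features):
    """Graph Encoding: the GNN (via DGL) sees only the numeric features."""
    return [sum(row) for row in features["emb"]]


def decode(encoded, attributes):
    """Graph Decoding: the original attributes are still available."""
    return [a["token"] for a in attributes]


feats, attrs = construct("hello graph world")
hidden = encode(feats)        # attributes never entered this stage
out = decode(hidden, attrs)   # tokens recovered from the original graph
print(out)  # ['hello', 'graph', 'world']
```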

If you have any further problems, please LMK. Thanks.

nashid commented 2 years ago

@SaizhuoWang thanks for the clarification. I am using graph4nlp for a Neural Machine Translation (NMT) task.

I used single_token_item for each node, and sequential_link is set to true, so token(N-1) and tokenN are connected by an edge:

token1 - token2 - … - token(N-1) - tokenN
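A sequential-link chain like this is easy to sketch: nodes are tokens and each consecutive pair is joined by an edge. This is a plain-Python illustration of the topology, not the library's construction code:

```python
def sequential_link_edges(tokens):
    """Return the (src, tgt) index pairs of a token chain:
    node i is connected to node i+1."""
    return [(i, i + 1) for i in range(len(tokens) - 1)]


tokens = ["token1", "token2", "token3", "token4"]
print(sequential_link_edges(tokens))  # [(0, 1), (1, 2), (2, 3)]
```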

SaizhuoWang commented 2 years ago

In your example, the original token is stored in node_attributes and the token embedding, which is a torch.Tensor, is stored in node_features. At the interface of Graph Construction -> GNN, node_features are passed from GraphData to DGLGraph. So the token embeddings are passed, but the original tokens are not passed to the DGLGraph.

nashid commented 2 years ago

@SaizhuoWang thanks for the clarification.

But I see there are attributes like 'type' which are used when edge_strategy == "as_node". Is it mandatory to set type when I am building a Levi graph, or does it not matter as long as I set node_attributes[node_idx]['token'] = edge_type?

Thanks for the clarification.

SaizhuoWang commented 2 years ago

Thanks for your reply. There are no built-in mandatory requirements on any of the node_attributes or edge_attributes. There do exist some default names, as you may find in https://github.com/graph4ai/graph4nlp/blob/d980e897131f1b9d3766750c06316d94749904fa/graph4nlp/pytorch/data/data.py#L40 , but the naming conventions are specific to different examples and use cases. Please determine the required attribute names or data based on your own use case.
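For the Levi-graph (edge-as-node) case discussed above, the transformation can be sketched as follows. This is a hypothetical helper, not the library's implementation: each typed edge (src, type, tgt) becomes a new node whose token is the edge type, wired as src -> new node -> tgt:

```python
def to_levi(num_nodes, typed_edges):
    """typed_edges: list of (src, edge_type, tgt) triples.
    Returns (node_attributes, edges) of the resulting Levi graph."""
    node_attributes = [{"token": None} for _ in range(num_nodes)]
    edges = []
    for src, edge_type, tgt in typed_edges:
        mid = len(node_attributes)                # fresh node for this edge
        node_attributes.append({"token": edge_type})
        edges.append((src, mid))
        edges.append((mid, tgt))
    return node_attributes, edges


attrs, edges = to_levi(3, [(0, "r1", 1), (1, "r2", 2)])
print(edges)  # [(0, 3), (3, 1), (1, 4), (4, 2)]
```

Under this scheme the edge type ends up in the new node's token attribute, consistent with the idea that no specific attribute name is mandatory by the library itself.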

nashid commented 2 years ago

@SaizhuoWang still trying to understand 😓

From what I see in the code, token is mandatory. For example:

https://github.com/graph4ai/graph4nlp/blob/master/graph4nlp/pytorch/data/dataset.py#L109

    def extract(self):
        """
        Returns
        -------
        Input tokens and output tokens
        """
        g: GraphData = self.graph

        input_tokens = []
        for i in range(g.get_node_num()):
            if self.tokenizer is None:
                tokenized_token = g.node_attributes[i]["token"].strip().split(" ")
            else:
                tokenized_token = self.tokenizer(g.node_attributes[i]["token"])

            input_tokens.extend(tokenized_token)

        if self.tokenizer is None:
            output_tokens = self.output_text.strip().split(" ")
        else:
            output_tokens = self.tokenizer(self.output_text)

        return input_tokens, output_tokens

The code always expects the text inside a node of the GraphData to be stored in the token node_attribute.
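Concretely, as long as every node carries a token attribute, an extract-style pass works. Here is a dependency-free re-creation of the input-token loop above (operating on a plain list of attribute dicts instead of a GraphData):

```python
def extract_tokens(node_attributes, tokenizer=None):
    """Mimics the loop in dataset.py: read each node's 'token',
    split on spaces (or apply a tokenizer), and concatenate."""
    input_tokens = []
    for attrs in node_attributes:
        if tokenizer is None:
            input_tokens.extend(attrs["token"].strip().split(" "))
        else:
            input_tokens.extend(tokenizer(attrs["token"]))
    return input_tokens


nodes = [{"token": "new york"}, {"token": "city"}]
print(extract_tokens(nodes))  # ['new', 'york', 'city']
```

A node without a token key would raise a KeyError in a loop like this, which matches the observation that token is effectively mandatory for this pipeline.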