How to build a custom dataset for the graph2seq model?

nashid commented 2 years ago

❓ Questions and Help

I want to use the graph2seq model to encode an input source code AST as a graph and decode it to a target sequence.

So far I am using the GraphData API to represent the source code AST as a graph. However, how would I feed the sequence data to the decoder?

I can't find a complete example showing the whole flow. Can anyone point me in the right direction?

AlanSwift commented 2 years ago

Can you please give a more concrete description? Is the difficulty in customizing your dataset, or in how to do the decoding once the GNN computation is finished?

nashid commented 2 years ago

@AlanSwift I am using the NMT example from graph4nlp.

My input/output is as below:

INPUT: off-by-one dot

OUTPUT: i < array . length

I started by following the class IWSLT14Dataset, which inherits from Text2TextDataset in that example.

I converted the INPUT graph in the picture into a GraphData object. However, the remaining steps are not clear to me, and I can't find a complete example showing the whole flow for feeding in custom data. I find the steps for customizing my dataset confusing.

nashid commented 2 years ago

@AlanSwift any feedback is welcome.

AlanSwift commented 2 years ago

Sorry for the late reply. I think the problem is how to convert your raw inputs to GraphData. One example is the dependency graph construction, which converts a dependency parse tree to GraphData. Please refer to https://github.com/graph4ai/graph4nlp/blob/master/graph4nlp/pytorch/modules/graph_construction/dependency_graph_construction.py.

nashid commented 2 years ago

@AlanSwift I have already parsed my file and converted that raw input into GraphData. Just to clarify, an example with hard-coded data for illustration:

    from graph4nlp.pytorch.data.data import GraphData

    def AST2Graph():
        g = GraphData()

        # add the leaf (terminal) nodes (blue squares T): ids 0..35
        g.add_nodes(36)

        # add the leaf node attributes
        g.node_attributes[0]['node_attr'] = 'T1:String'
        g.node_attributes[1]['node_attr'] = 'T2:basicIfElseMethod'
        g.node_attributes[2]['node_attr'] = 'T3:int'
        g.node_attributes[3]['node_attr'] = 'T4:hour'
        # ...

        # add the non-terminal nodes: ids 36..67
        g.add_nodes(32)

        # add the non-terminal node attributes
        g.node_attributes[36]['node_attr'] = 'NT1:MethodDeclaration (lineNumber=2)'
        g.node_attributes[37]['node_attr'] = 'NT2:SimpleType (lineNumber=2)'
        g.node_attributes[38]['node_attr'] = 'NT3:SingleVariableDeclaration (lineNumber=2)'
        # ...

        # add the edges from non-terminal (parent) nodes to terminal (leaf/child)
        # nodes: the first 36 relationships
        g.add_edges(
            [36, 37, 38, 39, 42, 43, 43, 44, 45, 45, 45, 48, 48, 50, 50, 51, 51, 52,
             53, 53, 53, 56, 56, 58, 58, 59, 59, 62, 63, 63, 65, 66, 66, 66, 67, 67],
            [1, 0, 3, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 16, 13, 14, 17,
             18, 19, 20, 21, 22, 25, 26, 23, 24, 27, 28, 29, 30, 31, 32, 33, 34, 35])

        # label every edge added so far as "child"
        for i in range(g.get_edge_num()):
            g.edge_attributes[i]['edge_attr'] = 'child'

        # add the remaining individual nodes (V's, S's): ids 68..74
        g.add_nodes(7)

        g.node_attributes[68]['node_attr'] = 'S6:println'
        g.node_attributes[69]['node_attr'] = 'S2:hour'
        g.node_attributes[70]['node_attr'] = 'S7:prefix'
        g.node_attributes[71]['node_attr'] = 'S3:time'
        g.node_attributes[72]['node_attr'] = 'S1:basicIfElseMethod'
        g.node_attributes[73]['node_attr'] = 'S4:System'
        g.node_attributes[74]['node_attr'] = 'S5:out'
        # ...

        return g

So far, my attempt has been to write a parser called AST2TextDataSet in a similar spirit to Text2TextDataset.

But from your reply, it sounds like I should write a class such as ASTBasedGraphConstruction, modeled on DependencyBasedGraphConstruction, and override the following methods:

Essentially I have to mimic all the steps of DependencyBasedGraphConstruction. Is this understanding correct? Or is there a simpler way of feeding custom data into the model, i.e., I write the parser that builds the GraphData and the rest of the pipeline works as is?

Also, the role of _graph_connect is not clear to me. It appears that _graph_connect is for batching?

AlanSwift commented 2 years ago

R1: The only requirement is that you implement the topology function, since this is the API called during the pipeline. parsing is only an independent "raw string to parsing results" procedure. _construct_static_graph is called after parsing; it takes the parsing outputs and constructs the graph. Both of these APIs are called inside topology, and the dataset only ever calls topology. add_vocab is a deprecated API that is no longer used, so you can ignore it.

R2: Explanation of _graph_connect: assume the raw input is a paragraph consisting of two sentences: "sentence A. sentence B." We first split the sentences and construct a subgraph for each one: sentence A --> subgraph A, sentence B --> subgraph B. Then we call _graph_connect to join these two disjoint subgraphs into one big graph. Currently, we simply add an edge between the tail node of subgraph A and the head node of subgraph B.

Note: batching is a different technique. During batching, the separate graphs are never connected; they stay disjoint.
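
To make the difference between connecting subgraphs and batching concrete, here is a minimal sketch in plain Python with no graph4nlp imports. A graph is represented as a node count plus an edge list. The helper names connect_subgraphs and batch_graphs are hypothetical, not library functions, and the sketch assumes that the "tail" of a subgraph is its last node and the "head" is its first node.

```python
def connect_subgraphs(subgraphs):
    """Merge per-sentence subgraphs into one graph, linking tail -> head.

    Each subgraph is (num_nodes, edges), with edges as (src, dst) pairs
    using local node ids. Assumption: tail = last node, head = first node.
    """
    total_nodes = 0
    merged_edges = []
    prev_tail = None
    for num_nodes, edges in subgraphs:
        offset = total_nodes
        # Shift local node ids so they are unique in the merged graph.
        merged_edges.extend((s + offset, d + offset) for s, d in edges)
        if prev_tail is not None and num_nodes > 0:
            # The one extra edge that joins the previous subgraph to this one.
            merged_edges.append((prev_tail, offset))
        if num_nodes > 0:
            prev_tail = offset + num_nodes - 1
        total_nodes += num_nodes
    return total_nodes, merged_edges


def batch_graphs(graphs):
    """Batch graphs by re-indexing only; no edges are added between graphs."""
    total_nodes = 0
    batched_edges = []
    for num_nodes, edges in graphs:
        batched_edges.extend((s + total_nodes, d + total_nodes) for s, d in edges)
        total_nodes += num_nodes
    return total_nodes, batched_edges


sentence_a = (3, [(0, 1), (1, 2)])  # subgraph A: 3 nodes in a chain
sentence_b = (2, [(0, 1)])          # subgraph B: 2 nodes
merged = connect_subgraphs([sentence_a, sentence_b])
batched = batch_graphs([sentence_a, sentence_b])
```

In this toy case, merged contains one more edge than batched: the link from node 2 (tail of A) to node 3 (head of B). With a single input graph per example, as in the AST case, there is nothing to connect.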

nashid commented 2 years ago

Thanks for your response. A couple of questions:

Q1: for my case, I only have one graph input at a time i.e., I do not have sentence one and sentence two. So I presume I should set the merge_strategy as None and that should do it.
Q2: I have only one graph at a time i.e. no case like sentence A. sentence B. So in my case, no need to implement _graph_connect - right?
Q3: How is the vocabulary generated? Can I learn the embedding (like word2vec) as part of the training?

nashid commented 2 years ago

@AlanSwift I am reading the dataset from the DOT file format and building my topology that way. Would this generic implementation be of any interest as an addition to the graph4nlp project?

AlanSwift commented 2 years ago

> Thanks for your response. A couple of questions:
>
> Q1: for my case, I only have one graph input at a time i.e., I do not have sentence one and sentence two. So I presume I should set the merge_strategy as None and that should do it.
>
> Q2: I have only one graph at a time i.e. no case like sentence A. sentence B. So in my case, no need to implement _graph_connect - right?
>
> Q3: How is the vocabulary generated? Can I learn the embedding (like word2vec) as part of the training?

R1 & R2: Your input is a single graph, so you don't need a parsing procedure. You should inherit from StaticGraphConstructionBase and implement your own ASTGraphConstruction. merge_strategy, sequential_link, and _graph_connect can then be dropped as needed. An example is the dataset for the NER graph.

R3: The vocabulary is built in the dataset. You can refer to the implementation.
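
As a rough illustration of what building a vocabulary involves, the sketch below collects tokens from the node labels of an input graph and from a target sequence, then maps them to integer ids. This is generic Python and not graph4nlp's vocabulary code; build_vocab, encode, the special tokens, and the min_freq parameter are all assumptions made for illustration.

```python
from collections import Counter


def build_vocab(token_lists, min_freq=1,
                specials=("<pad>", "<unk>", "<sos>", "<eos>")):
    """Return (stoi, itos) built from an iterable of token lists."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials)
    # Sort by descending frequency, then alphabetically, for determinism.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos


def encode(tokens, stoi):
    """Map tokens to ids, falling back to <unk> for unseen tokens."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]


# Toy data: simplified node labels and the target sequence from this thread.
node_labels = ["MethodDeclaration", "SimpleType", "String", "int", "hour"]
target_tokens = "i < array . length".split()

stoi, itos = build_vocab([node_labels, target_tokens])
ids = encode(target_tokens + ["unseen"], stoi)
```

An embedding table indexed by such ids can be trained jointly with the rest of a model, which is the usual setup for neural sequence models; whether and how that is configured in graph4nlp should be checked against the dataset implementation mentioned above.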

AlanSwift commented 2 years ago

> @AlanSwift I am reading the dataset from the DOT file format and building my topology that way. Would this generic implementation be of any interest as an addition to the graph4nlp project?

Currently, this is not in our plan.

nashid commented 2 years ago

> R1 & R2: Your input is a single graph and you don't need a parsing procedure. You should inherit StaticGraphConstructionBase and implement your ASTGraphConstruction. Thus merge_strategy, sequential_link, _graph_connect can be dropped according to your need. An example is a dataset for the NER graph.

In my case, the graph is a static dependency graph, and I build the GraphData inside the topology method of ASTGraphConstruction. A simplified implementation of ASTGraphConstruction is illustrated below:

class ASTGraphConstruction(StaticGraphConstructionBase):
    ...

    @classmethod
    def topology(
        cls,
        raw_text_data,
        nlp_processor,
        processor_args,
        merge_strategy,
        edge_strategy,
        sequential_link=True,
        verbose=0,
    ):
        """Build a GraphData object from a raw DOT file string."""
        # helper that parses the DOT string into GraphData (not shown)
        ret_graph: GraphData = build_ast_graph_from_dot_string(raw_text_data)
        return ret_graph
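
The helper build_ast_graph_from_dot_string is not shown in this thread. As a hedged sketch of what its parsing half might look like, the code below handles only a small subset of DOT (node lines of the form `name [label="..."];` and edge lines of the form `a -> b;`) and returns plain Python lists rather than a GraphData object. The name parse_simple_dot is hypothetical. Turning the result into GraphData would presumably use add_nodes, add_edges, and node_attributes as in the hard-coded AST2Graph example earlier in the thread.

```python
import re

# Matches lines such as:  n0 [label="MethodDeclaration"];
NODE_RE = re.compile(r'^\s*(\w+)\s*\[\s*label\s*=\s*"([^"]*)"\s*\]\s*;?\s*$')
# Matches lines such as:  n0 -> n1;   or   n0 -> n1 [label="child"];
EDGE_RE = re.compile(r'^\s*(\w+)\s*->\s*(\w+)\s*(?:\[[^\]]*\])?\s*;?\s*$')


def parse_simple_dot(dot_text):
    """Parse a tiny DOT subset into (labels, src, dst).

    labels[i] is the label of node i; src[k] -> dst[k] is the k-th edge.
    Lines that match neither pattern (e.g. 'digraph G {') are ignored.
    """
    labels = []   # node id -> label
    index = {}    # DOT node name -> integer node id
    src, dst = [], []

    def node_id(name):
        if name not in index:
            index[name] = len(labels)
            labels.append(name)  # default label is the node name itself
        return index[name]

    for line in dot_text.splitlines():
        edge = EDGE_RE.match(line)
        if edge:
            src.append(node_id(edge.group(1)))
            dst.append(node_id(edge.group(2)))
            continue
        node = NODE_RE.match(line)
        if node:
            labels[node_id(node.group(1))] = node.group(2)
    return labels, src, dst


example_dot = """digraph G {
  n0 [label="MethodDeclaration"];
  n1 [label="SimpleType"];
  n2 [label="String"];
  n0 -> n1;
  n1 -> n2;
}"""
labels, src, dst = parse_simple_dot(example_dot)
```

Real DOT files allow multi-line statements, subgraphs, quoted node names, and several attributes per node, so a dedicated DOT parser is likely a better choice than regular expressions for anything beyond a toy input.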

But the challenge is that in class Dataset(torch.utils.data.Dataset), the function _build_topology_process is not flexible: for a static graph type it always connects to stanfordcorenlp.StanfordCoreNLP, which I do not need for a generic custom dataset.

Furthermore, Dataset expects topology_builder to be IEBasedGraphConstruction, DependencyBasedGraphConstruction or ConstituencyBasedGraphConstruction.

        if graph_type == "static":
            print("Connecting to stanfordcorenlp server...")
            processor = stanfordcorenlp.StanfordCoreNLP(
                "http://localhost", port=port, timeout=timeout
            )

            if topology_builder == IEBasedGraphConstruction:
                ...
                processor_args = [props_coref, props_openie]
            elif topology_builder == DependencyBasedGraphConstruction:
                processor_args = {
                    "annotators": "ssplit,tokenize,depparse",
                    "tokenize.options": "splitHyphenated=false,normalizeParentheses=false,"
                    "normalizeOtherBrackets=false",
                    "tokenize.whitespace": True,
                    "ssplit.isOneSentence": True,
                    "outputFormat": "json",
                }
            elif topology_builder == ConstituencyBasedGraphConstruction:
                processor_args = {
                    "annotators": "tokenize,ssplit,pos,parse",
                    "tokenize.options": "splitHyphenated=false,normalizeParentheses=false,"
                    "normalizeOtherBrackets=false",
                    "tokenize.whitespace": True,
                    "ssplit.isOneSentence": False,
                    "outputFormat": "json",
                }
            else:
                raise NotImplementedError

As a result, my custom topology builder is never invoked and I get a NotImplementedError. It appears that I would have to modify this section of the graph4nlp library itself and add my custom topology builder ASTGraphConstruction under the static graph type.

However, that means the library is not customizable at all. Let me know if I am missing something. What would be the appropriate implementation?

AlanSwift commented 2 years ago

Yes, this is a high-priority item we are working on, and it will be improved in version 0.6. For now, I'm afraid you have to work around it by overriding the _build_topology_process function as follows:

class YourDataset(Text2TextDataset):  # assuming Text2TextDataset is what you need

    @staticmethod
    def _build_topology_process(
        data_items,
        topology_builder,
        graph_type,
        dynamic_graph_type,
        dynamic_init_topology_builder,
        merge_strategy,
        edge_strategy,
        dynamic_init_topology_aux_args,
        lower_case,
        tokenizer,
        port,
        timeout,
    ):
        # call your topology builder here to construct the AST graph
        ...

We are sorry for this inconvenience. The experience will be improved in the next two versions.
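
The override pattern suggested above can be illustrated with stand-in classes. Nothing in the sketch below is graph4nlp code: BaseDataset, ASTDataset, DependencyBuilder, and ASTBuilder are hypothetical names, and the method signatures are simplified to two arguments. It also assumes the base class looks up _build_topology_process through the instance, so that a subclass override takes effect; that should be verified against the library version in use.

```python
class DependencyBuilder:
    """Stand-in for a builder the base dataset already knows about."""

    @classmethod
    def topology(cls, raw):
        return ("dep", raw)


class ASTBuilder:
    """Stand-in for a custom builder that needs no external NLP server."""

    @classmethod
    def topology(cls, raw):
        return ("ast", raw)


class BaseDataset:
    """Stand-in for a dataset with a hard-coded builder dispatch."""

    @staticmethod
    def _build_topology_process(data_items, topology_builder):
        if topology_builder is DependencyBuilder:
            return [topology_builder.topology(item) for item in data_items]
        # Any builder not listed above is rejected, as in the excerpt above.
        raise NotImplementedError(topology_builder)

    def build_topology(self, data_items, topology_builder):
        # Lookup via self, so a subclass override is the one that runs.
        return self._build_topology_process(data_items, topology_builder)


class ASTDataset(BaseDataset):
    @staticmethod
    def _build_topology_process(data_items, topology_builder):
        # Bypass the hard-coded dispatch and call the custom builder directly.
        return [topology_builder.topology(item) for item in data_items]


graphs = ASTDataset().build_topology(["digraph G {}"], ASTBuilder)
```

Calling BaseDataset().build_topology(["digraph G {}"], ASTBuilder) raises NotImplementedError, mirroring the failure described earlier, while the ASTDataset call goes through the custom builder.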

AlanSwift commented 2 years ago

This issue will be closed. Please feel free to reopen if necessary.

graph4ai / graph4nlp

How to build a custom dataset for the graph2seq model? #460
