Closed nashid closed 2 years ago
Can you please give a more concrete description? Is it the step of customizing your dataset that troubles you, or how to do the decoding once the GNN computation is finished?
@AlanSwift I am using the NMT example from graph4nlp.
My input/output is as below:
INPUT:
OUTPUT: i < array . length
I started by following the class IWSLT14Dataset, which inherits from Text2TextDataset, in that example.
I converted the INPUT graph in the picture into a GraphData object. However, the steps are not clear to me and I don’t find a complete example showing the whole flow for feeding custom data. I am finding the steps for customizing my dataset confusing.
@AlanSwift any feedback is welcome.
Sorry for the late reply. I think you don't know how to convert your raw inputs to GraphData. An example is the dependency graph which converts the dependency parsing tree to GraphData. Please refer to https://github.com/graph4ai/graph4nlp/blob/master/graph4nlp/pytorch/modules/graph_construction/dependency_graph_construction.py.
@AlanSwift I have already parsed my file and converted that raw input into GraphData. Just to clarify, here is an example with hard-coded data for illustration:
def AST2Graph():
    g = GraphData()
    # add leaf nodes (blue squares T)
    g.add_nodes(36)
    # add leaf nodes attributes
    g.node_attributes[0]['node_attr'] = 'T1:String'
    g.node_attributes[1]['node_attr'] = 'T2:basicIfElseMethod'
    g.node_attributes[2]['node_attr'] = 'T3:int'
    g.node_attributes[3]['node_attr'] = 'T4:hour'
    ......
    # add non-terminals nodes
    g.add_nodes(32)
    # add attributes to the node terminals
    g.node_attributes[36]['node_attr'] = 'NT1:MethodDeclaration (lineNumber=2)'
    g.node_attributes[37]['node_attr'] = 'NT2:SimpleType (lineNumber=2)'
    g.node_attributes[38]['node_attr'] = 'NT3:SingleVariableDeclaration (lineNumber=2)'
    ....
    # add the relationships non-terminal (parent) and terminal nodes (leafs/child)- the first 36 relationships
    g.add_edges(
        [36, 37, 38, 39, 42, 43, 43, 44, 45, 45, 45, 48, 48, 50, 50, 51, 51, 52, 53, 53, 53, 56, 56, 58, 58, 59, 59, 62, 63, 63, 65, 66, 66, 66, 67, 67],
        [1, 0, 3, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 16, 13, 14, 17, 18, 19, 20, 21, 22, 25, 26, 23, 24, 27, 28, 29, 30, 31, 32, 33, 34, 35])
    # add edge attribute = "child"
    for i in range(g.get_edge_num()):
        g.edge_attributes[i]['edge_attr'] = 'child'
    # add the rest of the individual nodes V's, S's
    g.add_nodes(7)
    g.node_attributes[68]['node_attr'] = 'S6:println'
    g.node_attributes[69]['node_attr'] = 'S2:hour'
    g.node_attributes[70]['node_attr'] = 'S7:prefix'
    g.node_attributes[71]['node_attr'] = 'S3:time'
    g.node_attributes[72]['node_attr'] = 'S1:basicIfElseMethod'
    g.node_attributes[73]['node_attr'] = 'S4:System'
    g.node_attributes[74]['node_attr'] = 'S5:out'
    ...
    return g
So far my attempt has been to write a parser called AST2TextDataSet in a similar spirit to Text2TextDataset.
But from your reply it sounds like I should write a class like ASTBasedGraphConstruction, following DependencyBasedGraphConstruction, and override the following methods:
Essentially I have to mimic all the steps of DependencyBasedGraphConstruction. Is this understanding correct? Or is there a simpler way of feeding custom data into the model, i.e. I would write the parser to build the GraphData and the rest of the pipeline would work as is?
Also, it is not clear to me what the role of _graph_connect is. It appears _graph_connect is for batching?
R1: The only requirement is that you implement the topology function, since this is the API called during the pipeline. parsing is only an independent "raw string to parsing results" procedure. _construct_static_graph is called after parsing; it takes the parsing outputs and constructs the graph. These two APIs are called in topology, and the dataset will only call the topology API. add_vocab is a deprecated API and will not be used, so you can ignore it.
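The call order described in R1 can be sketched roughly as follows. This is a plain-Python illustration, not graph4nlp code: the class `SketchGraphConstruction`, its whitespace "parser", and the dict that stands in for GraphData are all hypothetical. Only the method names `topology`, `parsing`, and `_construct_static_graph` come from the reply above.

```python
class SketchGraphConstruction:
    """Hypothetical stand-in that mirrors the call order described in R1."""

    @classmethod
    def parsing(cls, raw_text_data):
        # Independent "raw string -> parsing results" step.
        # Assumption: the "parser" here simply splits on whitespace.
        return raw_text_data.split()

    @classmethod
    def _construct_static_graph(cls, parsed):
        # Takes the parsing output and builds the graph.
        # A dict stands in for GraphData: one node per token,
        # with an edge from each token to the next one.
        nodes = list(parsed)
        edges = [(i, i + 1) for i in range(len(nodes) - 1)]
        return {"nodes": nodes, "edges": edges}

    @classmethod
    def topology(cls, raw_text_data):
        # The single entry point that a dataset would call.
        parsed = cls.parsing(raw_text_data)
        return cls._construct_static_graph(parsed)


graph = SketchGraphConstruction.topology("i < array . length")
print(graph["nodes"])
print(graph["edges"])
```

If the input is already a graph, as in this thread, the `parsing` step can be empty and `topology` can build the graph directly.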
R2: Explanation of _graph_connect: We assume the raw input is a paragraph consisting of two sentences: "sentence A. sentence B." We first separate the sentences and construct a subgraph for each one: sentence A --> subgraph A, sentence B --> subgraph B. Then we call _graph_connect to connect these two disjoint subgraphs into one big graph. Currently, we simply add edges between the tail node of A and the head node of B.
Note: batching is a different technique. During batching, the separate graphs are never connected and stay disjoint.
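The difference between connecting and batching can be illustrated with a small plain-Python sketch. It does not use graph4nlp. The function names and the dict-based graphs are hypothetical, and it assumes that "tail node" means the last node of A, "head node" means the first node of B, and that a single directed edge is added between them.

```python
def connect_subgraphs(sub_a, sub_b):
    """Merge two subgraphs into one graph and link A's tail to B's head."""
    offset = len(sub_a["nodes"])
    nodes = sub_a["nodes"] + sub_b["nodes"]
    # Re-index B's edges so they point at B's nodes in the merged graph.
    edges = list(sub_a["edges"]) + [(s + offset, d + offset) for s, d in sub_b["edges"]]
    # Bridge edge (assumption): last node of A -> first node of B.
    edges.append((offset - 1, offset))
    return {"nodes": nodes, "edges": edges}


def batch_subgraphs(graphs):
    """Place graphs side by side: nodes are re-indexed, no bridge edges are added."""
    nodes, edges, offset = [], [], 0
    for g in graphs:
        nodes += g["nodes"]
        edges += [(s + offset, d + offset) for s, d in g["edges"]]
        offset += len(g["nodes"])
    return {"nodes": nodes, "edges": edges}


sub_a = {"nodes": ["sentence", "A"], "edges": [(0, 1)]}
sub_b = {"nodes": ["sentence", "B"], "edges": [(0, 1)]}

connected = connect_subgraphs(sub_a, sub_b)
batched = batch_subgraphs([sub_a, sub_b])
print(connected["edges"])
print(batched["edges"])
```

The connected graph has one extra edge compared with the batched one, which is the only structural difference between the two results.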
Thanks for your response. A couple of questions:
Q1: In my case, I only have one graph input at a time, i.e. I do not have sentence one and sentence two. So I presume I should set merge_strategy to None and that should do it.
Q2: I have only one graph at a time, i.e. no case like "sentence A. sentence B." So in my case there is no need to implement _graph_connect, right?
Q3: How is the vocabulary generated? Can I learn the embedding (like word2vec) as part of the training?
@AlanSwift I am reading my dataset from the DOT file format and building my topology that way. Would this generic implementation be of any interest to be added to the graph4nlp project?
R1 & R2: Your input is a single graph and you don't need a parsing procedure. You should inherit from StaticGraphConstructionBase and implement your ASTGraphConstruction. Thus merge_strategy, sequential_link, and _graph_connect can be dropped according to your needs. An example is a dataset for the NER graph.
R3: The vocabulary is built in the dataset. You can refer to the implementation.
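As a rough illustration of what "building a vocabulary in the dataset" usually means, here is a generic sketch. It is not graph4nlp's implementation: `build_vocab`, the special tokens, and the frequency ordering are assumptions made for this example, and the token lists are taken from the node labels shown earlier in the thread.

```python
from collections import Counter


def build_vocab(token_lists, min_freq=1, specials=("<pad>", "<unk>")):
    """Map each token to an integer id; special tokens get the lowest ids."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically for determinism.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in itos:
            itos.append(tok)
    return {tok: idx for idx, tok in enumerate(itos)}


# Hypothetical node tokens collected from two AST graphs.
node_tokens = [
    ["String", "basicIfElseMethod", "int", "hour"],
    ["hour", "time", "System", "out"],
]
vocab = build_vocab(node_tokens)

# Unknown tokens fall back to the <unk> id.
ids = [vocab.get(tok, vocab["<unk>"]) for tok in ["hour", "prefix"]]
print(ids)
```

Integer ids like these are what a trainable embedding table (for example `torch.nn.Embedding`) is indexed with, so the vectors can be learned during training. How graph4nlp wires this up is best checked in its dataset and embedding code.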
Regarding adding the generic DOT-file implementation to the graph4nlp project: currently, this is not in our plan.
In my case, the graph is a static dependency graph and I build the GraphData inside the topology method of ASTGraphConstruction. A simplified implementation of ASTGraphConstruction is illustrated below:
class ASTGraphConstruction(StaticGraphConstructionBase):
    ….
    def topology(
        cls,
        raw_text_data,
        nlp_processor,
        processor_args,
        merge_strategy,
        edge_strategy,
        sequential_link=True,
        verbose=0,
    ):
        """
        Build graphdata From raw dot file string.
        """
        ret_graph: GraphData = build_ast_graph_from_dot_string(raw_text_data)
        return ret_graph
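The helper build_ast_graph_from_dot_string is not shown in the thread. A minimal sketch of the parsing half of such a helper might look like the following. It assumes a very restricted DOT subset (one `name [label="..."]` or `a -> b` statement per line), the name `parse_dot` is hypothetical, and it returns plain lists instead of a GraphData so that it runs without graph4nlp. The default edge label "child" mirrors the edge attribute used in the AST2Graph example.

```python
import re

# Assumption: one statement per line, node names are word characters only.
NODE_RE = re.compile(r'^\s*(\w+)\s*\[\s*label\s*=\s*"([^"]*)"\s*\]\s*;?\s*$')
EDGE_RE = re.compile(
    r'^\s*(\w+)\s*->\s*(\w+)\s*(?:\[\s*label\s*=\s*"([^"]*)"\s*\])?\s*;?\s*$'
)


def parse_dot(dot_string):
    """Return (node_labels, src_ids, dst_ids, edge_labels) for a simple DOT string."""
    node_ids = {}
    labels = []
    src, dst, edge_labels = [], [], []

    def index_of(name):
        # Assign consecutive integer ids in order of first appearance.
        if name not in node_ids:
            node_ids[name] = len(labels)
            labels.append(name)  # until a label is seen, use the node name
        return node_ids[name]

    for line in dot_string.splitlines():
        edge = EDGE_RE.match(line)
        if edge:
            src.append(index_of(edge.group(1)))
            dst.append(index_of(edge.group(2)))
            edge_labels.append(edge.group(3) or "child")
            continue
        node = NODE_RE.match(line)
        if node:
            labels[index_of(node.group(1))] = node.group(2)
    return labels, src, dst, edge_labels


dot = """digraph ast {
  n0 [label="MethodDeclaration"];
  n1 [label="SimpleType"];
  n2 [label="String"];
  n0 -> n1;
  n1 -> n2 [label="child"];
}"""
labels, src, dst, edge_labels = parse_dot(dot)
print(labels, src, dst, edge_labels)
```

With graph4nlp installed, these lists could then be loaded into a GraphData with the same calls used in the AST2Graph example earlier in the thread (add_nodes, node_attributes[i]['node_attr'], add_edges, edge_attributes[i]['edge_attr']).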
But the challenge is that in class Dataset(torch.utils.data.Dataset), the function _build_topology_process is not flexible: it always connects to stanfordcorenlp.StanfordCoreNLP, which I do not need for a generic custom dataset.
Furthermore, Dataset expects topology_builder to be IEBasedGraphConstruction, DependencyBasedGraphConstruction, or ConstituencyBasedGraphConstruction.
if graph_type == "static":
    print("Connecting to stanfordcorenlp server...")
    processor = stanfordcorenlp.StanfordCoreNLP(
        "http://localhost", port=port, timeout=timeout
    )
    if topology_builder == IEBasedGraphConstruction:
        ...
        processor_args = [props_coref, props_openie]
    elif topology_builder == DependencyBasedGraphConstruction:
        processor_args = {
            "annotators": "ssplit,tokenize,depparse",
            "tokenize.options": "splitHyphenated=false,normalizeParentheses=false,"
            "normalizeOtherBrackets=false",
            "tokenize.whitespace": True,
            "ssplit.isOneSentence": True,
            "outputFormat": "json",
        }
    elif topology_builder == ConstituencyBasedGraphConstruction:
        processor_args = {
            "annotators": "tokenize,ssplit,pos,parse",
            "tokenize.options": "splitHyphenated=false,normalizeParentheses=false,"
            "normalizeOtherBrackets=false",
            "tokenize.whitespace": True,
            "ssplit.isOneSentence": False,
            "outputFormat": "json",
        }
    else:
        raise NotImplementedError
As a result, my custom topology builder does not get invoked, which results in the NotImplementedError. It appears that I have to modify this section in the original graph4nlp library and incorporate my custom topology builder ASTGraphConstruction under the static graph type.
However, that way it is not customizable at all. In case I am missing something, do let me know. What would be the appropriate implementation?
Yes, fixing this is an urgent item on our plan, and it will be improved in version 0.6. For now, I'm afraid you have to override the _build_topology_process function as follows:
class YourDataset(Text2TextDataset):  # I guess that the Text2TextDataset is your need
    @staticmethod
    def _build_topology_process(
        data_items,
        topology_builder,
        graph_type,
        dynamic_graph_type,
        dynamic_init_topology_builder,
        merge_strategy,
        edge_strategy,
        dynamic_init_topology_aux_args,
        lower_case,
        tokenizer,
        port,
        timeout,
    ):
        # call your topology builder to construct the AST graph
We are sorry for this inconvenience. The experience will be improved in the next two versions.
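The override pattern suggested above can be illustrated with a self-contained sketch. It does not use graph4nlp: `BaseDataset`, `ASTDataset`, `build_topology`, and `ast_topology` are hypothetical stand-ins with a much shorter signature than the real `_build_topology_process`. The sketch also assumes the base class looks the method up through `self`, so that a subclass override is picked up.

```python
class BaseDataset:
    """Hypothetical stand-in for a dataset with a hard-coded builder dispatch."""

    KNOWN_BUILDERS = ("dependency", "constituency", "ie")

    @staticmethod
    def _build_topology_process(data_items, topology_builder):
        # Mimics the behaviour discussed above: unknown builders are rejected.
        if topology_builder not in BaseDataset.KNOWN_BUILDERS:
            raise NotImplementedError
        return data_items

    def build_topology(self, data_items, topology_builder):
        # Looked up through self, so a subclass override takes effect.
        return self._build_topology_process(data_items, topology_builder)


class ASTDataset(BaseDataset):
    @staticmethod
    def _build_topology_process(data_items, topology_builder):
        # Skip the external parser setup and call the custom builder directly.
        return [topology_builder(item) for item in data_items]


def ast_topology(raw):
    # Hypothetical builder: a dict stands in for GraphData.
    return {"nodes": raw.split()}


graphs = ASTDataset().build_topology(["i < array"], ast_topology)
print(graphs)
```

The base class would reject `ast_topology`, while the subclass routes every data item through it. In the real library, the overriding method has to keep the full parameter list shown in the reply above.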
This issue will be closed. Please feel free to reopen if necessary.
❓ Questions and Help
I want to use the graph2seq model that would encode an input source code AST as a graph and would decode to a target sequence. So far I am building the graph model using the GraphData API to build source code AST as a graph. However, how would I feed the sequence data to the decoder?
I don’t find one complete example showing the whole flow. Can anyone point me in the right direction?