how to convert dependency parsing into Bert-gt neighbors for Nary Cross-sentence RE

ncbi / bert_gt


how to convert dependency parsing into Bert-gt neighbors for Nary Cross-sentence RE #2

Closed juliachen123 closed 3 years ago

juliachen123 commented 3 years ago

Hi there, thank you very much for making the code and data available! I am interested in testing the model on my own dataset for the Nary cross-sentence RE task. I am wondering which functions in convert_nary_2_bert_gt convert the dependency (arc/edge) information in the original dataset into the content of the bert_gt neighbors column. Alternatively, if you could explain the concept behind the conversion, that would be much appreciated. Thank you in advance!

ptlai commented 3 years ago

Hello julia78118,

Thank you for your interest in our work, and I am sorry for the late response.

I am not sure I fully understand your question. In the nary task, the original dataset already provides the neighbors of each token. We observed that the previous works we compared against used this neighbor information directly, so we did the same and took the neighbors straight from the XML files of the nary dataset. The script convert_nary_2_bert_gt.py converts the XML file into our TSV file format, which is almost the same as the original BERT input file but with an additional column of neighbor information taken from the XML files. Please let me know if this is not what you asked.

Thank you!

juliachen123 commented 3 years ago

Hi @ptlai,

Thank you for getting back to me!

The arcs in the original nary dataset include toIndex and index. I was wondering if you could elaborate on how to interpret the neighbors in your processed nary dataset, which were converted from those arcs (for example, 1|2 0|2 0|1|3|9 2|4 ...). I tried to look it up in the paper, and I apologize in advance if I missed it.

Thank you!

ptlai commented 3 years ago

Hi @julia78118

Each number is the index of a neighbor token that has an edge pointing to the current token, that is, an in-edge of the current token. In the original dataset, the toIndex of an arc represents an out-edge of the current token. I convert these out-edges into in-edges so that each token lists the neighbors that point to it. Please let me know if you still have any questions. Thanks!
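
The out-edge to in-edge conversion described above can be sketched roughly as follows. This is a minimal illustration and not the code in convert_nary_2_bert_gt.py: the function names, the input layout (one list of toIndex values per token), the skipping of out-of-range targets, and the output format (tokens separated by spaces, indexes separated by |, inferred from the example string earlier in this thread) are all assumptions.

```python
def out_edges_to_in_neighbors(arcs_per_token):
    """Invert out-edges into in-edges.

    arcs_per_token[i] is assumed to hold the toIndex values of token i,
    i.e. the tokens that token i points to (hypothetical input layout).
    Returns, for each token, the sorted indexes of tokens pointing to it.
    """
    n = len(arcs_per_token)
    in_neighbors = [set() for _ in range(n)]
    for src, targets in enumerate(arcs_per_token):
        for dst in targets:
            # Assumption: targets outside the sentence (e.g. -1) are skipped.
            if 0 <= dst < n:
                in_neighbors[dst].add(src)
    return [sorted(s) for s in in_neighbors]


def format_neighbors(in_neighbors):
    """Join indexes with '|' per token and tokens with spaces (assumed format)."""
    return " ".join("|".join(str(i) for i in nbrs) for nbrs in in_neighbors)


if __name__ == "__main__":
    # Token 0 points to 1 and 2, token 1 to 0 and 2, token 2 to 0 and 1.
    arcs = [[1, 2], [0, 2], [0, 1]]
    print(format_neighbors(out_edges_to_in_neighbors(arcs)))  # prints: 1|2 0|2 0|1
```

Note that in this sketch a token with no in-edges produces an empty field; how the actual script represents that case is not stated in this thread.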

juliachen123 commented 3 years ago

Hi @ptlai,

Thank you for the clarification! Those numbers make sense now.

ptlai commented 3 years ago

You're welcome! @julia78118