sebastianGehrmann / bottom-up-summary

BSD 3-Clause "New" or "Revised" License

Trouble generating content selection training data #20

Closed JJJJane closed 4 years ago

JJJJane commented 4 years ago

Hi, as mentioned in the title, I would like to know how you tag the training data in the content selection step. I understand that you do it by aligning the summaries to the document, but I couldn't reproduce the training data on my own using the method described in the paper, and I am a bit confused by your labeled data. Here's one example. The original document is: '-lrb- cnn -rrb- relations between iran and saudi arabia have always been thorny , but rarely has the state of affairs been as venomous as it is today . tehran and riyadh each point to the other as the main reason for much of the turmoil in the middle east . in its most recent incarnation , the iranian-saudi conflict by proxy has reached yemen in a spiral that both sides portray as climatic . for riyadh and its regional allies , the saudi military intervention in yemen -- operation decisive storm '' -- is the moment the sunni arab nation finally woke up to repel the expansion of shia-iranian influence . for tehran and its regional allies -- including the houthi movement in yemen -- saudi arabia \'s actions are in defense of a retrogressive status quo order that is no longer tenable . and yet both sides have good reasons to want to stop the yemeni crisis from spiraling out of control and evolving into an unwinnable war . when iranian president hassan rouhani was elected in june 2013 , he pledged to reach out to riyadh . he was up front and called tehran \'s steep deterioration of relations with the saudis over the last decade as one of the principal burdens on iranian foreign policy . from lebanon and afghanistan to pakistan and the gaza strip , the iranian-saudi rivalry and conflict through proxy has been deep and costly . and yet despite rouhani \'s open pledge , profound differences over syria and iraq in particular have kept riyadh and tehran apart . 
but if the questions of syria and iraq prevented a pause in hostilities , the saudi military intervention in yemen since late march has all but raised the stakes to unprecedentedly dangerous levels . unlike in syria and in iraq , the saudi military is now directly battling it out with iranian-backed rebels in yemen . while riyadh no doubt exaggerates tehran \'s role in the yemen crisis , its fingerprints are nonetheless evident . iran provides financial support , weapons , training and intelligence to houthis , '' gerald feierstein , a u.s. state department official and former yemen ambassador , told a congressional hearing last week . `` we believe that iran sees opportunities with the houthis to expand its influence in yemen and threaten saudi and gulf arab'

and the ground-truth summary is: ' vatanka : tensions between iran and saudi arabia are at an unprecedented level . iran has proposed a four-point plan for yemen but saudis have ignored it . vatanka : saudis have tried to muster a ground invasion coalition but have failed . '

the tagged data provided in this repo is: 'between iran and saudi arabia have but has as it is . and point to for in yemen a saudi arab saudi arabia are no an saudis on iran we'

But the word 'we' does not appear in the summary, so why is it tagged 1? Anyway, could you please provide the code that tags the training data? That would really help me a lot. Thanks!

JJJJane commented 4 years ago

I noticed that the labeled data are generated by preprocess_copy.py. Sorry for my carelessness.
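
For anyone else trying to reproduce the tags by hand, here is a minimal sketch of the greedy alignment idea as I read it from the paper: a source token is tagged 1 if it belongs to the longest run of source tokens that also appears contiguously in the summary, and an identical run has not already been tagged earlier in the source. This is only an illustration under that assumption. The helper names `contains` and `tag_source` are made up for this example, it is not the code in preprocess_copy.py, and it may not reproduce the released labels exactly (tokenization and truncation differences alone can change the result).

```python
def contains(seq, sub):
    """Return True if `sub` occurs as a contiguous run inside `seq`."""
    n, m = len(seq), len(sub)
    return any(seq[k:k + m] == sub for k in range(n - m + 1))


def tag_source(source_tokens, summary_tokens):
    """Hypothetical greedy tagger: 1 = token is aligned to the summary."""
    tags = [0] * len(source_tokens)
    seen = set()  # runs that were already tagged earlier in the source
    i = 0
    while i < len(source_tokens):
        # Grow the run starting at i while it still appears in the summary.
        j = 0
        while (i + j < len(source_tokens)
               and contains(summary_tokens, source_tokens[i:i + j + 1])):
            j += 1
        if j == 0:
            i += 1  # this token is not in the summary at all
            continue
        run = tuple(source_tokens[i:i + j])
        if run not in seen:  # only the first occurrence of a run is tagged
            seen.add(run)
            for k in range(i, i + j):
                tags[k] = 1
        i += j
    return tags


source = "iran and saudi arabia . iran and saudi arabia".split()
summary = "iran and saudi arabia clash".split()
print(tag_source(source, summary))
```

With this toy input only the first "iran and saudi arabia" run is tagged. The second, identical run is skipped because of the "no earlier identical run" rule.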