hfthair / emerald_crawler

Question about experiments #3

Closed JaniceXiong closed 3 years ago

JaniceXiong commented 3 years ago

In the paper "Bringing Structure into Summaries: a Faceted Summarization Dataset for Long Scientific Documents", I see that the models take various kinds of content as input, such as Full, IC, Intro, Method, Result, Conclu, ... [image]

As Appendix A describes, keywords can be used to identify paper sections. But after preprocessing the crawled data, I find that not every item has a complete set of sections. For example, this item in the test set has no Introduction or Conclusion section, so IC is empty as well. [image]

So I wonder whether each experiment was run only on the items that have the relevant field. For example, if 5000 of the 6000 items in the test set have an Intro section, were the results computed on only those 5000?

Could you also release the preprocessing code, so that the data fed to the models is the same?

hfthair commented 3 years ago

Hi JaniceXiong, we use the 1st section if the Intro section is missing, and the last section if the Conclu section is missing. I have uploaded the preprocessing code to /notebooks/preprocess_jsonl2pairs.ipynb. The notebook generates source-target pairs from train|dev|test.jsonl. I hope that helps.
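
For readers who want the fallback rule at a glance, here is a minimal Python sketch. It is not the notebook's code: the keyword tuples, the `(heading, text)` section layout, and the function names are all assumptions made for illustration, and the real keyword lists are the ones in Appendix A of the paper.

```python
from typing import List, Optional, Tuple

# Hypothetical keyword lists (assumption); Appendix A defines the real ones.
INTRO_KEYWORDS = ("introduction",)
CONCLUSION_KEYWORDS = ("conclusion",)

# Assumed layout: each section is a (heading, body text) pair, in paper order.
Section = Tuple[str, str]


def find_section(sections: List[Section], keywords: Tuple[str, ...]) -> Optional[str]:
    """Return the text of the first section whose heading contains a keyword."""
    for heading, text in sections:
        if any(keyword in heading.lower() for keyword in keywords):
            return text
    return None


def intro_and_conclusion(sections: List[Section]) -> Tuple[str, str]:
    """Pick Intro and Conclu text, falling back to the first and last section."""
    if not sections:
        return "", ""
    intro = find_section(sections, INTRO_KEYWORDS)
    if intro is None:
        intro = sections[0][1]  # Intro missing: use the 1st section
    conclusion = find_section(sections, CONCLUSION_KEYWORDS)
    if conclusion is None:
        conclusion = sections[-1][1]  # Conclu missing: use the last section
    return intro, conclusion
```

With this rule every item that has at least one section yields a non-empty Intro and Conclu input, so IC can be built for it without dropping the item.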