Closed: dojoteef closed this issue 3 years ago
Hi @dojoteef. Thanks for raising this issue. I'm looking into this at the moment. The regexes that we use to split the aggregate summaries are not exhaustive, and at times we had to manually intervene and modify the lines in the text rather than write a regex for every outlier scenario. I am adding some more checks to try to split as many book chapters as possible from the list you've shared. Thanks again!
My latest commit should have fixed this issue. I have tested the fix for all sources, and it should help with splitting some of the book chapters that we were not able to get before.
Thanks so much for looking into all these issues I've brought up! I've shifted focus for now, but will be coming back to the BookSum dataset in the near future and will try to see if I encounter any additional blockers.
Thanks again for all the help!
I've been trying to diagnose why I have missing data, and part of the problem appears to be in the separate_multiple_summaries function. The end result is that the script doesn't split some books which are expected to be split in the provided chapter-level-summary-alignments.
An example of this behavior can be seen by stepping through the splitting of A Room With a View from gradesaver. It turns out that the script doesn't account for the <PARAGRAPH> tags, despite a comment in the source which states that it should.
While stepping through the function, you can see that the regex splits the text into lines like so:
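A minimal sketch of one such line and of why it later fails to match, reconstructed from the line quoted in this issue. The pattern `CHAPTER_AT_START` and both helper functions are hypothetical stand-ins, not the repository's actual code: the first helper only mimics a preprocessing step that strips leading punctuation, and the second shows one possible workaround rather than the fix that was committed.

```python
import re
import string

# Hypothetical stand-in for a chapter regex that is anchored at the start of the line.
CHAPTER_AT_START = re.compile(r"^chapter\s+\w+", re.IGNORECASE)


def strip_leading_punctuation(line: str) -> str:
    """Hypothetical stand-in: remove leading punctuation and spaces only."""
    return line.lstrip(string.punctuation + " ")


def strip_paragraph_tag(line: str) -> str:
    """Possible workaround (an assumption): drop a leading <PARAGRAPH> tag first."""
    return re.sub(r"^\s*<PARAGRAPH>\s*", "", line)


# A line of the form described in this issue.
line = "<PARAGRAPH>Chapter Two In Santa Croce with No Baedeker:"

# Stripping leading punctuation removes only the "<", leaving the tag name behind.
stripped = strip_leading_punctuation(line)
print(stripped)  # PARAGRAPH>Chapter Two In Santa Croce with No Baedeker:
print(bool(CHAPTER_AT_START.match(stripped)))  # False

# Removing the whole tag before stripping lets the anchored pattern match.
cleaned = strip_leading_punctuation(strip_paragraph_tag(line))
print(bool(CHAPTER_AT_START.match(cleaned)))  # True
```

Removing the tag as a unit, rather than relying on punctuation stripping, keeps the chapter marker at the start of the string, which is what an anchored pattern needs.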
Then the first preprocessing function in the loop, remove_prefixes_line, simply takes off the first < due to split_aggregate_chaps_all_sources.py:276, which strips all leading punctuation. The resulting line, which starts with "PARAGRAPH>Chapter Two In Santa Croce with No Baedeker:", doesn't match the regex, which expects the chapter marker to be at the beginning of the string.
This splitting issue (maybe there are more issues with splitting, but this is the one I investigated) causes a number of books to fail to split. Here's the list of books that the data collection script downloaded but failed to properly split for me: