salesforce / booksum

BSD 3-Clause "New" or "Revised" License

Incorrect behavior in separate_mulitple_summaries function #16

Closed dojoteef closed 3 years ago

dojoteef commented 3 years ago

I've been trying to diagnose why I have missing data, and part of the problem appears to be in the separate_multiple_summaries function. The end result is that the script doesn't split some books which are expected to be split in the provided chapter-level-summary-alignments.

An example of this behavior can be seen by stepping through the splitting of A Room With a View from gradesaver. It turns out that the script doesn't account for the <PARAGRAPH> tags, despite a comment in the source stating that it should.

While stepping through the function, you can see that the regex splits the text into lines like so:

<PARAGRAPH>Chapter Two In Santa Croce with No Baedeker:<PARAGRAPH>Summary:<PARAGRAPH>Lucy looks out her window onto the beautiful scene of a Florence morning

Then the first preprocessing function in the loop, remove_prefixes_line, takes off only the leading < because split_aggregate_chaps_all_sources.py:276 strips all leading punctuation. The resulting line, which starts with PARAGRAPH>Chapter Two In Santa Croce with No Baedeker:, doesn't match the regex, which expects the chapter marker to be at the beginning of the string.
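The failure mode described above can be reproduced in isolation. The following is only a minimal sketch: CHAPTER_RE, strip_leading_punctuation, and strip_paragraph_tags are hypothetical stand-ins written for illustration, not the actual code or regex from split_aggregate_chaps_all_sources.py, and removing the tags first is just one possible workaround, not necessarily the fix the maintainers applied.

```python
import re
import string

# Hypothetical stand-in for the chapter-marker regex; the real pattern in
# split_aggregate_chaps_all_sources.py may differ. Like the real one, it is
# anchored at the start of the string.
CHAPTER_RE = re.compile(r"^chapter\s+\w+", re.IGNORECASE)


def strip_leading_punctuation(line: str) -> str:
    """Mimic the reported behavior: drop leading punctuation characters only."""
    return line.lstrip(string.punctuation)


def strip_paragraph_tags(line: str) -> str:
    """Possible workaround (an assumption): remove leading <PARAGRAPH> tags first."""
    return re.sub(r"^(?:\s*<PARAGRAPH>)+\s*", "", line)


line = "<PARAGRAPH>Chapter Two In Santa Croce with No Baedeker:<PARAGRAPH>Summary:"

# Only the "<" is removed, so the line now starts with "PARAGRAPH>".
broken = strip_leading_punctuation(line)

# Removing the tag before the punctuation cleanup leaves the chapter marker first.
fixed = strip_leading_punctuation(strip_paragraph_tags(line))

print(broken)
print("matches without tag handling:", bool(CHAPTER_RE.match(broken)))
print("matches with tag handling:", bool(CHAPTER_RE.match(fixed)))
```

Because the pattern is anchored with ^, the leftover PARAGRAPH> text is enough to prevent a match, even though the chapter marker is still present later in the line.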

This splitting issue (maybe there are more issues with splitting, but this is the one I investigated) causes a number of books to fail to split. Here's the list of books that the data collection script downloaded, but failed to properly split for me:

gradesaver/A Room With a View
gradesaver/A Tale of Two Cities
gradesaver/Adam Bede
gradesaver/Anne of Green Gables
gradesaver/Antony and Cleopatra
gradesaver/As You Like It
gradesaver/Babbitt
gradesaver/Bleak House
gradesaver/Dombey and Son
gradesaver/Dr. Jekyll and Mr. Hyde
gradesaver/Dracula
gradesaver/Emma
gradesaver/Ethan Frome
gradesaver/Every Man in His Humour
gradesaver/Frankenstein
gradesaver/Gulliver's Travels
gradesaver/Incidents in the Life of a Slave Girl
gradesaver/Jane Eyre
gradesaver/Kidnapped
gradesaver/King Solomon's Mines
gradesaver/Little Women
gradesaver/Middlemarch
gradesaver/My Antonia
gradesaver/Northanger Abbey
gradesaver/Regeneration
gradesaver/Sense and Sensibility
gradesaver/Tess of the D'Urbervilles
gradesaver/The Age of Innocence
gradesaver/The Blithedale Romance
gradesaver/The House of the Seven Gables
gradesaver/The Jungle
gradesaver/The Marrow of Tradition
gradesaver/The Monkey's Paw
gradesaver/The Prince
gradesaver/The Red Badge of Courage
gradesaver/The Rise of Silas Lapham
gradesaver/The Rivals
gradesaver/The School for Scandal
gradesaver/The Spanish Tragedy
gradesaver/The Tempest
gradesaver/The Time Machine
gradesaver/The Turn of the Screw
gradesaver/The Valley of Fear
gradesaver/Troilus and Cressida
gradesaver/Twelve Years a Slave
gradesaver/What Maisie Knew
novelguide/Henry VI Part 1
novelguide/Madame Bovary
novelguide/Merry Wives of Windsor
novelguide/Oliver Twist
novelguide/Persuasion
sparknotes/Adam Bede
sparknotes/Anne of Green Gables
sparknotes/Anthem
sparknotes/Candide
sparknotes/Dr. Jekyll and Mr. Hyde
sparknotes/Dracula
sparknotes/Emma
sparknotes/Far from the Madding Crowd
sparknotes/Frankenstein
sparknotes/Hamlet
sparknotes/Jane Eyre
sparknotes/Kidnapped
sparknotes/Northanger Abbey
sparknotes/Persuasion
sparknotes/Regeneration
sparknotes/Romeo and Juliet
sparknotes/The Brothers Karamazov
sparknotes/The House of the Seven Gables
sparknotes/The Jungle
sparknotes/The Last of the Mohicans
sparknotes/The Picture of Dorian Gray
sparknotes/The Prince
sparknotes/The Red Badge of Courage
sparknotes/The Secret Garden
sparknotes/The Turn of the Screw

jigsaw2212 commented 3 years ago

Hi @dojoteef. Thanks for raising this issue. I'm looking into this at the moment. The regexes that we use to split the aggregate summaries are not exhaustive, and at times we had to manually intervene and modify the lines in the text rather than having a regex for every outlier scenario. I am adding some more checks to try to split as many book chapters as possible from the list you've shared. Thanks again!

jigsaw2212 commented 3 years ago

My latest commit should've fixed this issue. I have tested the fix for all sources, and it should help with splitting some of the book chapters that we were not able to get before.

dojoteef commented 3 years ago

Thanks so much for looking into all these issues I've brought up! I've shifted focus for now, but I will be coming back to the BookSum dataset in the near future and will see if I encounter any additional blockers.

Thanks again for all the help!