soskek / bookcorpus

Crawl BookCorpus
MIT License
812 stars 110 forks

Fix merging sentences in one paragraph #7

Closed yoquankara closed 5 years ago

yoquankara commented 5 years ago

This PR simply merges the sentences in the stack whenever it meets an empty line. I am not sure why blank was necessary in the first place, so let's discuss it if I'm missing something here.

Consider an example from the opening section of out_txts/100021__three-plays.txt. The current implementation outputs:

Three Plays Published by Mike Suttons at Smashwords Copyright 2011 Mike Sutton ISBN 978-1-4659-8486-9 Tripping on Nothing

It obviously merged the paragraph title Tripping on Nothing into the stack incorrectly. With this PR, the output is:

Three Plays Published by Mike Suttons at Smashwords Copyright 2011 Mike Sutton ISBN 978-1-4659-8486-9

Tripping on Nothing

soskek commented 5 years ago

Thank you! To be honest, I don't remember why I tackled the double-blank completely :)

Anyway, ignoring line breaks is required for parsing books whose text is wrapped. So I guess the old me had observed, and tried to handle, some books that wrap text using one (e.g. 342391__dantes-inferno-a-discussion-guide.txt) or EVEN MORE blank lines for decorative purposes.

But I didn't find any such double-blank text wrapping. Handling only the single-blank case is enough. After some further checks, I'll merge the PR! Thank you again!

yoquankara commented 5 years ago

Thank you for your review!

I think the fix still works for double-blank text, because it merges the stack (if not empty) when reaching the first blank line, then ignores any subsequent blank lines.
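
The behavior described here can be sketched roughly as follows. This is an illustration only, not the code from the PR: the function name `merge_paragraphs` and the sample input are hypothetical, and the sketch assumes the input is already a list of lines.

```python
def merge_paragraphs(lines):
    """Join wrapped lines into paragraphs, flushing at the first blank line.

    Hypothetical helper for illustration; not the repository's function.
    """
    paragraphs = []
    stack = []  # lines of the paragraph currently being assembled
    for line in lines:
        text = line.strip()
        if text:
            # Non-blank line: it belongs to the current paragraph.
            stack.append(text)
        elif stack:
            # First blank line after some text: emit the paragraph.
            paragraphs.append(" ".join(stack))
            stack = []
        # Any further blank line finds an empty stack and is skipped.
    if stack:
        # Flush whatever is left when the input does not end with a blank.
        paragraphs.append(" ".join(stack))
    return paragraphs


# Made-up input: two wrapped lines, two blank lines, then a title.
sample = ["first wrapped", "line", "", "", "A Title"]
for paragraph in merge_paragraphs(sample):
    print(paragraph)
```

Because the stack is emptied at the first blank line, a second or third blank line has nothing to flush, so single-blank and double-blank separators produce the same paragraphs.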

I also checked the file you mentioned, 342391__dantes-inferno-a-discussion-guide.txt. Things seem ok : )

yoquankara commented 5 years ago

Hi, how is the check going?

soskek commented 5 years ago

I didn't come across any disaster! LGTM! Thank you! I merged!