mediacloud / sentence-splitter

Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.

Making the splitter.split() not split at sentences within quotation(s) #5

Closed wenestam closed 3 years ago

wenestam commented 3 years ago

Hi!

I'm currently working with news article data and I'm using your amazing sentence splitter. I have an issue: I don't want to split sentences that are inside quotation(s). E.g.:

splitter.split(""" "I hereby present you the new machine. 'It is legendary. Supreme. Machine.' the daily news said." """)

becomes: '"I hereby present you the new machine.', "'It is legendary.", 'Supreme.', 'Machine.\' the daily news said.".'

and I need it to become: "I hereby present you the new machine. 'It is legendary. Supreme. Machine.' the daily news said."

So that it keeps everything inside the quotation intact. I have been trying different regex patterns with no success. I have also tried tinkering with the source code of sentence-splitter.

Do you have any tips or tricks? Is there anything I can change in the package code to make it ignore splitting sentences inside quotations?

Thanks in advance

pypt commented 3 years ago

Hi, dunno, sorry. Our thinking about this is that "It is legendary. Supreme. Machine." are three separate sentences, so we're going to keep it that way.

IIRC NLTK's tokenizers might be able to do what you want to do: https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.treebank, but I might be wrong.
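
For anyone landing here with the same need, one possible workaround is to leave the splitter alone and post-process its output, gluing pieces back together while a double quote is still open. The following is only a minimal sketch, not part of sentence-splitter: `merge_quoted` is a hypothetical helper, the input list is hand-written to resemble the output shown above, and it assumes straight double quotes (") that are balanced. Single quotes are ignored on purpose because they are indistinguishable from apostrophes.

```python
def merge_quoted(sentences, quote='"'):
    """Rejoin consecutive pieces until the quote characters seen so far balance.

    Hypothetical helper: `sentences` is any list of strings, e.g. the list
    returned by a sentence splitter.
    """
    merged = []
    buffer = []
    open_quote = False
    for piece in sentences:
        buffer.append(piece)
        # An odd number of quote characters flips the open/closed state.
        if piece.count(quote) % 2 == 1:
            open_quote = not open_quote
        if not open_quote:
            merged.append(" ".join(buffer))
            buffer = []
    if buffer:
        # Quote never closed: emit the remainder instead of dropping it.
        merged.append(" ".join(buffer))
    return merged


# Hand-written pieces modelled on the split output in the question.
pieces = [
    '"I hereby present you the new machine.',
    "'It is legendary.",
    "Supreme.",
    "Machine.' the daily news said.\"",
]
result = merge_quoted(pieces)
print(result)
```

With the pieces above, everything between the opening and closing double quote ends up in a single string, while sentences outside any quotation pass through unchanged. Typographic quotes (“ ”) or an unmatched quote in the text would need extra handling.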

wenestam commented 3 years ago

Thank you so much for the quick response :) Have a nice weekend