nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out-of-the-box.
MIT License

🐛 doc_type='pdf' no longer works #75

Closed. matthewmcintire closed this issue 4 years ago

matthewmcintire commented 4 years ago

Describe the bug
After the latest update, pdf mode no longer works. New lines seem to always get recognized as new sentences.

To Reproduce
Steps to reproduce the behavior:
Input text - "This is a sentence\ncut off in the middle because pdf."

Expected behavior
Expected output - "This is a sentence\ncut off in the middle because pdf."

nipunsadvilkar commented 4 years ago

@matthewmcintire Hey, it's recommended to use doc_type="pdf" together with clean=True, since the cleaner trims those intermediate newlines. Note that because the original text gets modified, you would no longer be able to use the char_span functionality.
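For reference, a minimal sketch of the recommended usage. It assumes pysbd is installed and that the `Segmenter` constructor accepts `language`, `clean`, and `doc_type` keyword arguments; the helper name `segment_pdf_text` is hypothetical and only used for illustration. No output is shown because it may differ between pySBD versions.

```python
def segment_pdf_text(text):
    """Segment text extracted from a PDF, letting the cleaner handle stray newlines.

    Hypothetical helper, not part of pySBD.
    """
    # Imported lazily so this sketch can be loaded even where pysbd is missing.
    import pysbd

    # clean=True lets the cleaner rewrite the text before segmentation;
    # char_span is left at its default (off) because the text gets modified.
    segmenter = pysbd.Segmenter(language="en", clean=True, doc_type="pdf")
    return segmenter.segment(text)
```

Calling `segment_pdf_text("This is a sentence\ncut off in the middle because pdf.")` should return a list of sentence strings.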

Thanks for pointing this out. I will update the tests and raise an exception to force users to follow the usage described above.
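The kind of check described above could look roughly like the following sketch. The function name `check_segmenter_options` and the error messages are hypothetical; this is not pySBD's actual implementation, only an illustration of rejecting option combinations that conflict with the usage recommended in this thread.

```python
def check_segmenter_options(clean=False, doc_type=None, char_span=False):
    """Raise ValueError for option combinations that cannot work together.

    Hypothetical validation helper, not taken from the pySBD source.
    """
    # Cleaning modifies the text, so character offsets into the
    # original text would no longer be meaningful.
    if clean and char_span:
        raise ValueError("char_span must be False when clean=True, "
                         "because cleaning modifies the original text.")
    # PDF mode relies on the cleaner to remove mid-sentence newlines.
    if doc_type == "pdf" and not clean:
        raise ValueError("doc_type='pdf' requires clean=True.")


# Accepted: PDF mode with cleaning and without character spans.
check_segmenter_options(clean=True, doc_type="pdf", char_span=False)
```

With a check like this, the configuration from the bug report (doc_type='pdf' without clean=True) would fail immediately with an explicit error instead of silently splitting on newlines.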