matthewmcintire closed this issue 4 years ago
@matthewmcintire Hey, it's recommended to use doc_type="pdf" mode along with clean=True, since the cleaner trims those intermediate newlines. Note that you would then no longer be able to use the char_span functionality, because the original text gets modified.
Thanks for pointing this out. I will update the tests to raise an exception and force the user to follow the usage described above.
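To illustrate the trade-off described in the comment above, here is a minimal standard-library sketch. It is not the library's implementation: `clean_pdf_newlines` and `split_sentences` are hypothetical helpers that stand in for a cleaner joining PDF line wraps and a naive splitter that treats newlines as boundaries. It shows why the newline disappears once the text is cleaned, and why character spans computed on the cleaned text no longer point into the original string.

```python
import re
from typing import List


def clean_pdf_newlines(text: str) -> str:
    """Hypothetical cleaner: join lone newlines left by PDF line wrapping."""
    # A single newline (not part of a blank-line paragraph break) becomes a space.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


def split_sentences(text: str) -> List[str]:
    """Naive splitter: newlines and sentence-final punctuation are boundaries."""
    parts = re.split(r"\n+|(?<=[.!?])\s+", text)
    return [p for p in parts if p]


original = "This is a sentence\ncut off in the middle because pdf."
cleaned = clean_pdf_newlines(original)

# Without cleaning, the newline splits the sentence in two.
raw_sentences = split_sentences(original)

# With cleaning, the sentence stays whole.
clean_sentences = split_sentences(cleaned)

# The cleaned sentence is no longer a substring of the original text,
# so offsets into the original cannot be recovered from it.
span_recoverable = clean_sentences[0] in original

print(raw_sentences)
print(clean_sentences)
print(span_recoverable)
```

With the library itself, the combination recommended in the comment would presumably be written as `pysbd.Segmenter(language="en", clean=True, doc_type="pdf")`, leaving `char_span` at its default; treat that call as an assumption about the installed version rather than a guarantee.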
Describe the bug
After the latest update, pdf mode no longer works. New lines seem to always get recognized as new sentences.

To Reproduce
Steps to reproduce the behavior:
Input text - "This is a sentence\ncut off in the middle because pdf."

Expected behavior
Expected output - "This is a sentence\ncut off in the middle because pdf."