Strip formatting - Githubissues

mguterl / textractor

A ruby library that provides a simple wrapper for CLI tools to extract text from PDF and Word documents.

MIT License

13 stars, 4 forks

When I run the PDF tests, I get output that looks like this:

    Textractor returns the contents of pdf documents
      Failure/Error: Textractor.text_from_path(fixture_path("document.pdf")).should == 'text'
        expected: "text",
             got: "text\t\r \302\240 \t\r \302\240" (using ==)

My pdftotext version must handle formatting characters differently from yours. Do you think this is something textractor should handle?

In my use case I never care about the document formatting. I only want strings separated by single spaces, with a limited subset of punctuation (i.e., periods and commas), for use in indexing documents. I don't mind handling this in each application, but I'd be glad to write it into textractor if you think there's value in that.
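To make the proposal concrete, here is a minimal sketch of the kind of normalization I have in mind. The name `strip_formatting` is hypothetical and not part of textractor's API, and it assumes the extracted text is UTF-8 (the `\302\240` bytes in the failure above are a UTF-8 non-breaking space).

```ruby
# Hypothetical helper, not part of textractor's API.
# Assumes the extracted text is UTF-8 encoded.
def strip_formatting(text)
  text
    .dup
    .force_encoding(Encoding::UTF_8) # treat the raw bytes as UTF-8
    .scrub(" ")                      # replace any invalid byte sequences with a space
    .gsub(/[^[:alnum:].,]+/, " ")    # collapse runs of anything except letters, digits, periods, commas
    .strip                           # drop leading and trailing spaces
end
```

With this, the string from the failing spec would reduce to plain `text`, and other punctuation is replaced by a single space so adjacent words are not joined together.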

Strip formatting #1