mguterl / textractor

A ruby library that provides a simple wrapper for CLI tools to extract text from PDF and Word documents.
MIT License

Strip formatting #1

Open bamnet opened 13 years ago

bamnet commented 13 years ago

When I run the PDF tests, I get output that looks like this:

    Textractor returns the contents of pdf documents
      Failure/Error: Textractor.text_from_path(fixture_path("document.pdf")).should == 'text'
        expected: "text",
             got: "text\t\r \302\240 \t\r \302\240" (using ==)

My pdftotext version must handle formatting characters differently from yours. Do you think this is something textractor should handle?

In my use case I never care about the document formatting. I only want strings separated by spaces, with a limited subset of punctuation (periods and commas), for use in indexing documents. I don't mind handling this in each application, but I'd be glad to write it into textractor if you think there's value in that.
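As a rough illustration of the per-application cleanup described above, here is a minimal sketch. The module and method names (`IndexText`, `normalize`) are hypothetical and not part of textractor. It assumes Ruby 1.9 or later and UTF-8 input, where `[[:space:]]` also matches the non-breaking space.

```ruby
# Hypothetical helper for illustration only; not part of textractor.
module IndexText
  # Anything that is not a letter, digit, whitespace, period, or comma.
  UNWANTED = /[^[:alnum:][:space:].,]/
  # Any run of whitespace, including the non-breaking space (U+00A0).
  WHITESPACE_RUN = /[[:space:]]+/

  # Returns words separated by single spaces, keeping only periods and commas.
  def self.normalize(text)
    text.gsub(UNWANTED, "").gsub(WHITESPACE_RUN, " ").strip
  end
end

puts IndexText.normalize("Hello,\tworld!\u00A0 (PDF) text.\r\n")
```

Whitespace runs are collapsed to plain ASCII spaces before the final `strip`, so the built-in method is enough for the last step.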

mguterl commented 13 years ago

I'm going to have to think about this some more. It looks like all of the extractors call String#strip to remove leading and trailing whitespace. Those characters appear to be whitespace of some kind (\302\240 is the UTF-8 encoding of a non-breaking space), and if that is the case, I'm fine with coming up with a replacement for String#strip that grabs these characters too.
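One possible shape for such a replacement is sketched below. This is an assumption about how it could be done, not the library's actual implementation, and the name `WhitespaceStripper` is hypothetical. It relies on Ruby 1.9 or later and UTF-8 encoded input, where the POSIX bracket `[[:space:]]` is Unicode-aware and matches U+00A0, which the built-in String#strip leaves in place.

```ruby
# Hypothetical helper for illustration only; not part of textractor.
module WhitespaceStripper
  # Leading or trailing runs of whitespace. [[:space:]] covers tabs,
  # carriage returns, newlines, spaces, and the non-breaking space (U+00A0).
  EDGE_WHITESPACE = /\A[[:space:]]+|[[:space:]]+\z/

  # Like String#strip, but also removes non-breaking spaces at the edges.
  def self.strip(text)
    text.gsub(EDGE_WHITESPACE, "")
  end
end

# The trailing characters from the failing spec, written with \u00A0
# in place of the raw bytes \302\240.
puts WhitespaceStripper.strip("text\t\r \u00A0 \t\r \u00A0").inspect
```

Whitespace inside the text is left untouched, so the helper could stand in wherever the extractors currently call String#strip.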