google / go-tika

Go package for using Apache Tika
Apache License 2.0

Client reads every response in memory #33

Closed. tmaxmax closed this issue 3 years ago

tmaxmax commented 3 years ago

Is there a reason why the Tika client always reads the whole response body into memory using ioutil.ReadAll and then copies it again in callString? This seems unnecessary and is very inefficient, especially when sending large documents to Tika for parsing, because the full output is held in memory twice.

I've forked the repository and made some changes, and all the tests pass. I'm not opening a PR yet because I'd first like to understand why this wasn't done before. It's not obvious to me why things work this way right now, and I want to avoid breaking anything.

tbpg commented 3 years ago

I'm definitely open to adding methods that return an io.Reader. Which methods in particular do you have in mind? I don't remember any particular reason we don't return a Reader.

tmaxmax commented 3 years ago

I'm thinking of Parse and Translate, since these return full-sized documents. Should I open a PR then, and discuss the changes there?

tbpg commented 3 years ago

Go for it. :smiley: Please avoid making breaking changes. I'd rather add new methods than change the signatures of the existing ones.