louismullie / treat

Natural language processing framework for Ruby.

Chunking multi-line sentences #118

Closed: stefan-pdx closed this issue 8 years ago

stefan-pdx commented 8 years ago

Hi,

I'm relatively new to Treat and am trying to figure out how to chunk text across multiple lines. For example, given the document:

In my younger and more vulnerable years my father gave me some advice
that I've been turning over in my mind ever since.

"Whenever you feel like criticizing any one," he told me, "just
remember that all the people in this world haven't had the advantages
that you've had."

When chunking that text, Treat treats (da boom CHING) each line as a separate paragraph:

>> d = document("gatsby.txt").chunk
=> Document (70295776049280)  --- "In my younger [...] you've had.\""  ---  {:file=>"gatsby.txt", :format=>"txt"}   --- []
>> d.print_tree
+ Document (70295776049280)  --- "In my younger [...] you've had.\""  ---  {:file=>"gatsby.txt", :format=>"txt"}   --- []
|
+--> Paragraph (70295776045740)  --- "In my younger [...] some advice"  ---  {}   --- []
+--> Paragraph (70295776043380)  --- "that I've been [...] ever since."  ---  {}   --- []
+--> Paragraph (70295802116360)  --- "\"Whenever you feel [...] me, \"just"  ---  {}   --- []
+--> Paragraph (70295802114580)  --- "remember that all [...] the advantages"  ---  {}   --- []
+--> Paragraph (70295802112840)  --- "that you've had.\""  ---  {}   --- []

I would expect there to be two paragraphs, separated at the blank line. Does Treat support this parsing behavior? If not, are there any pre-processing strategies for joining the line breaks within a paragraph?

Thanks!
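One possible pre-processing approach, as a rough sketch rather than anything Treat provides: unwrap the hard-wrapped lines before handing the text to Treat, so that each paragraph sits on a single line. The helper name `unwrap_paragraphs` is made up for illustration, and the sketch assumes that blank lines are the only paragraph separators in the input.

```ruby
# Hypothetical helper (not part of Treat): join hard-wrapped lines so that
# every paragraph ends up on exactly one line.
# Assumption: paragraphs are separated by one or more blank lines.
def unwrap_paragraphs(text)
  text
    .gsub(/\r\n?/, "\n")   # normalize Windows/old-Mac line endings
    .split(/\n\s*\n/)      # split on blank (or whitespace-only) lines
    .map { |para| para.split("\n").map(&:strip).reject(&:empty?).join(" ") }
    .reject(&:empty?)      # drop empty chunks from leading/trailing blanks
    .join("\n")            # one paragraph per line
end

sample = <<~TEXT
  In my younger and more vulnerable years my father gave me some advice
  that I've been turning over in my mind ever since.

  "Whenever you feel like criticizing any one," he told me, "just
  remember that all the people in this world haven't had the advantages
  that you've had."
TEXT

puts unwrap_paragraphs(sample)
```

The result could be written to a new file and loaded the same way as `gatsby.txt` above. Since the tree output shows one Paragraph per input line, the unwrapped text should then give one Paragraph per real paragraph, though that has not been verified against Treat here.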

stefan-pdx commented 8 years ago

Ah, after looking at the implementation of the default txt Chunker, I see that it treats each line as a separate zone. I assume that a custom Chunker has to be written.
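For illustration, here is a sketch of the grouping logic such a custom chunker would need: walk the lines, buffer them, and emit a zone only when a blank line (or the end of the text) is reached. This is plain Ruby and deliberately does not show how to register a worker with Treat; the method name `blank_line_zones` is hypothetical.

```ruby
# Hypothetical helper (not Treat's API): group lines into zones, starting a
# new zone only at blank lines instead of at every line break.
def blank_line_zones(text)
  zones  = []
  buffer = []

  text.each_line do |line|
    line = line.strip
    if line.empty?
      # A blank line closes the current zone, if there is one.
      zones << buffer.join(" ") unless buffer.empty?
      buffer = []
    else
      buffer << line
    end
  end

  # Flush the last zone when the text does not end with a blank line.
  zones << buffer.join(" ") unless buffer.empty?
  zones
end

text = "first line\nsecond line\n\nthird line\nfourth line\n"
per_line   = text.split("\n").reject(&:empty?)  # line-by-line splitting
per_blank  = blank_line_zones(text)             # blank-line splitting
puts per_line.size
puts per_blank.size
```

Each string returned by `blank_line_zones` would correspond to one zone, so the chunker body would create an entity per string instead of per line.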

stefan-pdx commented 8 years ago

For others who come across a similar question: the documentation briefly covers how to create additional workers, which can be used to accomplish something like this.