ixa-ehu / ixa-pipe-tok

IXA pipes sentence segmenter and tokenizer (http://ixa2.si.ehu.es/ixa-pipes).
Apache License 2.0

Paragraph break should imply sentence boundary #3

Closed: vanatteveldt closed this issue 8 years ago

vanatteveldt commented 8 years ago

Currently (in the Dutch tokenizer), a sentence can span paragraphs if no sentence boundary is detected at the end of a paragraph, e.g. when the last line of the paragraph has no final period.

In particular, the string "Dit is de kop\n\nEn een artikel. Met een tweede zin." yields the following text layer:

  <text>
    <wf id="w1" offset="0" length="3" sent="1" para="1">Dit</wf>
    <wf id="w2" offset="4" length="2" sent="1" para="1">is</wf>
    <wf id="w3" offset="7" length="2" sent="1" para="1">de</wf>
    <wf id="w4" offset="10" length="3" sent="1" para="1">kop</wf>
    <wf id="w5" offset="15" length="2" sent="1" para="2">En</wf>
    <wf id="w6" offset="18" length="3" sent="1" para="2">een</wf>
    <wf id="w7" offset="22" length="7" sent="1" para="2">artikel</wf>
    <wf id="w8" offset="29" length="1" sent="1" para="2">.</wf>
    <wf id="w9" offset="31" length="3" sent="2" para="2">Met</wf>
    <wf id="w10" offset="35" length="3" sent="2" para="2">een</wf>
    <wf id="w11" offset="39" length="6" sent="2" para="2">tweede</wf>
    <wf id="w12" offset="46" length="3" sent="2" para="2">zin</wf>
    <wf id="w13" offset="49" length="1" sent="2" para="2">.</wf>
  </text>

I would think that a sentence should always end when a paragraph ends, or is there some substantive reason for keeping it like this?
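To make the problem easy to check on other inputs, here is a minimal Python sketch that lists every sentence whose tokens fall in more than one paragraph. It only assumes the `sent` and `para` attributes shown in the output above; the function name `sentences_spanning_paragraphs` and the shortened `SAMPLE` are made up for illustration and are not part of ixa-pipe-tok.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def sentences_spanning_paragraphs(text_layer_xml):
    """Map each sentence id to its paragraph ids, keeping only
    sentences whose tokens belong to more than one paragraph."""
    root = ET.fromstring(text_layer_xml)
    paras_by_sent = defaultdict(set)
    for wf in root.iter("wf"):
        paras_by_sent[wf.get("sent")].add(wf.get("para"))
    return {s: sorted(p) for s, p in paras_by_sent.items() if len(p) > 1}


# Three tokens taken from the text layer above (hypothetical shortened sample).
SAMPLE = """<text>
  <wf id="w4" offset="10" length="3" sent="1" para="1">kop</wf>
  <wf id="w5" offset="15" length="2" sent="1" para="2">En</wf>
  <wf id="w9" offset="31" length="3" sent="2" para="2">Met</wf>
</text>"""

print(sentences_spanning_paragraphs(SAMPLE))  # {'1': ['1', '2']}
```

An empty result would mean that no sentence crosses a paragraph break, which is the behaviour I would expect.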

ragerri commented 8 years ago

I have changed the code so that the sentence counter is updated when paragraph marks are encountered. Give it a try and let me know if it works better now.
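For readers following along, the intended behaviour can be sketched roughly as below. This is not the actual ixa-pipe-tok code (which is written in Java); it is a simplified Python illustration that assumes paragraphs are separated by blank lines and that `.`, `!` and `?` end a sentence. The function name `number_tokens` is hypothetical.

```python
import re

SENTENCE_FINAL = {".", "!", "?"}  # assumed sentence-final tokens


def number_tokens(text):
    """Return (token, sent, para) tuples; a paragraph break always
    closes the current sentence, even without final punctuation."""
    rows = []
    sent = 1
    # Assumption: one or more blank lines separate paragraphs.
    for para, block in enumerate(re.split(r"\n\s*\n", text.strip()), start=1):
        # Naive tokenization: runs of word characters, or single punctuation marks.
        tokens = re.findall(r"\w+|[^\w\s]", block)
        for i, tok in enumerate(tokens):
            rows.append((tok, sent, para))
            last_in_para = i == len(tokens) - 1
            if tok in SENTENCE_FINAL or last_in_para:
                sent += 1  # advance the counter once per boundary
    return rows


rows = number_tokens("Dit is de kop\n\nEn een artikel. Met een tweede zin.")
for tok, sent, para in rows:
    print(tok, sent, para)
```

With the input from the report, "Dit is de kop" gets sentence 1, "En een artikel ." gets sentence 2 and "Met een tweede zin ." gets sentence 3, so no sentence spans both paragraphs.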