laito / cleartk

Automatically exported from code.google.com/p/cleartk

sentence boundary detector #99

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
There was a nice paper at NAACL 2009 about sentence boundary detection that
should be straightforward to implement for ClearTK.  See:

http://www.icsi.berkeley.edu/pubs/speech/sbd_naacl_2009.pdf

Original issue reported on code.google.com by pvogren@gmail.com on 8 Jun 2009 at 2:00

GoogleCodeExporter commented 9 years ago

Original comment by pvogren@gmail.com on 10 Jun 2009 at 10:13

GoogleCodeExporter commented 9 years ago
Folks, it seems that this really needs to be done as part of the ClearTK 
project reorg that is underway, so that the example code and other modules 
that use sentence segmentation do not have to depend on the 
cleartk-syntax-opennlp project.  

I recently implemented a sentence boundary detector using ClearTK for another 
project.  The code is 
[http://code.google.com/p/biomedicus/source/browse/#svn/trunk/Biomedicus/src/main/java/edu/umn/biomedicus/sentence here], 
and I am considering either copying it over or implementing something very 
similar.  It works in three steps.  First, a pattern-based "sentence boundary" 
annotator goes through the text and creates positive and negative sentence 
boundaries.  Positive sentence boundaries will later be used to create 
sentence annotations.  Second, a classifier-based "sentence boundary" 
annotator classifies the periods.  It ignores the negative sentence boundaries 
produced by the pattern-based approach; actually, it ignores all sentence 
boundaries created by the pattern-based approach.  Finally, a sentence 
annotator creates "sentence" annotations from the "sentence boundary" 
annotations.  

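The three steps above can be sketched roughly as follows.  This is only an 
illustration under stated assumptions, not the Biomedicus code: the class name, 
the two regular expressions, and the capitalization heuristic that stands in 
for the trained classifier are all hypothetical, and no ClearTK or UIMA types 
are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiPredicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the pattern -> classifier -> sentence pipeline.
final class BoundaryPipelineSketch {

    // Illustrative patterns only (assumptions). Each match ends on the boundary character.
    private static final Pattern NEGATIVE =
            Pattern.compile("\\b(?:Dr|Mr|Mrs|Ms)\\.|\\d\\.(?=\\d)");
    private static final Pattern POSITIVE = Pattern.compile("[!?]");

    // Step 1 helper: record the last character of every match as a boundary decision.
    private static void mark(Pattern p, String text, Map<Integer, Boolean> out, boolean positive) {
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.putIfAbsent(m.end() - 1, positive);
        }
    }

    // Stand-in for a trained classifier (assumption): a period is a boundary when it is
    // the last character, or is followed by whitespace and then an uppercase letter or the end.
    static boolean looksLikeBoundary(String text, int offset) {
        int i = offset + 1;
        if (i >= text.length()) {
            return true;
        }
        if (!Character.isWhitespace(text.charAt(i))) {
            return false;
        }
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        return i >= text.length() || Character.isUpperCase(text.charAt(i));
    }

    static List<String> sentences(String text, BiPredicate<String, Integer> classifier) {
        // Offset of the boundary character -> true (positive) or false (negative).
        TreeMap<Integer, Boolean> boundaries = new TreeMap<>();

        // Step 1: pattern-based boundaries, negative and positive.
        mark(NEGATIVE, text, boundaries, false);
        mark(POSITIVE, text, boundaries, true);

        // Step 2: classify only the periods that the patterns did not already decide.
        for (int i = text.indexOf('.'); i >= 0; i = text.indexOf('.', i + 1)) {
            if (!boundaries.containsKey(i)) {
                boundaries.put(i, classifier.test(text, i));
            }
        }

        // Step 3: cut a sentence after every positive boundary.
        List<String> result = new ArrayList<>();
        int start = 0;
        for (Map.Entry<Integer, Boolean> e : boundaries.entrySet()) {
            if (e.getValue()) {
                addIfNotBlank(result, text.substring(start, e.getKey() + 1));
                start = e.getKey() + 1;
            }
        }
        addIfNotBlank(result, text.substring(start)); // trailing text without a boundary
        return result;
    }

    private static void addIfNotBlank(List<String> out, String s) {
        String t = s.trim();
        if (!t.isEmpty()) {
            out.add(t);
        }
    }

    public static void main(String[] args) {
        String text = "Dr. Smith paid 3.50 dollars. Was it worth it? Yes.";
        for (String s : sentences(text, BoundaryPipelineSketch::looksLikeBoundary)) {
            System.out.println(s);
        }
    }
}
```

With the sample text in `main`, the periods in "Dr." and "3.50" are marked 
negative by the patterns and never reach the classifier stand-in, so the text 
is split into three sentences.  In a real pipeline the `BiPredicate` would be 
replaced by a trained model.
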
I'm not sure what I should train the sentence segmenter on.  I think using the 
Penn Treebank is silly for a variety of reasons, in particular because white 
space is not preserved.  I am considering annotating sentences myself with 
data from Project Gutenberg or Wikipedia, unless you have better suggestions.  

Any suggestions about implementation or training data are welcome.  Thanks!

Original comment by pvogren@gmail.com on 1 Dec 2010 at 6:18

GoogleCodeExporter commented 9 years ago
Note that the paper you reference requires tokenization to happen *before* 
sentence segmentation, and says:

"First, proper tokenization is key. While there is not room to catalog our 
tokenizer rules, we note that both untokenized text and mismatched train-test 
tokenization can increase the error rate by a factor of 2."

The PennTreebankTokenizer assumes that it already has sentence boundaries. So 
if we actually implement the referenced paper, we'd need to change the 
tokenizer too.

As to corpora on which you could train, OpenANC has validated sentence 
boundaries in some files:

http://www.americannationalcorpus.org/OANC/index.html#annotations

Also, you could go the Treebank route but include the Brown corpus part too, 
so it's not so WSJ-ish:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42

Or you could use both. ;-)

Original comment by steven.b...@gmail.com on 1 Dec 2010 at 8:58

GoogleCodeExporter commented 9 years ago
I am going to punt on this issue for a bit - at least until after I've finished 
the project re-org.  Instead I am going to provide a simple default sentence 
annotator based on the java.text.BreakIterator that we can use for test 
purposes.  
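
For reference, a minimal sketch of what such a default could look like using 
only the JDK.  The class and method names are hypothetical, and the sketch 
returns plain character offsets rather than creating ClearTK/UIMA sentence 
annotations; the trimming of surrounding whitespace is an assumption about 
what a sentence span should cover.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: sentence offsets from java.text.BreakIterator.
final class BreakIteratorSentences {

    /** Returns [begin, end) character offsets of each sentence, without surrounding whitespace. */
    static List<int[]> sentenceSpans(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<int[]> spans = new ArrayList<>();
        int begin = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; begin = end, end = it.next()) {
            int b = begin;
            int e = end;
            // BreakIterator includes trailing whitespace in each segment; trim it off.
            while (b < e && Character.isWhitespace(text.charAt(b))) {
                b++;
            }
            while (e > b && Character.isWhitespace(text.charAt(e - 1))) {
                e--;
            }
            if (b < e) {
                spans.add(new int[] {b, e});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "This is one sentence. Here is another one! Is this the third?";
        for (int[] s : sentenceSpans(text, Locale.US)) {
            System.out.println(s[0] + "-" + s[1] + ": " + text.substring(s[0], s[1]));
        }
    }
}
```

An annotator built on this would create one sentence annotation per returned 
span.  BreakIterator is rule-based and locale-dependent, so it is likely only 
good enough for tests and examples.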

Original comment by pvogren@gmail.com on 28 Dec 2010 at 9:42

GoogleCodeExporter commented 9 years ago

Original comment by pvogren@gmail.com on 14 Jan 2011 at 9:19

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 24 Jul 2012 at 8:18

GoogleCodeExporter commented 9 years ago

Original comment by lee.becker on 17 Feb 2013 at 5:09

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 3 May 2013 at 8:44

GoogleCodeExporter commented 9 years ago

Original comment by phi...@ogren.info on 15 Mar 2014 at 5:41