Sentence segmentation with apostrophes

perseids-project / perseids_treebanking

Repository for holding files related to student treebanking projects under Perseids (obsolete)


Sentence segmentation with apostrophes #1

Open fbaumgardt opened 11 years ago

fbaumgardt commented 11 years ago

Apostrophes (ʼ) are not parsed correctly. They sometimes appear in pairs to mark quotations. The second (closing) apostrophe usually gets assigned to the following sentence, and if there is no following sentence (i.e. at the end of a chapter), it is assigned its own sentence with length=1. You can find those locations by searching for the regular expression "1".*\n\s{3}</.
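A minimal sketch of running that search from a script, in case it helps to list the affected locations. The pattern is the one quoted above; the function name `find_orphan_lines` and the sample XML (element names, attributes, indentation) are hypothetical and only meant to mimic a one-word sentence followed by a closing tag indented by three spaces.

```python
import re

# Pattern quoted in the report: a line containing "1", followed on the next
# line by three whitespace characters and the start of a closing tag.
ORPHAN_PATTERN = re.compile(r'"1".*\n\s{3}</')


def find_orphan_lines(xml_text):
    """Return the 1-based line numbers on which the pattern starts."""
    return [
        xml_text.count("\n", 0, match.start()) + 1
        for match in ORPHAN_PATTERN.finditer(xml_text)
    ]


# Hypothetical sample; the real treebank files may be laid out differently.
SAMPLE = (
    '<treebank>\n'
    '   <sentence id="7">\n'
    '      <word id="1" form="a"/>\n'
    '      <word id="2" form="b"/>\n'
    '   </sentence>\n'
    '   <sentence id="8">\n'
    '      <word id="1" form="\u02bc"/>\n'
    '   </sentence>\n'
    '</treebank>\n'
)

if __name__ == "__main__":
    print(find_orphan_lines(SAMPLE))  # prints [7]
```

Only the single-word sentence matches, because the word with id "1" in the longer sentence is followed by another word rather than by a closing tag.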

I am not familiar with the sentence id scheme used here. How can we fix a bug that affects sentence segmentation?

balmas commented 10 years ago

This is a bug in the old Perseus segmentation code. It should be noted as a requirement for the Annotation Service and for any tokenization services we use in Perseids.
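To make the requirement concrete, here is a rough sketch of the behaviour a tokenization service would need. It is not the Perseus code or any existing Perseids API. It assumes that sentences arrive as lists of token strings, that quotation apostrophes are standalone tokens, and that the character in question is U+02BC; the function name `reattach_closing_apostrophes` is hypothetical.

```python
# Assumption: the apostrophe in the report is U+02BC (MODIFIER LETTER APOSTROPHE).
APOSTROPHE = "\u02bc"


def reattach_closing_apostrophes(sentences):
    """Move a leading apostrophe back to the previous sentence when that
    sentence contains an odd number of apostrophes (an unclosed quotation).
    Sentences left empty by the move are dropped."""
    fixed = []
    for tokens in sentences:
        tokens = list(tokens)  # copy so the caller's data is not modified
        if (
            fixed
            and tokens
            and tokens[0] == APOSTROPHE
            and fixed[-1].count(APOSTROPHE) % 2 == 1
        ):
            fixed[-1].append(tokens.pop(0))
        if tokens:
            fixed.append(tokens)
    return fixed


if __name__ == "__main__":
    segmented = [
        [APOSTROPHE, "a", "b", "."],  # quotation opened, never closed here
        [APOSTROPHE, "c", "."],       # closing apostrophe landed in the next sentence
        [APOSTROPHE, "d", "."],       # a new quotation opens
        [APOSTROPHE],                 # end of chapter: apostrophe alone, length 1
    ]
    for sentence in reattach_closing_apostrophes(segmented):
        print(sentence)
```

The parity check is a naive heuristic: it would misfire wherever an apostrophe token marks something other than a quotation, so a real service would need a more careful rule.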