mediacloud / sentence-splitter

Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.

French Model - Failed to correctly split an easy example #1

Closed ghaddarAbs closed 5 years ago

ghaddarAbs commented 5 years ago

Great work. It seems odd to me that the algorithm couldn't split this sentence correctly:

(« BPP »), une ancienne filiale en propriété exclusive de Brookfield Office Properties Inc. (« BOPI »), dont les actifs liés aux immeubles directement détenus ont été transférés à la Fiducie.

The algorithm treats it as 2 sentences (splitting at "Inc."), though the example isn't hard.

ghaddarAbs commented 5 years ago

Also, the algorithm splits a sentence that starts with an enumeration marker containing a dot (1. XXXXXX YYYYYY ......) into 2 sentences, as follows: 1. XXXXXX YYYYYY ........

pypt commented 5 years ago

Hi!

Cases like this are handled by so-called "non-breaking prefixes". The file of non-breaking prefixes for French is here:

https://github.com/berkmancenter/mediacloud-sentence-splitter/blob/develop/sentence_splitter/non_breaking_prefixes/fr.txt

It would be very helpful if you helped us out by submitting a PR with some more such prefixes added, or at least with some test cases for the French language.
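
For readers unfamiliar with the idea, here is a minimal, self-contained sketch of how a non-breaking prefix list can suppress a split. It is a simplified illustration, not the library's actual implementation: the function `split_sentences`, the set `FRENCH_PREFIXES` and its contents are hypothetical, and treating every bare number before a period as non-breaking is an assumption made only to cover the enumeration case reported above.

```python
# Hypothetical prefix set for illustration only; the real fr.txt may differ.
FRENCH_PREFIXES = {"Inc", "M", "Mme", "etc"}

# Characters that may open a new sentence besides an uppercase letter (assumption).
OPENERS = "«(\"'"


def split_sentences(text, prefixes=FRENCH_PREFIXES):
    """Split whitespace-separated text into sentences with a prefix check.

    A token ending in '.', '!' or '?' closes a sentence only if the next
    token starts with an uppercase letter, a digit, or an opening mark,
    and the token is neither a known prefix nor a bare number plus '.'.
    """
    words = text.split()
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if i == len(words) - 1:
            break  # the last token always closes the final sentence
        if not word.endswith((".", "!", "?")):
            continue
        if word.endswith("."):
            stem = word[:-1]
            # Non-breaking prefix ("Inc.") or enumeration marker ("1."): no split.
            if stem in prefixes or stem.isdigit():
                continue
        next_char = words[i + 1][0]
        if not (next_char.isupper() or next_char.isdigit() or next_char in OPENERS):
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


example = (
    "(« BPP »), une ancienne filiale en propriété exclusive de Brookfield "
    "Office Properties Inc. (« BOPI »), dont les actifs liés aux immeubles "
    "directement détenus ont été transférés à la Fiducie."
)

with_prefix = split_sentences(example)            # "Inc" is listed: no split
without_prefix = split_sentences(example, set())  # "Inc" missing: split at "Inc."
```

With "Inc" in the prefix set the example stays a single sentence; with an empty set the sketch reproduces the reported split after "Inc.". The bare-number rule is deliberately crude and would also keep a sentence that genuinely ends in a number from splitting.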

pypt commented 5 years ago

Fixed in v1.4.