moses-smt / mosesdecoder

Moses, the machine translation system
http://www.statmt.org/moses
GNU Lesser General Public License v2.1

sentence-splitter #193

Closed matanox closed 6 years ago

matanox commented 6 years ago

Hi,

Fiddling with the sentence-splitting preprocessing util, we seem to get nothing really split. Probably a usage issue. Here's what we try in a UTF-8 terminal:

$ ./split-sentences.perl -b -l en
Sentence Splitter v3
Language: en
aaa? bbb ccc dd. ...a d. ..d.a

aaa? bbb ccc dd. ...a d. ..d.a
<P>

Any ideas off the top of your head?

Thanks!

hieuhoang commented 6 years ago

You're better off asking the mailing list; hardly anyone pays attention to this forum. http://mailman.mit.edu/mailman/listinfo/moses-support Please subscribe before you post.

hieuhoang commented 6 years ago

closing this. Looks like no-one's responding to this forum

matanox commented 6 years ago

Why close, if it is essentially open?

hieuhoang commented 6 years ago

I'll reopen it, but don't be surprised if u get no response.

matanox commented 6 years ago

don't worry, I'll refer to it on the mailing list that you suggested :-)

tomekd commented 6 years ago

I know nothing about Moses' sentence splitter, but give eserix a try. I've used it from time to time.

matanox commented 6 years ago

Thanks @tomekd, do you know whether it accommodates different languages, vs. being useful just for English? We're looking for something covering a wide range of languages, not that the Moses script was necessarily perfect at that.

tomekd commented 6 years ago

Hi,

it supports the most popular languages:

Note that it's a really simple tool using SRX files.

matanox commented 6 years ago

Well, I guess it's good to learn of SRX (Segmentation Rules eXchange) now :-) Other than reading the dry spec, may I assume that the implied algorithm is a two-step flow, where a candidate break is first matched against the break=yes rules, and the break is then avoided if it matches any of the break=no rules? Are there any notable libraries that execute the rules, or notable rule repositories? I see that version 2.0 of the standard is supposed to be "safer", and that Java is lagging in the regex support required for it.

Essentially, the Perl script here has a similar flow, although it seems to struggle with introducing extra spaces that it later needs to discard, and it is arguably a bit of a hack when it comes to adapting to special domains or language registers.
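
To make the two-step flow I'm guessing at concrete, here is a minimal Python sketch. It is only my reading of the idea, not taken from the SRX spec (which may well order rule evaluation differently), and not how eserix or the Moses script work. The rule tables, the patterns in them, and the function names are all made up for illustration.

```python
import re

# Hypothetical rule tables, loosely modelled on SRX break="yes"/"no" rules.
# Each rule is (before_break_regex, after_break_regex). The patterns are
# illustrative only and are not copied from any real SRX file.
BREAK_RULES = [
    (r"[.?!]+", r"\s+"),                    # terminal punctuation, then whitespace
]
NO_BREAK_RULES = [
    (r"\b(?:Mr|Mrs|Dr|etc)\.", r"\s+"),     # a few common abbreviations
    (r"\b[A-Z]\.", r"\s+"),                 # single-letter initials
]


def _matches(rules, text, pos):
    """True if some rule's 'before' pattern ends at pos and its 'after' starts there."""
    for before, after in rules:
        if re.search(r"(?:" + before + r")\Z", text[:pos]) and re.match(after, text[pos:]):
            return True
    return False


def split_sentences(text):
    """Step 1: find candidate breaks; step 2: drop those vetoed by a no-break rule."""
    sentences, start = [], 0
    for pos in range(1, len(text)):
        if _matches(BREAK_RULES, text, pos) and not _matches(NO_BREAK_RULES, text, pos):
            sentences.append(text[start:pos].strip())
            start = pos
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith arrived. He sat down! Really? Yes."))
# ['Dr. Smith arrived.', 'He sat down!', 'Really?', 'Yes.']
```

Slicing the text at every position is quadratic, so this is only meant to show the flow, not to be used on real corpora.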

hieuhoang commented 6 years ago

looks like the mailing list got you good responses. Closing now