tsproisl / SoMaJo

A tokenizer and sentence splitter for German and English web and social media texts.
GNU General Public License v3.0

SRX for sentence_splitter #31

Open fewzee opened 5 months ago

fewzee commented 5 months ago

Hi Thomas, I am hoping to use SoMaJo's sentence_splitter in Rust, and I am wondering whether it would be possible to formulate it in terms of SRX rules. I would happily contribute to making that happen, but I wanted to check with you regarding feasibility before going forward.

tsproisl commented 3 months ago

Hi,

Unfortunately, I’m not familiar with Rust or with SRX rules. From the link you posted, it seems like it should be possible to express SoMaJo’s sentence splitting rules in SRX. However, I’m not sure whether that is what you want. SoMaJo’s sentence splitter operates on tokenized input, i.e. the input text is first tokenized and only then passed to the sentence splitter. So even if you converted the sentence splitter, you would still need to run SoMaJo’s tokenizer (or another tokenizer) first.
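
To illustrate the point about tokenized input, here is a minimal sketch of a two-stage pipeline in which the sentence splitter only ever sees tokens. It does not use SoMaJo and does not reproduce its rules: `ABBREVIATIONS`, `toy_tokenize` and `toy_split_sentences` are hypothetical names invented for this example, and the logic is deliberately simplistic.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"Dr.", "z.B.", "e.g.", "etc."}


def toy_tokenize(text):
    """Split text into tokens, keeping known abbreviations in one piece."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        # Separate runs of word characters from single punctuation marks.
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


def toy_split_sentences(tokens):
    """Group tokens into sentences, breaking after a sentence-final punctuation token."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack final punctuation
        sentences.append(current)
    return sentences


text = "Dr. Meier kommt heute. Kommst du auch?"
sentences = toy_split_sentences(toy_tokenize(text))
for sentence in sentences:
    print(" ".join(sentence))
```

In this sketch the splitter never has to decide whether the period in "Dr." ends a sentence, because the tokenizer has already attached it to the abbreviation. A rule set that works directly on raw text would have to encode that decision itself, which is the extra work a port without a tokenizer would take on.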