santhoshtr / wikisentences

Program to create a sentence dataset from Wikipedia dumps
MIT License

Replace the sentence segmenter #1

Closed: santhoshtr closed this issue 10 months ago

santhoshtr commented 1 year ago

The current sentence segmenter used in this project is a very minimal one. It has several limitations.

Replace it with https://github.com/santhoshtr/sentencesegmenter
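
To illustrate the kind of limitation a minimal segmenter typically runs into, here is a hypothetical sketch. It is not the code used in this project; `naive_segment` and the sample text are made up for illustration. It splits after `.`, `!` or `?` followed by whitespace, which breaks on abbreviations:

```python
import re


def naive_segment(text: str) -> list[str]:
    """Hypothetical minimal segmenter: split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


# Illustrative input containing two real sentences and two abbreviations.
text = "Dr. Smith moved to Washington D.C. in 1990. He taught there."
sentences = naive_segment(text)

for sentence in sentences:
    print(repr(sentence))
```

The input has two sentences, but this rule yields four pieces because it also splits after "Dr." and "D.C.". Handling abbreviations, and doing so across many languages, is the sort of thing a dedicated segmenter is meant to address.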

dpriskorn commented 10 months ago

Which one are you using? I'm using spaCy in https://github.com/dpriskorn/riksdagen_sentences

dpriskorn commented 10 months ago

It seems your segmenter is way faster than spaCy and also more accurate. :)

santhoshtr commented 10 months ago

Currently sentencex is used for segmentation. See https://diff.wikimedia.org/2023/10/23/sentencex-empowering-nlp-with-multilingual-sentence-extraction/ for details. spaCy does not support many languages.