Which setting is the best for scientific sentence segmentation with inline citations and potential parsing error

segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.

MIT License

712 stars 41 forks source link

I am dealing with sentence segmentation of scientific papers with inline citations (and potential parsing error).

It seems this tool cannot process the incline citations adaptively, such as (Author, Year), (Author), Author (Year), or [1], etc.

e.g.

sat = SaT("sat-3l”)
text = "Wang et al. (2021) analyzed fauxtography images in social media posts and found that posts with doctored images increase user engagement in the form of re-shares, likes, and comments, specifically in Twitter and Reddit.”
output = sat.split(text) # output ['Wang et al.', '(2021)', 'analyzed fauxtography …’]

Is there any setting (e.g. style_or_domain, model, threshold, etc.) can be adjusted to better deal this scenario?

I tried the spacy and nltk, none of them works. So I still hope this kind of tool can provide a good solution.

segment-any-text / wtpsplit

Which setting is the best for scientific sentence segmentation with inline citations and potential parsing error #138