segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Which setting is best for scientific sentence segmentation with inline citations and potential parsing errors #138

Open realliyifei opened 4 days ago

realliyifei commented 4 days ago

I am dealing with sentence segmentation of scientific papers that contain inline citations (and potential parsing errors).

It seems this tool cannot handle inline citations adaptively, such as (Author, Year), (Author), Author (Year), or [1], etc.

e.g.

from wtpsplit import SaT

sat = SaT("sat-3l")
text = "Wang et al. (2021) analyzed fauxtography images in social media posts and found that posts with doctored images increase user engagement in the form of re-shares, likes, and comments, specifically in Twitter and Reddit."
output = sat.split(text)  # ['Wang et al.', '(2021)', 'analyzed fauxtography …']

Is there any setting (e.g. style_or_domain, model, threshold, etc.) that can be adjusted to better handle this scenario?

I tried spaCy and NLTK, and neither of them works either. So I still hope this kind of tool can provide a good solution.

markus583 commented 20 hours ago

Hi, this sounds like it could be quite a bit out-of-domain, since we are dealing with neither regular sentences nor paragraphs. None of our models or LoRA modules were specifically trained for this task, although I assume the larger models (12l/12l-sm) will be better able to handle it. One way to get good performance on this task is to adapt our models via LoRA on your data (see the README). Alternatively, you can try adjusting the threshold: lowering it results in more splitting, raising it in less, so for your example, which is over-split, raising it may help.
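
For what it's worth, here is a minimal sketch of these options, assuming the sat-12l-sm checkpoint, the threshold keyword of split, and the style_or_domain/language constructor arguments as shown in the README; the 0.5 threshold is an arbitrary illustration, not a tuned value:

from wtpsplit import SaT

text = (
    "Wang et al. (2021) analyzed fauxtography images in social media posts "
    "and found that posts with doctored images increase user engagement in "
    "the form of re-shares, likes, and comments, specifically in Twitter and Reddit."
)

# Option 1: a larger model, which should generalize better out-of-domain.
sat_large = SaT("sat-12l-sm")
print(sat_large.split(text))

# Option 2: raise the threshold so that only high-confidence sentence
# boundaries are kept; 0.5 is an illustrative value, not a tuned setting.
sat = SaT("sat-3l")
print(sat.split(text, threshold=0.5))

# Option 3: load a pretrained LoRA module for a closer sentence style,
# e.g. Universal Dependencies style for English (see the README for the
# list of available style_or_domain / language combinations).
sat_adapted = SaT("sat-3l", style_or_domain="ud", language="en")
print(sat_adapted.split(text))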