Open realliyifei opened 4 days ago
Hi, this sounds like it could be quite far out-of-domain, since we are dealing with neither regular sentences nor paragraphs. None of our models or LoRA modules were specifically trained for this task, although I assume the larger models (12l/12l-sm) will be better able to handle it. One way to get good performance is to adapt our models via LoRA on your task (see the README). Alternatively, you can try lowering the threshold, which should result in more splitting.
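To illustrate why a lower threshold leads to more splitting, here is a minimal, library-independent sketch. It assumes the model assigns each character a boundary probability and that a split happens wherever that probability exceeds the threshold. The function `split_at_threshold`, the example text, and the probability values are all hypothetical and made up for illustration; they are not real model outputs or part of the library's API.

```python
def split_at_threshold(text, boundary_probs, threshold):
    """Split after every character whose boundary probability exceeds threshold.

    Simplification for illustration: surrounding whitespace is stripped
    from each segment.
    """
    if len(text) != len(boundary_probs):
        raise ValueError("need exactly one probability per character")
    sentences, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p > threshold:
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    # Keep any trailing text that did not end in a predicted boundary.
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences


# Hypothetical input: the second sentence lost its period (a parsing error).
text = "We follow prior work [1]. Results improve (Smith, 2020) Further analysis follows."

# Made-up probabilities: near zero everywhere except two candidate boundaries.
probs = [0.01] * len(text)
probs[text.index("[1].") + 3] = 0.9   # the "." after "[1]": confident boundary
probs[text.index("2020)") + 4] = 0.2  # the ")" with no period: uncertain boundary

default_split = split_at_threshold(text, probs, threshold=0.5)
lowered_split = split_at_threshold(text, probs, threshold=0.1)

print(default_split)
print(lowered_split)
```

With these made-up numbers, the higher threshold keeps the citation-final sentence merged with the following one, while the lower threshold separates them. The trade-off is that a lower threshold can also introduce spurious splits, so it is worth checking on a sample of your own data.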
I am working on sentence segmentation of scientific papers that contain inline citations (and potential parsing errors).
It seems this tool cannot adapt to inline citations such as (Author, Year), (Author), Author (Year), or [1].
e.g.
Is there any setting (e.g. style_or_domain, model, threshold, etc.) that can be adjusted to better deal with this scenario?
I tried spaCy and NLTK, and neither of them works, so I still hope this kind of tool can provide a good solution.