Open DSuveges opened 10 months ago
Hi @DSuveges. I think these issues arise from the traditional sentence segmenter we are using. It has obvious problems like the one you mentioned, and its time complexity is also a concern.
I have been doing some experiments in this regard and here are some updates.
Please let me know how serious this problem is. If you can wait until the end of March, I am hopeful I can solve these issues.
Thank you @tsantosh7 for addressing the issue! I don't have numbers on the impact of this issue, but it cannot be that bad, so getting an update in March is absolutely fantastic. We are looking forward to having the new sentenciser in place; your benchmarks sound super promising.
@DSuveges due to the Slurm migration at EBI, this is postponed to the end of May rather than the end of March. Sorry for the inconvenience.
Update from Santosh: he is refining the EPMC pipelines, which will fix this issue, hopefully within the next two months.
A user reported that there are leftover HTML tags in sentences. Beyond being a cosmetic problem, this behaviour can apparently affect sentence segmentation and therefore disease-to-target evidence generation.
I was wondering if we could use this ticket to discuss how difficult it would be to update the parser/sentenciser to resolve these issues, and also to measure the scope of the problem.
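To make the scoping question concrete, here is a rough sketch of how the share of affected sentences could be estimated, and how tags might be stripped before segmentation. This is not the pipeline's code: the regex `TAG_RE` and the helpers `strip_tags`, `has_leftover_tag` and `affected_fraction` are hypothetical names, and the sketch assumes the sentences are available as plain Python strings. It only uses the Python standard library.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern for tag-like residue such as <i>, </sup> or <br/>.
# It requires a letter right after "<", so "p < 0.05" is not flagged.
TAG_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?/?>")


class _TextExtractor(HTMLParser):
    """Collects only the text content and drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(text):
    """Return text with HTML tags removed (entities are decoded as well)."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


def has_leftover_tag(sentence):
    """True if the sentence still contains something that looks like a tag."""
    return TAG_RE.search(sentence) is not None


def affected_fraction(sentences):
    """Fraction of sentences that contain tag-like residue."""
    if not sentences:
        return 0.0
    return sum(has_leftover_tag(s) for s in sentences) / len(sentences)
```

Running `affected_fraction` over a sample of stored sentences would give a first estimate of the scope, and applying `strip_tags` to the text before it reaches the segmenter is one possible mitigation. The regex is a heuristic and may flag non-HTML text that happens to look like a tag.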