HTML tags in literature sentences

DSuveges commented 10 months ago

A user reported that there are leftover HTML tags in sentences. Beyond being a cosmetic problem, this apparently can affect sentence segmentation and therefore disease-to-target evidence generation.

I was wondering if we could use this ticket to discuss how difficult it would be to update the parser/sentenciser to resolve these issues, and also to measure the scope of the problem.
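
To make the scope question concrete, here is a minimal sketch of how the affected fraction could be estimated, assuming the sentences are available as plain strings. The pattern and the function names (`TAG_RE`, `has_leftover_tags`, `strip_tags`, `affected_fraction`) are hypothetical and not taken from the existing pipeline. The regular expression is only a heuristic and can misfire on text such as `a<b and c>d`.

```python
import re
from html import unescape

# Heuristic (assumed) pattern for things that look like HTML/XML tags,
# e.g. <i>, </sup>, <br/>, <a href="...">. Requires a letter right after
# "<" or "</", so "p < 0.05" is not treated as a tag.
TAG_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9:-]*(?:\s[^<>]*)?/?>")


def has_leftover_tags(sentence: str) -> bool:
    """Return True if the sentence contains something that looks like a tag."""
    return TAG_RE.search(sentence) is not None


def strip_tags(sentence: str) -> str:
    """Remove tag-like spans, decode HTML entities, and normalise whitespace."""
    text = TAG_RE.sub("", sentence)
    text = unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def affected_fraction(sentences) -> float:
    """Fraction of sentences that contain at least one tag-like span."""
    sentences = list(sentences)
    if not sentences:
        return 0.0
    return sum(has_leftover_tags(s) for s in sentences) / len(sentences)
```

Running `affected_fraction` over a sample of sentences would give a rough number for the scope, and `strip_tags` shows the kind of clean-up that would need to happen before segmentation rather than after it.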

tsantosh7 commented 10 months ago

Hi @DSuveges. I think these issues arise from the traditional sentence segmenter we are using. It has obvious problems like the one you mentioned, and it is also slow.

I have been doing some experiments in this regard and here are some updates.

The traditional sentence segmenter takes around 45 seconds to process one article, compared to the scispaCy sentence segmenter, which takes 5 seconds.
I believe moving to the scispaCy sentence segmenter would solve this issue.
I plan to run the new models I developed, along with the new sentence segmenter, over the whole Europe PMC corpus around March next year.
This run would ensure you also receive annotation mentions from the non-open-access set.

Please let me know how serious this problem is. If you can wait until the end of March, I am hopeful I can solve these issues.

DSuveges commented 9 months ago

Thank you @tsantosh7 for addressing the issue! I don't have numbers on the impact of this issue, but it cannot be that bad, so getting an update in March is absolutely fantastic. We are looking forward to having the new sentenciser in place; your benchmarks sound super promising.

tsantosh7 commented 5 months ago

@DSuveges due to the Slurm migration at EBI, this is postponed to the end of May rather than the end of March. Sorry for the inconvenience.

prashantuniyal02 commented 1 month ago

Update from Santosh: he is working on refining the EPMC pipelines, which will fix this issue, hopefully within the next two months.

HTML tags in literature sentences #3175