oliverguhr / deepmultilingualpunctuation

A python package for deep multilingual punctuation prediction.
MIT License
92 stars 20 forks

Full stops after numbers unnoticed, extra ones predicted #9

Open alexdiment opened 1 year ago

alexdiment commented 1 year ago

Hi and thanks a lot for the great tool!

It seems that the original punctuation-removal step intentionally keeps punctuation inside numbers, perhaps because of decimal points or the way some languages write ordinal numbers.

This, however, results in extra punctuation being predicted when a number is at the end of a sentence: 'The Answer to the Ultimate Question of Life, the Universe, and Everything is 42.' becomes 'The Answer to the Ultimate Question of Life, the Universe, and Everything is 42..'

I'm not sure what an elegant solution would be. The punctuation-stripping regex can't tell ordinal marks apart from sentence-final full stops. It would be nice to trust the LM to predict all the punctuation, i.e., to remove all of it in the pre-processing step.

oliverguhr commented 1 year ago

Good catch @alexdiment.

The issue is that the model cannot tell whether 123 should be 1.23, 12.3 or 123. I wanted to avoid the case where the model messes with decimal points. I would suggest a post-processing step that ignores punctuation markers from the model if they are already present in the text.

It's a rather small improvement, but I won't have time to implement it any time soon. So if someone could help out, it would be greatly appreciated.
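
The suggested post-processing step could look roughly like the sketch below. It assumes the model output is a list of `(word, label, score)` entries in which the label `"0"` means "no punctuation". That shape, the label set, and the name `merge_predictions` are assumptions for illustration, not a confirmed part of the package API.

```python
# Labels assumed to denote a predicted punctuation mark.
PREDICTED_MARKS = set(".,?-:")
# Characters that count as "punctuation already present" at the end of a word.
EXISTING_MARKS = set(".,;:!?")

def merge_predictions(predictions):
    """Join words and predicted marks, skipping a mark when the word already ends in one."""
    parts = []
    for word, label, _score in predictions:
        already_punctuated = word[-1:] in EXISTING_MARKS
        if label in PREDICTED_MARKS and not already_punctuated:
            parts.append(word + label)
        else:
            # Either no mark was predicted, or the text already carries one.
            parts.append(word)
    return " ".join(parts)

example = [("is", "0", 0.99), ("42.", ".", 0.97)]
print(merge_predictions(example))
```

This variant always trusts the punctuation that survived pre-processing over the model's label, so a predicted `?` after `42.` would also be dropped. Whether that is the desired policy is a design choice left open here.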