Full stops after numbers unnoticed, extra ones predicted

oliverguhr / deepmultilingualpunctuation

A python package for deep multilingual punctuation prediction.

MIT License

Hi and thanks a lot for the great tool!

It seems that the original punctuation removal step intentionally keeps punctuation that is attached to numbers, perhaps because of decimal points or the way ordinal numbers are written in some languages.

This, however, results in an extra full stop being predicted when a number is at the end of a sentence: 'The Answer to the Ultimate Question of Life, the Universe, and Everything is 42.' becomes 'The Answer to the Ultimate Question of Life, the Universe, and Everything is 42..'

I'm not sure what an elegant solution would be, since the punctuation-stripping regex can't tell ordinal marks apart from sentence-final full stops. It would be nice to trust the LM to predict all the punctuation, i.e., to remove all of it in the pre-processing step.
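
One possible middle ground, sketched below under the same assumptions: only keep a punctuation mark when it has a digit on both sides, so separators inside numbers survive but a trailing full stop does not. The `STRICT_STRIP` pattern and the `strip_punctuation_strict` helper are hypothetical names for illustration, not part of the package.

```python
import re

# Hypothetical stricter rule: remove a punctuation mark unless it is
# enclosed by digits on BOTH sides (e.g. "3.14" or "1,000").
STRICT_STRIP = re.compile(r"(?<!\d)[.,;:!?]|[.,;:!?](?!\d)")

def strip_punctuation_strict(text):
    """Remove punctuation except separators inside numbers, then split."""
    return STRICT_STRIP.sub("", text).split()

print(strip_punctuation_strict("Everything is 42."))
print(strip_punctuation_strict("Pi is 3.14, and 1,000 is bigger."))

# Limitations: an ordinal mark such as the one in German "am 3. Mai" is
# removed as well, and a leading decimal point as in ".5" is dropped too.
print(strip_punctuation_strict("am 3. Mai"))
```

This does not solve the ordinal case; it just hands the decision about a full stop after a number back to the model, which is roughly what removing all punctuation would do for that position.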

Full stops after numbers unnoticed, extra ones predicted #9