The current text tokenisation implementation is naive and does not cover all cases. A good tokenisation library should be able to generate all possible text tokens for inputs such as currency amounts, dates, numbers, and symbols.
For example:
In 1996, 1996 people sent emails at someone @ example . com at 1:30 PM.
In nineteen ninety six, one thousand nine hundred and ninety six people sent emails at someone at example dot com at one thirty p m
and all the alternative versions.
The library needs to be integrated into the subtitle parser (srtparser.h).
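
As a rough illustration of the number case in the example above, here is a minimal sketch that expands a digit-only token into its spoken alternatives. All names here (`belowHundred`, `cardinal`, `expandNumberToken`) are hypothetical and do not exist in srtparser.h. The sketch assumes tokens of at most four digits, and it only adds the year-style reading when the last two digits are 10 or more.

```cpp
#include <string>
#include <vector>

// Hypothetical helpers for illustration only; not part of srtparser.h.
const char* const kOnes[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
const char* const kTens[] = {
    "", "", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety"};

// Spell out a value in the range 0..99.
std::string belowHundred(int n) {
    if (n < 20) return kOnes[n];
    std::string out = kTens[n / 10];
    if (n % 10 != 0) out += std::string(" ") + kOnes[n % 10];
    return out;
}

// Spell out a value in the range 0..9999 as a cardinal number,
// inserting "and" before the last two digits.
std::string cardinal(int n) {
    if (n < 100) return belowHundred(n);
    std::string out;
    if (n >= 1000) {
        out = std::string(kOnes[n / 1000]) + " thousand";
        n %= 1000;
    }
    if (n >= 100) {
        if (!out.empty()) out += " ";
        out += std::string(kOnes[n / 100]) + " hundred";
        n %= 100;
    }
    if (n > 0) out += " and " + belowHundred(n);
    return out;
}

// Return the spoken alternatives for a token of one to four digits.
// Any other token is returned unchanged as the only alternative.
std::vector<std::string> expandNumberToken(const std::string& token) {
    if (token.empty() || token.size() > 4) return {token};
    int n = 0;
    for (char c : token) {
        if (c < '0' || c > '9') return {token};
        n = n * 10 + (c - '0');
    }
    std::vector<std::string> alts = {cardinal(n)};
    // Year-style reading: split the number into two two-digit halves.
    if (n >= 1000 && n % 100 >= 10)
        alts.push_back(belowHundred(n / 100) + " " + belowHundred(n % 100));
    return alts;
}
```

Calling `expandNumberToken("1996")` yields both readings from the example: the cardinal form and the year form. A full library would need similar expanders for currency, times, e-mail addresses, and symbols, and would have to combine the alternatives across all the tokens in a subtitle line.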