saurabhshri / CCAligner

🔮 Word by word audio subtitle synchronisation tool and API. Developed under GSoC 2017 with CCExtractor.

Find and integrate a text tokenisation library. #7

Open saurabhshri opened 7 years ago

saurabhshri commented 7 years ago

The current text tokenisation implementation is naive and misses many cases. A good tokenisation library should be able to expand every kind of non-word token (currency, dates, numbers, symbols, etc.) into its spoken-word forms.

For example, the input

In 1996, 1996 people sent emails at someone@example.com at 1:30 PM.

should be expanded to

In nineteen ninety six, one thousand nine hundred and ninety six people sent emails at someone at example dot com at one thirty p m

along with all the alternative readings.

The library needs to be integrated into the subtitle parser (srtparser.h).
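
To make the expected expansion concrete, here is a minimal C++ sketch of just the number case from the example above. It is only an illustration of the idea, not the current CCAligner code and not the API of any library: the names `cardinalWords`, `yearWords` and `numberAlternatives` are hypothetical, and the sketch assumes English output and inputs in the range 0 to 9999.

```cpp
#include <string>
#include <vector>

// Hypothetical helpers for illustration only; not part of srtparser.h.
namespace {
const char* const kOnes[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
const char* const kTens[] = {
    "", "", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety"};

// Spell out a value in the range 0..99.
std::string belowHundred(int n) {
    if (n < 20) return kOnes[n];
    std::string out = kTens[n / 10];
    if (n % 10 != 0) out += std::string(" ") + kOnes[n % 10];
    return out;
}
}  // namespace

// Cardinal reading for 0..9999 (assumed range).
std::string cardinalWords(int n) {
    if (n < 100) return belowHundred(n);
    std::string out;
    if (n >= 1000) {
        out += std::string(kOnes[n / 1000]) + " thousand";
        n %= 1000;
    }
    if (n >= 100) {
        if (!out.empty()) out += " ";
        out += std::string(kOnes[n / 100]) + " hundred";
        n %= 100;
    }
    if (n > 0) out += " and " + belowHundred(n);
    return out;
}

// Year-style reading: split a four-digit number into two pairs.
// Falls back to the cardinal reading when the pair split does not
// read naturally (fewer than four digits, or last pair below ten).
std::string yearWords(int n) {
    if (n < 1000 || n % 100 < 10) return cardinalWords(n);
    return belowHundred(n / 100) + " " + belowHundred(n % 100);
}

// Collect the distinct readings so an aligner could try each of them.
std::vector<std::string> numberAlternatives(int n) {
    std::vector<std::string> readings{cardinalWords(n)};
    const std::string year = yearWords(n);
    if (year != readings.front()) readings.push_back(year);
    return readings;
}
```

With this sketch, `numberAlternatives(1996)` yields both readings used in the example: "one thousand nine hundred and ninety six" and "nineteen ninety six". Currency, times, e-mail addresses and symbols would each need their own rules, which is the argument for using an existing library rather than growing helpers like these.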

nshmyrev commented 7 years ago

https://github.com/google/sparrowhawk

saurabhshri commented 7 years ago

@nshmyrev Thanks! That looks really nice! :)