huggingface / speechbox

An idea for enhancing punctuation restoration for non-space separated languages #8

Open jumon opened 1 year ago

jumon commented 1 year ago

Hello! Thank you for creating such a nice project. I was pleasantly surprised to see that you also use Whisper for punctuation restoration. I had the same idea, and my implementation is available at https://github.com/jumon/whisper-punctuator. In fact, your implementation looks nicer than mine 😆

I have one small suggestion for improvement. I noticed that your code splits words by spaces and only inserts punctuation marks after words. This approach may not work well for languages such as Japanese and Chinese, which do not use spaces to indicate word boundaries. It may be beneficial to allow punctuation insertion at other locations or to make it an optional feature.
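
To illustrate the difference, here is a minimal sketch in plain Python. It does not use speechbox's actual code; `candidate_positions` and its `space_separated` flag are hypothetical names made up for this example. It only lists the character offsets at which a punctuation mark could be inserted under each strategy.

```python
def candidate_positions(text: str, space_separated: bool = True) -> list[int]:
    """Return character offsets after which punctuation could be inserted.

    Hypothetical helper for illustration only; not part of speechbox.
    """
    if space_separated:
        # Word-based strategy: only the end of each space-delimited
        # word is a candidate position.
        positions = []
        offset = 0
        for word in text.split(" "):
            offset += len(word)
            if word:  # skip empty strings produced by repeated spaces
                positions.append(offset)
            offset += 1  # account for the separating space
        return positions
    # Character-based strategy: every non-whitespace character
    # boundary is a candidate position.
    return [i + 1 for i, ch in enumerate(text) if not ch.isspace()]


english = "hello world"
japanese = "今日は晴れ"

print(candidate_positions(english))                          # word ends only
print(candidate_positions(japanese))                         # whole string is one "word"
print(candidate_positions(japanese, space_separated=False))  # every character boundary
```

With the word-based strategy, the Japanese string is treated as a single word, so the only candidate is the very end of the text. The character-based strategy exposes every boundary, at the cost of more positions to score.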

Thank you!

patrickvonplaten commented 1 year ago

Oh yeah, good point! It'd be really nice to support punctuation restoration for more than just English. More than happy to review a PR :-)