ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

Words Splitting Automatically #34

Open shavakagrawal opened 6 years ago

shavakagrawal commented 6 years ago

Hi Ottokar, I have been running the model, and it is splitting the words "gonna" and "wanna" into "gon na" and "wan na". I can't figure out the rationale behind this. Could you help me understand it? Thanks!

ottokart commented 6 years ago

Hi!

It's probably the NLTK tokenizer, which is used in two scripts:

demo_play_with_model.py
https://github.com/ottokart/punctuator2/blob/5161946e0fdc144a607db4eaa4ef968e8f6e3d77/demo_play_with_model.py

example/dont_run_me_run_the_other_script_instead.py
https://github.com/ottokart/punctuator2/blob/5161946e0fdc144a607db4eaa4ef968e8f6e3d77/example/dont_run_me_run_the_other_script_instead.py

NLTK's word_tokenize follows Penn Treebank conventions, which split contractions such as "gonna" into "gon" and "na".

To fix that, you can modify the untokenizer (this should work for both scripts). Change:

    untokenizer = lambda text: text.replace(" '", "'").replace(" n't", "n't").replace("can not", "cannot")

to:

    untokenizer = lambda text: text.replace(" '", "'").replace(" n't", "n't").replace("can not", "cannot").replace("gon na", "gonna").replace("wan na", "wanna")

Or change:

    from nltk.tokenize import word_tokenize

to:

    word_tokenize = lambda x: x.split()
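For anyone who wants to try the two workarounds outside the scripts, here is a minimal standalone sketch. It does not import NLTK; the sample strings are made up to resemble tokenizer output, and the names `untokenize`, `untokenize_strict`, and `whitespace_tokenize` are illustrative, not names from the repository. The regex variant is an extra assumption of mine, not something the scripts contain.

```python
import re


def untokenize(text):
    """Rejoin pieces that a Treebank-style tokenizer split apart."""
    return (
        text.replace(" '", "'")
        .replace(" n't", "n't")
        .replace("can not", "cannot")
        .replace("gon na", "gonna")
        .replace("wan na", "wanna")
    )


def untokenize_strict(text):
    """Same idea, but only rejoin 'gon na' / 'wan na' at word boundaries."""
    text = text.replace(" '", "'").replace(" n't", "n't").replace("can not", "cannot")
    return re.sub(r"\b(gon|wan) na\b", r"\1na", text)


def whitespace_tokenize(text):
    """Alternative workaround: split on whitespace only, so words stay whole."""
    return text.split()


if __name__ == "__main__":
    sample = "i 'm gon na go but i do n't wan na"  # hypothetical tokenized text
    print(untokenize(sample))
    print(whitespace_tokenize("i'm gonna go"))
```

One caveat with the plain `str.replace` chain: it also matches inside longer words, so "dragon nap" would become "dragonnap". The word-boundary version avoids that, at the cost of a regex.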

On Fri, 24 Aug 2018 at 14:00, Shavak Agrawal notifications@github.com wrote:

Hi Ottokar, I have been running the model, and somehow, the model is splitting the words "gonna" and "wanna" into "gon na" and "wan na". I am unable to figure out the rationale behind this! Please help me understand the same. Thanks!

shavakagrawal commented 6 years ago

Thanks a lot! Shall I raise a pull request with that?

ninjakx commented 3 years ago

Is it fixed now?