ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

We see spaces inserted that break words in the original text #48

Open gsunsnackv opened 4 years ago

gsunsnackv commented 4 years ago

A small example: when I run the following on http://bark.phon.ioc.ee/punctuator

how's everybody doing I'm Dalton I'm a partner at Y Combinator in addition I'm the head of Admissions which is our selection process but the companies that get into YC I am here to talk about pivoting yeah let's talk all about pivoting cool all right here are some stuff we're gonna cover what the heck is a pivot why you should pivot when you should pivot and evaluating ideas to pivot to so we're gonna try to cover all the bases here

The result we got is:

How'S everybody doing I'm Dalton, I'm a partner at Y Combinator. In addition, I'm the head of Admissions, which is our selection process, but the companies that get into YC. I am here to talk about pivoting yeah. Let'S talk all about pivoting cool. All right here are some stuff we're gon na cover. What the heck is a pivot, why you should pivot when you should pivot and evaluating ideas to pivot to so we're gon na try to cover all the bases here.

Every occurrence of the word "gonna" got broken into "gon na". Is there a config option that would prevent this from happening?

sontung commented 4 years ago

The API for interacting with the web server is hidden. There is probably a bug in the post-processing. Did you try running the model locally?

gsunsnackv commented 4 years ago

No, I have not tried it with the model locally. I will try to find some time to do that. But who owns the API? I'm trying to evaluate the project to see if I should use it, and now I have to set it up myself to evaluate it. If the API does not do what the model does locally, then it fails as a tool to showcase the project.

sontung commented 4 years ago

Locally, I see no problem with the above input. The API was implemented by the author himself and should not be public. The error you are seeing must come from the post-processing, because words are not broken down when they go into the model: the author uses a static vocabulary. Personally, I don't think you need the API for your project.
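If you do end up consuming output that has these artifacts, a small clean-up pass can undo them. The following is only a sketch of a workaround, not part of punctuator2: the function name `rejoin_split_words` and the table `SPLIT_CONTRACTIONS` are hypothetical, and the guess that the split follows the Penn Treebank tokenization convention (which separates "gonna" into "gon" and "na") is an assumption about the web demo, not something confirmed here.

```python
import re

# Hypothetical table of Treebank-style splits to undo; extend as needed.
SPLIT_CONTRACTIONS = {
    ("gon", "na"): "gonna",
    ("wan", "na"): "wanna",
    ("got", "ta"): "gotta",
}


def rejoin_split_words(text: str) -> str:
    """Undo the two artifacts reported in this issue.

    1. Rejoin words that were split in two, e.g. "gon na" -> "gonna".
    2. Lowercase a single capital letter that directly follows an
       apostrophe at the end of a word, e.g. "How'S" -> "How's".
    """
    for (left, right), joined in SPLIT_CONTRACTIONS.items():
        pattern = re.compile(r"\b%s %s\b" % (left, right), re.IGNORECASE)
        # Keep the original first character so "Gon na" becomes "Gonna".
        text = pattern.sub(lambda m, j=joined: m.group(0)[0] + j[1:], text)

    # Assumption: a lone capital after an apostrophe is a casing artifact.
    # Names such as "O'Brien" are left alone because more letters follow.
    text = re.sub(
        r"(?<=[A-Za-z])'([A-Z])\b",
        lambda m: "'" + m.group(1).lower(),
        text,
    )
    return text


if __name__ == "__main__":
    sample = "How'S everybody doing. Let'S talk. We're gon na cover it."
    print(rejoin_split_words(sample))
```

Note that this treats every "gon na" in the text as a split "gonna", which is a heuristic. It only patches the output and does not address whatever in the server's post-processing causes the split.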

gsunsnackv commented 4 years ago

Thank you @sontung . I still have not had time to run it locally myself. Too busy 😞 Will do it soon.

gsunsnackv commented 4 years ago

@sontung I played with python play_with_model.py ./Demo-Europarl-EN.pcl

And indeed, the problem does not occur there. I just need to figure out how to work with it locally, then. Thank you.