ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

Last phrase punctuation #19

Open ghost opened 6 years ago

ghost commented 6 years ago

First of all, thank you for putting this project together. It's incredible and incredibly useful.

I've noticed that the last phrase or sentence in a block of text does not get any punctuation added. Do you have any advice on how to fix this?

Thank you

migueljette commented 6 years ago

Hi @tcollins590, I would personally add a "post-processing" step after applying this model to fix this type of error. If you have a file called punc.txt with the punctuation added, you could do something like: cat punc.txt | sed 's/\([a-zA-Z]\)$/\1./g' > new_punc.txt. This adds a period at the end of every line that ends with a letter (lower or upper case). It's obviously not perfect, since it always inserts a period and never a question mark or exclamation mark, but it would work in most cases.
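For anyone who would rather do this post-processing step in Python than in the shell, here is a minimal sketch of the same substitution. The function name add_final_period is made up for illustration and is not part of punctuator2. It only mirrors the sed expression above, so it has the same limitation of always choosing a period.

```python
import re


def add_final_period(text):
    """Append "." to every line that ends with an ASCII letter.

    Hypothetical helper (not part of punctuator2) that mirrors the sed
    expression s/\\([a-zA-Z]\\)$/\\1./ from the comment above.
    """
    # re.MULTILINE makes "$" match at the end of each line, as sed does.
    return re.sub(r"([a-zA-Z])$", r"\1.", text, flags=re.MULTILINE)


if __name__ == "__main__":
    sample = "hello, how are you? i am fine"
    print(add_final_period(sample))
```

Lines that already end with punctuation, a digit, or trailing whitespace are left unchanged, exactly as with the sed version.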

ottokart commented 6 years ago

Hi! I have an idea for a fix, but I will only have enough free time to implement it in a few months. The idea is to change the part of punctuator.py where the model selects the punctuation with the highest probability, p_i = np.argmax(y_t.flatten()), by adding a mask that sets the probabilities of the non-end-of-sentence punctuation marks (plus the no-punctuation class) to zero once we have reached the end of the input text: p_i = np.argmax(y_t.flatten() * eos_mask). This would force the model to choose between a period, a question mark, or an exclamation mark.
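The masking idea can be sketched in isolation as follows. This is only an illustration under stated assumptions, not the change as it would appear in punctuator.py: the class list and its order (PUNCTUATION), the set EOS_PUNCTUATION, the function pick_punctuation and the at_end_of_input flag are all hypothetical names, and the real class order must come from the project's own vocabulary.

```python
import numpy as np

# Assumed class order, for illustration only; the real order is defined
# by the vocabulary the model was trained with.
PUNCTUATION = ["_SPACE", ",COMMA", ".PERIOD", "?QUESTIONMARK",
               "!EXCLAMATIONMARK", ":COLON", ";SEMICOLON", "-DASH"]
EOS_PUNCTUATION = {".PERIOD", "?QUESTIONMARK", "!EXCLAMATIONMARK"}

# 1.0 for end-of-sentence classes, 0.0 for everything else
# (including the no-punctuation class).
eos_mask = np.array([1.0 if p in EOS_PUNCTUATION else 0.0
                     for p in PUNCTUATION])


def pick_punctuation(y_t, at_end_of_input):
    """Return the index of the chosen punctuation class.

    y_t is the model's probability vector for one position; at the end
    of the input the choice is restricted to end-of-sentence classes.
    """
    probs = np.asarray(y_t, dtype=float).flatten()
    if at_end_of_input:
        probs = probs * eos_mask
    return int(np.argmax(probs))


if __name__ == "__main__":
    y_t = [0.60, 0.20, 0.10, 0.05, 0.02, 0.01, 0.01, 0.01]
    print(PUNCTUATION[pick_punctuation(y_t, at_end_of_input=False)])
    print(PUNCTUATION[pick_punctuation(y_t, at_end_of_input=True)])
```

Because the mask multiplies probabilities rather than replacing them, the model's relative preference among period, question mark and exclamation mark is preserved. This assumes the end-of-sentence probabilities are strictly positive, which holds for a softmax output.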

rhamnett commented 5 years ago

Hello @ottokart, it would be great if you could implement this. Thanks