ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

Punctuation based on part-of-speech tags? #4

Open colinskow opened 7 years ago

colinskow commented 7 years ago

This is a great project! I'm working on some automatic transcription software. All the speech recognition engines I've looked at produce a straight stream of words, and I haven't come across anything that can intelligently break the stream into sentences and insert punctuation.

This is the most advanced project I have come across so far.

It seems to me, though, that punctuation rules are based almost entirely on part-of-speech tags, and that the actual words may be unnecessary information.

Before I stumbled across this project, my idea was to run text through a part-of-speech tagger (such as this one) and feed the tag sequence through an LSTM neural net to predict punctuation.

This may allow you to get more accurate results with much less training data. Have you tried this approach at all?
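
Roughly what I'm picturing, as a toy Keras sketch (every size and name here is a placeholder I made up for illustration, not anything from an existing implementation):

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

N_TAGS = 45    # Penn Treebank POS tag set
N_PUNCT = 5    # e.g. none, comma, period, question mark, colon
SEQ_LEN = 50   # words of context per training example

# POS tag IDs in, one punctuation decision per slot out.
model = Sequential([
    Embedding(N_TAGS, 32, input_length=SEQ_LEN),
    LSTM(128, return_sequences=True),
    TimeDistributed(Dense(N_PUNCT, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```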

The other idea I've had is to use a constituency tree. I've played around with Microsoft's tree and it may be possible to achieve this using strict if-then rules with no neural net involved.

ottokart commented 7 years ago

Thanks, I'm glad you have found this project useful! I have not tried a POS-tag-based model, but I would be very interested in the results if you happen to try it. The constituency tree based approach also sounds interesting.

There are several reasons why I did not try using POS features (alone or together with words):

  1. This model is trained and works on unsegmented text, but POS taggers tend to be most accurate when dealing with sentences.
  2. Not relying on a POS tagger makes this model usable on a wider range of languages, where a good POS tagger might not be available.
  3. I have a personal preference for single-model simple solutions.

There's actually some previous work that uses POS tags among other features. Some of the most recent ones are:

colinskow commented 7 years ago

Hi Ottokar,

Thank you very much for your reply. I'm a complete machine learning newbie and am just getting my feet wet. I'm experienced with JavaScript web development, but not so much with Python.

I'm going to try training your project, but with the actual words replaced by POS tags. The total vocabulary is only about 45 tags, which should be detailed enough to support punctuation decisions without the words themselves.
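
The conversion itself could be as simple as something like this (spaCy as the tagger purely for illustration; the file names are made up, and the real pipeline would still go through your preprocessing):

```python
import spacy

# Only the tagger is needed; skip the parser and NER for speed.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

with open("corpus.txt") as fin, open("corpus.pos.txt", "w") as fout:
    for line in fin:
        doc = nlp(line.strip())
        # Keep punctuation marks intact (they are the prediction targets);
        # swap every other token for its fine-grained Penn Treebank tag.
        fout.write(" ".join(t.text if t.is_punct else t.tag_ for t in doc) + "\n")
```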

I'll also test how much part-of-speech tagging error is introduced by running several sentences together. My thinking is that even if there are errors, as long as they are consistent, the neural net should have enough context to pick up the patterns.
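
The consistency check I have in mind looks roughly like this (spaCy again as an example tagger; the sentences are dummies):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

sentences = [
    "The committee approved the proposal.",
    "It will take effect next year.",
]

# Tags when each sentence is processed on its own, punctuation dropped.
clean = [[t for t in nlp(s) if not t.is_punct] for s in sentences]
gold = [t.tag_ for sent in clean for t in sent]

# Tags when the same words are run together with no punctuation or
# casing, simulating raw speech recognizer output.
stream = " ".join(t.text for sent in clean for t in sent).lower()
pred = [t.tag_ for t in nlp(stream)]

n = min(len(gold), len(pred))
print("tag agreement on the run-together stream: %.1f%%"
      % (100.0 * sum(g == p for g, p in zip(gold, pred)) / n))
```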

Does your project look at words ahead of the point where it makes punctuation decisions? Playing your punctuation game as a human, I've noticed that I need some context beyond the decision point to make a call.

How many words does your punctuator consider at each time step when making a decision?

ottokart commented 7 years ago

The model does look into the future. What you see at http://bark.phon.ioc.ee/punctuator/game is basically exactly what the model sees as well (except that the model sees numbers as NUM tokens and rare words as UNK). So for the first punctuation slot it sees a context of 1 word before and 49 words after, for the second it sees 2 words before and 48 after, and so on.
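
In other words (a tiny illustration of the sliding context, not the actual code):

```python
SEQ_LEN = 50  # words of context per window

def slot_contexts(window):
    # Slot i sits after word i, so the model sees i + 1 words of left
    # context and len(window) - (i + 1) words of right context.
    for i in range(len(window) - 1):
        yield i + 1, len(window) - (i + 1)

window = ["w%d" % n for n in range(SEQ_LEN)]
for before, after in list(slot_contexts(window))[:3]:
    print(before, "before,", after, "after")
# 1 before, 49 after
# 2 before, 48 after
# 3 before, 47 after
```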

I think you can reuse the data.py script (maybe with tiny modifications and simplifications in the preprocessing) by just replacing the words in the text with their corresponding POS tags.

colinskow commented 7 years ago

Hi Ottokar,

I finally got around to testing POS tags on Europarl. What I did was limit the vocabulary to the 1,000 most common words and then replace all OOV words with their POS tag using spaCy.
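
The encoding step looked roughly like this (simplified; the helper names are mine):

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def build_vocab(corpus_tokens, size=1000):
    """Keep the `size` most frequent lowercased words."""
    return {w for w, _ in Counter(corpus_tokens).most_common(size)}

def encode(text, vocab):
    """In-vocabulary words stay words; every OOV word becomes its POS tag."""
    return [t.text.lower() if t.text.lower() in vocab else t.tag_
            for t in nlp(text)]
```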

Before processing text, I tokenize it and generate indexes that map the tokens back to the original space-split words. Then I predict punctuation and map it back onto the original words.
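
The mapping is straightforward as long as the tokenizer only splits words and never rewrites them, which is the assumption this sketch makes:

```python
def token_to_word_index(words, tokens):
    """For each tokenizer token, record the index of the space-split
    word it came from, so predicted punctuation can be projected back."""
    mapping, w, consumed = [], 0, 0
    for tok in tokens:
        mapping.append(w)
        consumed += len(tok)
        if consumed >= len(words[w]):  # finished this word, advance
            w, consumed = w + 1, 0
    return mapping

words = "don't stop believing".split()
tokens = ["do", "n't", "stop", "believing"]
print(token_to_word_index(words, tokens))  # [0, 0, 1, 2]
```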

I tested this on the convolutional model, since it trains 25x faster. Using POS tags increased precision very slightly but caused recall to tank, and the overall F-score came out 5 points lower. (I'm assuming it would have a similar effect on the BRNN model.)

What does work well is tokenizing the text before running it through training or prediction, since this simplifies the vocabulary by eliminating compound words, etc. In theory this should help models generalize better.

The other observation I made is that randomly initialized trainable embeddings (128d) worked much better than pretrained glove.6B.50d vectors. This is probably because trainable embeddings get optimized for predicting punctuation, whereas GloVe vectors were optimized only for predicting surrounding words.
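
For reference, the two setups I compared, sketched in Keras (the GloVe file parsing is elided; `glove` and `word_index` are placeholders):

```python
import numpy as np
from keras.layers import Embedding

VOCAB = 5000

# (a) Randomly initialised, trainable 128d embeddings -- what worked
# better for me (trainable=True is the Keras default).
emb_trainable = Embedding(VOCAB, 128)

# (b) Frozen pretrained vectors, for contrast. `glove` stands in for a
# {word: 50d vector} dict parsed from glove.6B.50d.txt, and `word_index`
# for the vocabulary's word -> id map.
glove, word_index = {}, {}
matrix = np.random.normal(scale=0.1, size=(VOCAB, 50))  # init for missing words
for word, idx in word_index.items():
    if word in glove and idx < VOCAB:
        matrix[idx] = glove[word]
emb_frozen = Embedding(VOCAB, 50, weights=[matrix], trainable=False)
```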

I'm very curious about tweaking the model and hyperparameters to improve results. I may try a hybrid convolutional-RNN when I have some spare time, and I'm especially interested in knowing what you've already tried, so I don't repeat what hasn't worked.