ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

data.py is generating tiny files from my training data #7

Open WaelSalloum opened 7 years ago

WaelSalloum commented 7 years ago

Hi Ottokart,

I'm playing with punctuator2 in a setting other than punctuation restoration: I'm using it to predict long-distance tokens. For example, the input is a document (around 400 words) and I'm trying to predict one or more tokens inside that document (not actually punctuation), say a paragraph ending. I don't know whether punctuator2 will generalize to such tasks, so please let me know if there are any restrictions specific to punctuation restoration.

Everything is fine when I use it to restore punctuation, and it performs exceptionally well, but when I use it for this unusual task, the data.py script produces tiny files from my training data (remember, most of the time there is only one "punctuation" token in a given ~400-word line representing a document):

$ du -sch data/*
4.0K    dev
4.0K    test
16K train
152K    vocabulary

while my text files are:

$ wc *
     180   139630   852376 ep.dev.txt
     180    62584   371250 ep.test.txt
    4588  2007796 12097309 ep.train.txt
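
For context, this is roughly how I turn each document into a single training line: I strip the real punctuation and insert one pseudo-punctuation token (the §PARA name is my own choice, not part of punctuator2) at the paragraph ending I want predicted. A minimal sketch, assuming each document arrives as a list of paragraph strings:

    import re

    def document_to_training_line(paragraphs):
        # paragraphs: list of raw paragraph strings that make up one document
        tokens = []
        for i, paragraph in enumerate(paragraphs):
            # drop real punctuation so the only "punctuation" left is my marker
            words = re.sub(r"[^\w\s]", " ", paragraph.lower()).split()
            tokens.extend(words)
            if i < len(paragraphs) - 1:
                tokens.append("§PARA")  # the single label in a ~400-word line
        return " ".join(tokens)

    # e.g. document_to_training_line(["First paragraph.", "Second one."])
    # -> "first paragraph §PARA second one"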

This is the error I get when I train:

$ python main.py ep 4 0.02
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN not available)
4 0.02 Model_ep_h4_lr0.02.pcl
Loading previous model state
Number of parameters is 84008
Training...
Total number of training labels: 1029
WARNING: Not enough samples in '../data/dev'. Reduce mini-batch size to 0 or use a dataset with at least 1050 words.
Total number of validation labels: 0
Traceback (most recent call last):
  File "main.py", line 182, in <module>
    ppl = np.exp(total_neg_log_likelihood / total_num_output_samples)
ZeroDivisionError: division by zero
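
The immediate crash is just the perplexity computation dividing by the number of validation samples, which is zero here. A guard like the following (a minimal sketch; the variable names are taken from the traceback and the surrounding main.py code is assumed) would avoid the exception, though it obviously wouldn't fix whatever makes the generated data so small:

    # sketch of a guard around main.py's perplexity line (names from the traceback above)
    if total_num_output_samples > 0:
        ppl = np.exp(total_neg_log_likelihood / total_num_output_samples)
    else:
        ppl = float("inf")  # no validation labels, so perplexity is undefined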

I realize that my data is small, but the tiny data files generated by data.py suggest that text is being omitted. I wonder whether there is another reason for this issue, related to the code not being designed for such tasks. Would any of the following solve my problem?

  1. Use pre-trained vectors.
  2. Collect more training data.
  3. Modify parts of the code (if it's a quick fix).

If none of these would work, no problem; I'll build my own architecture.

Thank you. Wael

ottokart commented 7 years ago

Hi!

Actually, there is a warning that hints at what's wrong: WARNING: Not enough samples in '../data/dev'. Reduce mini-batch size to 0 or use a dataset with at least 1050 words.

So the solution would be to collect much, much more training data (tens of millions of words at least).

Although your task seems a little more difficult than punctuation restoration, I see no reason why it shouldn't work at least on some level if you have enough training data.

PS! In your example you have set the hidden layer size to 4, which probably makes the model way too small (try 128 or 256). PPS! Since your sequence length is quite large (400), you might run into out-of-memory errors on some GPUs. If that happens, you can reduce the minibatch size from 128 to 64, for example.
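
For example, keeping your model name and learning rate and only raising the hidden layer size (the arguments are model name, hidden layer size, learning rate, as in your own command above):

    $ python main.py ep 256 0.02

The minibatch size is not a command-line argument; if I recall the current code correctly, it is a constant defined in main.py.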

Best, Ottokar