Open WaelSalloum opened 7 years ago
Hi!
Actually, there's even a warning that hints at what's wrong: WARNING: Not enough samples in '../data/dev'. Reduce mini-batch size to 0 or use a dataset with at least 1050 words.
So, the solution would be to collect much-much more training data (tens of millions of words at least).
Although your task seems a bit more difficult than punctuation restoration, I see no reason why it shouldn't work at least on some level if you have enough training data.
PS! In your example you have set the hidden layer size to 4. This probably makes the model way too small (try 128 or 256). PPS! Since your sequence length is quite large (400), you might run into out-of-memory errors on some GPUs. If that happens, you can reduce the minibatch size from 128 to 64, for example.
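To see why halving the minibatch helps with out-of-memory errors, here is a rough back-of-envelope sketch of how activation memory scales with the settings mentioned above. The function name and the one-float-per-unit assumption are mine, not punctuator2's actual internals; real recurrent layers store several such tensors, so the true figure is a small multiple of this estimate.

```python
def activation_bytes(batch_size, seq_len, hidden_size, bytes_per_float=4):
    # Rough estimate: one float per hidden unit, per timestep, per example.
    # Real GRU/LSTM layers keep several such tensors (gates, cell states),
    # so multiply by a small constant for a realistic figure.
    return batch_size * seq_len * hidden_size * bytes_per_float

# Settings discussed in the thread: sequence length 400, hidden size 256.
full = activation_bytes(128, 400, 256)  # minibatch 128
half = activation_bytes(64, 400, 256)   # minibatch 64
print(half * 2 == full)  # halving the batch halves activation memory → True
```

Memory here is linear in batch size, so dropping from 128 to 64 directly halves this dominant term, while hidden size and sequence length each scale it linearly too.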
Best, Ottokar
Hi Ottokart,
I'm playing with punctuator2 in a different setting than punctuation restoration: I'm using it to predict long-distance tokens. For example, the input is a document (around 400 words) and I'm trying to predict one or more tokens inside that document that are not actually punctuation, say a paragraph ending. I don't know whether punctuator2 will generalize to such tasks, so please let me know if there are any restrictions specific to punctuation restoration. Everything is fine when I use it to predict punctuation, and it performs exceptionally well, but when I use it for this unusual task, the data.py script produces tiny files from my training data (remember, most of the time there is only one "punctuation" token in a given ~400-word line representing a document):
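For context, one way to cast this task into punctuator2-style training text is to replace each paragraph boundary with a pseudo-punctuation token. The token name `.PARAGRAPH` and the helper below are my own illustration of the setup described above, not anything the tool mandates:

```python
def document_to_training_line(paragraphs, boundary_token=".PARAGRAPH"):
    # Join a document's paragraphs into one long line of words, inserting a
    # pseudo-punctuation token at every paragraph boundary, in the spirit of
    # how punctuator2 training data marks periods and commas after words.
    words = []
    for i, para in enumerate(paragraphs):
        words.extend(para.split())
        if i < len(paragraphs) - 1:
            words.append(boundary_token)
    return " ".join(words)

doc = ["first paragraph words here", "second paragraph words"]
print(document_to_training_line(doc))
# → first paragraph words here .PARAGRAPH second paragraph words
```

The result of this encoding is exactly the sparse-label situation described: a ~400-word line typically contains only one or two target tokens, versus one every sentence or so in ordinary punctuation-restoration data.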
Where my text files are:
This is the error I get when I train:
I realize that my data is small, but the tiny data files generated by data.py suggest that text is being omitted. I wonder if there is another reason for this issue, related to the code not being designed for such tasks. Would doing any of the following solve my problem:
If none of these will work, no problem, I'll build my own architecture.
Thank you. Wael