rkfg / gpt-2

Code for the paper "Language Models are Unsupervised Multitask Learners"
MIT License

Creating dictionary files #4

Closed. ZheMann closed this issue 5 years ago.

ZheMann commented 5 years ago

Right now I'm executing createspmodel.sh with a text file containing all Dutch-language books from Project Gutenberg to generate the dictionary files. Do you think this is sufficient? Or should I also use a Wikipedia scraper, for example, to extend the amount of text used for creating the dictionary files?

To me it seems like 'the more data, the better' when initialising the vocabulary files. @rkfg, could you give your opinion on this?

rkfg commented 5 years ago

I don't think the dictionary quality will improve significantly if you use more data. I even limited the number of lines to process because SentencePiece itself recommends that. The lines taken for processing are sampled randomly, so it won't just take the first million lines of your file.
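For reference, this is roughly what those sampling options look like in spm_train; the corpus file name and the exact values here are placeholders rather than what createspmodel.sh actually uses:

```sh
# Sample at most 1M lines from the corpus, shuffled instead of taken from the top.
# File name and sizes are placeholders; check createspmodel.sh for the real values.
spm_train \
  --input=dutch_corpus.txt \
  --model_prefix=sp \
  --vocab_size=30000 \
  --input_sentence_size=1000000 \
  --shuffle_input_sentence=true
```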

To get the most efficient dictionary you should include the most common phrases in it, not just use a lot of data. Any words that don't make it into the dictionary will be encoded as individual characters. Then again, if a word is so rare that it can't be encoded with 2-3 tokens, it's unlikely the model will ever use it in the future anyway.
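If you want to see how a given word actually gets segmented by your trained model, spm_encode prints the pieces directly (the model file and example words below are just illustrations):

```sh
# A common word should come out as one or two pieces; a very rare one
# falls back to many short pieces or single characters.
echo "fiets regeringsverantwoordelijkheid" | spm_encode --model=sp.model --output_format=piece
```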

ZheMann commented 5 years ago

Thank you for the clear answer. One quick question: do I have to determine the optimal vocab_size by trial and error? I've searched through the entire sentencepiece repository but could not find any straightforward guidelines regarding this.

rkfg commented 5 years ago

Probably, yes. I can't find it now, but I remember that some paper mentioned around 30-50k tokens for this type of encoding. It depends on the language, and this whole field is pretty much intuition-driven (from what I've read and seen). You can't calculate the optimal network architecture or number of tokens; the networks are so big that the only method is trial, error, rinse and repeat.
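If you do sweep vocab_size, a loop like this keeps the runs comparable and gives a rough pieces-per-word number for each model; the corpus and held-out sample files are placeholders:

```sh
# Train one model per candidate vocabulary size, everything else fixed,
# then check how many pieces a held-out sample needs on average.
for size in 10000 30000 50000; do
  spm_train --input=dutch_corpus.txt --model_prefix=sp_${size} --vocab_size=${size}
  pieces=$(spm_encode --model=sp_${size}.model < sample.txt | wc -w)
  words=$(wc -w < sample.txt)
  echo "vocab=${size} pieces/word=$(echo "scale=2; ${pieces}/${words}" | bc)"
done
```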

Here are a couple of easy-to-read papers to get you started: https://arxiv.org/pdf/1808.06226.pdf and https://arxiv.org/pdf/1508.07909.pdf

They contain no math (I just don't get complex math tbh) and everything else is mostly common sense and logic.

I personally tried vocabularies with 10k and 50k tokens; surprisingly, the 10k model converged faster and the resulting loss was much lower (around 3.5 compared to 4+ for the 50k model). But the output was still not impressive, and maybe the 50k model has more potential to improve over time. It all requires a lot of experimentation.

Also, one thing to remember: your data size (in tokens) must be a lot bigger than your model size. Otherwise it will just memorize the corpus and produce garbage on arbitrary input. I used a huge dump of Russian books containing zipped fb2 files, more than 400 GB in total. Of course, there are many duplicates and not all of the books are in Russian, so I did some filtering first and ended up with a corpus of around 10 GB or so. To fully sample it (the train script selects random lines, not sequential ones), my system would need about 6 days.
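As a rough sanity check on that point, you can count how many tokens your encoded corpus contains and compare that with the parameter count of the model you plan to train (about 117M for the smallest GPT-2); the file names here are placeholders:

```sh
# Count the SentencePiece tokens in the encoded corpus; you want this number
# to be much larger than the model's parameter count (~117M for small GPT-2).
spm_encode --model=sp.model < corpus.txt | wc -w
```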

ZheMann commented 5 years ago

Dude, you're awesome! Thanks for the valuable information, I will definitely study the papers you mentioned. I will try different values for the vocab_size and see what happens.

However, after you mentioned this:

Also, one thing to remember: your data size (in tokens) must be a lot bigger than your model size. Otherwise it will just memorize the corpus and produce garbage on arbitrary input. I used a huge dump of Russian books containing zipped fb2 files, more than 400 GB in total. Of course, there are many duplicates and not all of the books are in Russian, so I did some filtering first and ended up with a corpus of around 10 GB or so. To fully sample it (the train script selects random lines, not sequential ones), my system would need about 6 days.

I just realised the biggest challenge will be finding a sufficient amount of text written in Dutch, as the total size of all Dutch books on Gutenberg.org is less than 100 MB.

Anyway, things are starting to become clearer and clearer to me now.

Many thanks again.

rkfg commented 5 years ago

Yeah, that corpus is way too small. You could try translating books with Google for starters, or find other sources (you don't expect me to have bought 400 GB of compressed books, and I don't think you can find that many in the public domain, so...). The whole point of a neural network is to lossily "compress" the data into its internal structure so it can find patterns in it. You require it to correctly predict the next token based on the previous tokens, and it should be able to do that across far more distinct lines of text than it could ever store internally. If your data can be stored "as is" because the model size allows it, it isn't forced to optimize itself and find the patterns, so it doesn't learn at all; it just memorizes.

ZheMann commented 5 years ago

As this issue is still 'Open', I guess this is a good place to ask the following question:

If you replace the newline character with a custom token like <|n|>, you end up with one very large sentence, right? How did this work for you when generating the dictionary files? Because right now it says I only have one sentence, which is (obviously) too large to process. However, I defined <|n|> in user_defined_symbols, so I expected SentencePiece to cut the large sentence back into the original sentences based on <|n|> for further processing.

rkfg commented 5 years ago

As far as I remember, the script doesn't replace the newlines but inserts that token before them, so the sentences stay short enough. Take a look at concat.sh.
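A minimal sketch of that idea (not necessarily what concat.sh actually does): keep the real newlines so SentencePiece still sees one short sentence per line, only append the marker token, and declare it as a user-defined symbol at training time:

```sh
# Append <|n|> at the end of every line, keeping the newline itself.
sed 's/$/<|n|>/' input.txt > corpus_with_markers.txt
# Keep <|n|> as a single token when training the dictionary.
spm_train --input=corpus_with_markers.txt --model_prefix=sp --vocab_size=30000 \
  --user_defined_symbols='<|n|>'
```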