rkfg / gpt-2

Code for the paper "Language Models are Unsupervised Multitask Learners"
MIT License

Duration of encoding a dataset ~2.4GB #6

Closed ZheMann closed 5 years ago

ZheMann commented 5 years ago

Currently I have a dataset of roughly 2.4 GB in size, and I am trying to encode it in Google Colab. However, after encode.sh finished the 'encoding with spm' step, it takes forever to finish the next step, 'Loading the data and packing into encoded.npz'. It says it needs to read 236 files and the estimated remaining time is 10+ days. Is this normal for a dataset of this size? I expected the encoding to take only a couple of hours.

rkfg commented 5 years ago

No, that's not normal. It took about an hour for my bigger 22 GB set on my machine. You can try to run the script locally; it doesn't use the GPU at all, but it does require quite a bit of RAM (roughly the size of the dataset, give or take 10%). Does the progress bar even move?

ZheMann commented 5 years ago

Weird, I can use up to 12 GB of RAM within my Colaboratory notebook, which should be more than enough. The progress bar moves very, very slowly. Unfortunately, I'm having some trouble installing SentencePiece locally, so I guess I'll have to split my dataset into smaller chunks, encode them, and eventually concatenate them back into one encoded file.

rkfg commented 5 years ago

How's the CPU load when it loads the data? Maybe it spends too much time in kernel space?

ZheMann commented 5 years ago

Unfortunately, I can't see the CPU load on Colab. However, do you think this could be caused by using the same text file both for creating the dictionary files and for training the model?

rkfg commented 5 years ago

> do you think this could be caused by using the same textfile for both creating the dictionary files and to train the model on?

No, that's unlikely, as these phases are independent of each other; you prepare the dictionary before training anyway. You can indeed create multiple .npz files and then pass the directory containing them as your --dataset. It would require some small changes to encode.sh: instead of running encode.py on the whole $OUTDIR, you'd iterate over the files in $OUTDIR and run encode.py on each of them, saving the results to distinct .npz files.
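Roughly, that per-file loop could look like the sketch below. It's a toy, self-contained version: the temp directories and sample files stand in for the real $OUTDIR chunks, and a no-op stands in for the actual `python encode.py` call, whose exact arguments are an assumption.

```shell
#!/bin/sh
# Sketch of the per-chunk loop for encode.sh: encode each chunk in
# $OUTDIR into its own .npz file instead of one big archive.
OUTDIR=$(mktemp -d)   # stands in for the directory with split chunks
NPZDIR=$(mktemp -d)   # where the per-chunk .npz files would go
printf 'chunk one\n' > "$OUTDIR/part00"
printf 'chunk two\n' > "$OUTDIR/part01"

for chunk in "$OUTDIR"/*; do
    base=$(basename "$chunk")
    # The real call would be something like (arguments are assumptions):
    # python encode.py "$chunk" "$NPZDIR/$base.npz"
    : > "$NPZDIR/$base.npz"   # stand-in: just create the output file
done

ls "$NPZDIR"
```

The resulting directory of .npz files can then be passed as --dataset, as described above.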

ZheMann commented 5 years ago

Sounds like a good idea. Currently, the encoding has been running for 3h45m and it says it should be finished in about 2 hours. So right now I hope the encoding will finish within the estimated time so I can store the encoded file somewhere safe. If things go wrong, I'll implement your solution.

ZheMann commented 5 years ago

It seems like I messed something up when forking your repo. After cloning your repo fresh, the encoding part only takes a couple of minutes indeed. I'll look into it to see whether I changed something accidentally. Sorry for bothering you.

rkfg commented 5 years ago

Such things happen indeed, no problem! Good luck with your experiments.

rkfg commented 5 years ago

I added parallel processing for .npz files, so now you'll get multiple smaller data files and the overall encoding should be faster. The only sequential part left is splitting the input file, but I think that's a small price to pay. Ideally, the ready splits would be processed right after creation, but that would be hard to synchronize, and splitting takes little time compared to encoding and collecting into .npz. It also needs much less memory because all chunks are processed independently.

ZheMann commented 5 years ago

Awesome dude, really appreciate the effort you have put into this.

In case you wondered: my issue indeed occurred after a (wrong) merge between nshepperd's repository and yours. In load_dataset.py the following statement was accidentally removed:

```python
elif path.endswith('.ids'):
    with open(path, 'r') as ids:
        tokens = np.stack(list(map(int, filter(None, ids.read().replace('\n', ' ').split(' ')))))
        token_chunks.append(tokens)
```

Therefore, when executing encode.sh, the .ids files were processed as if they were plain text.

rkfg commented 5 years ago

So you were basically encoding it twice (the second time in Python). Yeah, no wonder it took that long. I deliberately wrote the encoding as a shell script (actually, in C++, using the binary spm_encode tool) and not in Python, because I compared the speed and Python was WAY behind, several times slower, despite using the same native library. The simple line iteration at the Python level took too much time compared to the actual encoding (or maybe it was all the glue between Python and native code).

I'm a real fan of parallel processing because it saves so much time on these huge files. It's a bit tricky with shell scripts, but xargs -P helps a lot, basically providing a simple but effective worker pool.
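The split-then-parallelize pattern can be demonstrated with a small self-contained toy: here `wc -l` stands in for the real spm_encode call, and the chunk/file names are illustrative.

```shell
#!/bin/sh
# Toy demo of an xargs -P worker pool: split a file into chunks,
# then process the chunks with up to 4 parallel workers.
WORKDIR=$(mktemp -d)
seq 1 1000 > "$WORKDIR/input.txt"

# Sequential part: split the input into fixed-size pieces.
split -l 250 "$WORKDIR/input.txt" "$WORKDIR/chunk_"

# Parallel part: up to 4 workers at once, one chunk per job.
# (wc -l stands in for the real per-chunk encoding command.)
printf '%s\n' "$WORKDIR"/chunk_* \
    | xargs -P 4 -I{} sh -c 'wc -l < "$1" > "$1.count"' _ {}

cat "$WORKDIR"/chunk_*.count   # each chunk has 250 lines
```

The `sh -c '…' _ {}` form passes each filename as a positional parameter, which is a bit safer than substituting `{}` directly into the quoted command.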

ZheMann commented 5 years ago

> So you basically were encoding it twice (second time in Python). Yeah, no wonder it took this long.

Indeed, I was encoding my already-encoded files. After the second encoding finally finished, my model generated numbers instead of words, which is how I figured it out.