mgrankin / ru_transformers

Apache License 2.0

Failed to tokenize big dataset with YTTM #18

Closed · aquadzn closed this issue 4 years ago

aquadzn commented 4 years ago

Hi, thanks for the implementation.

How did you manage to tokenize your 200+ GB of text with YTTM?

I tried with ~150 GB in a single text file and got a memory error:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

I'm using an n1-standard-8 (8 vCPUs, 30 GB memory) on GCP. Maybe I need double the memory?

It seems to work with Huggingface Tokenizers, though.
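
For reference, something along these lines ran for me with Huggingface Tokenizers (the paths and hyperparameters below are placeholders, not my exact settings):

# Rough sketch of the Huggingface Tokenizers run; file paths and
# hyperparameters are placeholders, not the exact values I used.
from glob import glob
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=glob("corpus/*.txt"),  # list of plain-text files
    vocab_size=50000,
    min_frequency=2,
)
tokenizer.save_model(".")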

mgrankin commented 4 years ago

Hi,

Huggingface Tokenizers wasn't around when I started the project. Today I'd probably be using it.

I've tokenised a bunch of small files. I see no reason to have one 200 GB file.
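
In case it helps, the idea is roughly this (a sketch, not the actual scripts from this repo; the file names and vocab size are made up):

# Sketch only: train the YTTM vocab on a subset that fits in RAM,
# then encode the corpus one small file at a time.
import glob
import numpy as np
import youtokentome as yttm

# 1. Train the BPE model on a manageable sample of the corpus.
yttm.BPE.train(data="sample.txt", model="bpe.model", vocab_size=50257)

# 2. Encode each small file separately, so the full corpus never sits in memory.
bpe = yttm.BPE(model="bpe.model")
for i, path in enumerate(sorted(glob.glob("corpus/*.txt"))):
    with open(path, encoding="utf-8") as f:
        ids = bpe.encode(f.read().splitlines(), output_type=yttm.OutputType.ID)
    flat = np.array([t for line in ids for t in line], dtype=np.int32)
    np.save(f"tokenized/{i}.npy", flat)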

aquadzn commented 4 years ago

Hi,

> Huggingface Tokenizers wasn't around when I started the project. Today I'd probably be using it.
>
> I've tokenised a bunch of small files. I see no reason to have one 200 GB file.

Ok, thanks, I'll try that.

Also, if I run on TPU, do I need the trick for big datasets that you mention in the GPU version?

# My dataset is 230Gb and it doesn't fit in RAM, so each epoch is a random sample from it. That is why the loop.

while true
do
        ...
        sleep 1
done

mgrankin commented 4 years ago

There is always a separate VM with a DataLoader feeding the TPU; that is the current architecture regardless of your project. So yes, you need the trick. The trick is to sample a subset of the huge dataset for each run, with a different sample on every run.
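
In other words, each pass of that shell loop does something like this on the data side (illustrative only, not the actual repo code):

# Illustrative sketch: every run picks a fresh random subset of the
# tokenized shards, so the full corpus never has to fit in RAM.
import random
from glob import glob

shards = glob("tokenized/*.npy")                         # all tokenized pieces
subset = random.sample(shards, k=min(200, len(shards)))  # new sample each run

# The DataLoader on the VM that feeds the TPU is then built over `subset`,
# the model trains for one pass, saves a checkpoint, and the process exits
# so the outer `while true` loop can start the next run with a new sample.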