Hi,
Huggingface Tokenizers wasn't around when I started the project. Today I'd probably be using it.
I've tokenised a bunch of small files. I see no reason to have one 200 GB file.
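A minimal sketch of that "many small files" approach, assuming an already-trained YTTM model via the youtokentome Python API (the shard paths and the dtype are placeholders, not the repo's actual script):

```python
import glob

import numpy as np
import youtokentome as yttm

# BPE model trained beforehand (path is a placeholder).
bpe = yttm.BPE(model="bpe.model")

# Encode many small shards one at a time instead of a single 200 GB file,
# so only one shard's text is ever held in memory.
for path in sorted(glob.glob("shards/*.txt")):
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    ids = bpe.encode(lines, output_type=yttm.OutputType.ID)
    # uint16 assumes a vocab size below 65536 -- adjust if yours is larger.
    flat = np.fromiter((t for line in ids for t in line), dtype=np.uint16)
    np.save(path.replace(".txt", ".npy"), flat)
```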
ok thanks I'll try
And I was asking: if I run on a TPU, do I need to do the big-dataset trick that you mention for the GPU version?
# My dataset is 230Gb and it doesn't fit in RAM, so each epoch is a random sample from it. That is why the loop.
while true
do
...
sleep 1
done
?
There is always a separate VM with a Dataloader to feed the TPU; that is the current architecture regardless of your project. So yes, you need the trick. The trick is to sample a subset of the huge dataset for each run, with a different sample every run.
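A rough sketch of that sampling trick (the shard layout and the shards-per-run count are placeholders; tune them so one sample fits in RAM):

```python
import glob
import random

import numpy as np

# All pre-tokenized shards on disk -- far more than fits in RAM.
shards = glob.glob("tokenized/*.npy")

SHARDS_PER_RUN = 50  # placeholder: choose so one sample fits in memory


def sample_run():
    """Pick a random subset of shards for one run; each run sees a different slice."""
    chosen = random.sample(shards, SHARDS_PER_RUN)
    return np.concatenate([np.load(p) for p in chosen])


# The outer `while true` loop just repeats runs, so over time the
# model sees different samples of the whole corpus.
tokens = sample_run()
```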
Hi, thanks for the implementation.
How did you manage to tokenize your 200+ GB of text with YTTM?
I tried with ~150 GB in a single text file and got a memory error.
I'm using an n1-standard-8 (8 vCPUs, 30 GB memory) on GCP; maybe I need to double that?
With Huggingface Tokenizers, on the other hand, it seems to work.
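A minimal sketch of training a byte-level BPE with Huggingface Tokenizers over a set of text files (the file paths, vocab size, and special tokens below are placeholders):

```python
import glob

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=sorted(glob.glob("shards/*.txt")),  # list of text files (a single file also works)
    vocab_size=50257,                         # placeholder
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("tokenizer_out")
```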