minimaxir / aitextgen

A robust Python tool for text-based AI training and generation using GPT-2.
https://docs.aitextgen.io
MIT License

Building TokenDataset consumes excessive amounts of RAM #18

Open trisongz opened 4 years ago

trisongz commented 4 years ago

Following the documented method to build and cache a line-by-line TokenDataset with a single text file as input, the process crashed with OOM. I'm attempting to work around it by halving the file size.

Data: 17.1 GB -> 15,494,632 lines
Tokenizer: 350k vocab size
Merges: 4.9 MB
Vocab: 6.4 MB
Max length: 512

VM: AWS g4dn.16xlarge [64 vCPU / 256 GB RAM / 9 GB swap]

Possibly write to file as an optional checkpointed batching step, or delete entries from text_list as they're tokenized, to prevent Python from holding onto the memory?

minimaxir commented 4 years ago

I definitely have not tested loading that much data (also idk if you can even train GPT-2 with a 350k vocab size, that's insane).

There have been other reports of memory issues on larger datasets; however, that's governed by batch_encode_plus(), which I don't control. I wonder if I have to batch the batch encoding.

minimaxir commented 4 years ago

It is possible to read files and batch them iteratively; that may be the most scalable option for the time being.

The catch is that this blocks shuffling, which I'm OK with, since line-by-line datasets can be shuffled by the user beforehand if absolutely necessary.
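
A minimal sketch of what that could look like (not the current implementation; the helper name, file path, and chunk size are placeholders):

from transformers import GPT2TokenizerFast

def encode_in_chunks(file_path, tokenizer, chunk_size=100_000):
    # Read the line-by-line file in fixed-size chunks and encode each chunk
    # separately, so the full text list is never held in memory at once.
    tokens = []
    chunk = []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            chunk.append(line.rstrip("\n"))
            if len(chunk) == chunk_size:
                encoded = tokenizer.batch_encode_plus(chunk, add_special_tokens=False)
                for ids in encoded["input_ids"]:
                    tokens.extend(ids)
                chunk = []
    if chunk:  # encode any leftover lines
        encoded = tokenizer.batch_encode_plus(chunk, add_special_tokens=False)
        for ids in encoded["input_ids"]:
            tokens.extend(ids)
    return tokens

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # stand-in tokenizer
tokens = encode_in_chunks("dataset.txt", tokenizer)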

minimaxir commented 4 years ago

...and it's a good excuse to implement a pretty progress bar! :D

trisongz commented 4 years ago

I was able to get it to train with batch size 1 on the 50% split of the dataset (8.6 GB -> 7,747,316 lines).

Here is the config:

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.0,
  "bos_token_id": 0,
  "embd_pdrop": 0.0,
  "eos_token_id": 0,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 512,
  "n_embd": 768,
  "n_head": 16,
  "n_layer": 24,
  "n_positions": 512,
  "resid_pdrop": 0.0,
  "summary_activation": null,
  "summary_first_dropout": 0.0,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "vocab_size": 350000
}

It runs on about 69 GB of RAM using the cached dataset. I agree that batch_encode_plus is very memory-inefficient as well; I've run into memory issues prepping datasets for HF T5.

My band-aid solution would be to split the big text file into 500 MB-1 GB chunks, process each chunk, save it as a pickle, and flush the variables. The final step would be to open all the pickle files and join them together.

Love your GPU bar implementation btw. It should definitely be a standard.

trisongz commented 4 years ago

Also, I've found GPT-2 works more effectively with custom one-hot tokens, which I generally build in during pre-processing.

<command> Run this task: ... </command>

Would you consider adding that into the tokenizer?
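
For reference, a rough sketch of how added tokens are usually handled with transformers directly, assuming a stock GPT-2 checkpoint (not aitextgen's API):

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Register the custom control token so the BPE never splits it,
# then grow the embedding matrix to cover the new ID.
tokenizer.add_tokens(["<command>"])
model.resize_token_embeddings(len(tokenizer))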

minimaxir commented 4 years ago

Adding tokens is blocked by https://github.com/huggingface/tokenizers/issues/15, which appears to be coming in the next version of tokenizers (but it will have to wait on the base version of transformers pinning that release).

trisongz commented 4 years ago

The tokenizers library's documentation is really sparse, and it has a lot of nuances that I've run into as well.

What I've found works really well is following this notebook.

from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

!mkdir EsperBERTo
tokenizer.save("EsperBERTo")

The result can then be loaded with AutoTokenizer rather than GPT2TokenizerFast or GPT2Tokenizer. I'm not sure about the performance difference down the road yet.
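
For example, one way to load the saved files directly (assuming the EsperBERTo directory from the snippet above, which holds vocab.json and merges.txt in older tokenizers versions):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast(
    vocab_file="EsperBERTo/vocab.json",
    merges_file="EsperBERTo/merges.txt",
)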

minimaxir commented 4 years ago

The blocking issue is that the added tokens are not saved into vocab.json. (this is different from special_tokens)

trisongz commented 4 years ago

I wrote a custom function for batching, which was able to fit into memory where it previously didn't (even though it's still 150 GB+ of memory for an 8 GB file, it's less than before with a 1 GB file).

import itertools
import math
import pickle

def split_batches(text_list, tokenizer, num_splits=5):
    total_len = len(text_list)
    cutoff = math.ceil(total_len / num_splits)
    print('Starting Batching with {} per Batch'.format(cutoff))
    if (cutoff * num_splits) >= total_len:
        # Shrink the batch size until num_splits batches no longer overshoot
        # the list length. Note: any remainder beyond cutoff * num_splits is
        # left unprocessed (visible as "Remaining Items" in the log below).
        print('Total Batch Size will result in empty items. Reducing')
        while (cutoff * num_splits) >= total_len:
            cutoff -= 1
            if (cutoff * num_splits) <= total_len:
                print('New Batch Size is {}'.format(cutoff))
                break

    for i in range(num_splits):
        # Encode one slice at a time and spill the token IDs to disk so the
        # full set of encodings is never held in memory at once.
        curr_list = text_list[0:cutoff]
        batch_tokens = list(
                itertools.chain.from_iterable(
                    tokenizer.batch_encode_plus(curr_list, add_special_tokens=False)[
                        "input_ids"
                    ]
                )
            )
        pickle.dump(batch_tokens, open('tmp_{}.p'.format(i), 'wb'))
        batch_tot = len(batch_tokens)
        del batch_tokens, curr_list
        text_list = text_list[cutoff:]
        print('Completed Batch {}: {} Tokens. Remaining Items: {}/{}'.format(i, batch_tot, len(text_list), total_len))

    del text_list
    all_tokens = []
    print('Loading Tokens from Batches')
    for i in range(num_splits):
        # Reload each pickled batch and concatenate into the final token list.
        batch_tokens = pickle.load(open('tmp_{}.p'.format(i), 'rb'))
        all_tokens += batch_tokens
        print('Loaded {} Tokens. Total Tokens: {}'.format(len(batch_tokens), len(all_tokens)))
        del batch_tokens

    print('Completed. Total Tokens: {}'.format(len(all_tokens)))
    return all_tokens

It's probably more verbose than necessary (for debugging), and I didn't delete the tmp files.

It would be a drop-in replacement for the existing self.tokens assignment. I'll create a pull request once I make it more reliable.

self.tokens = split_batches(text_list, tokenizer, num_splits=5)
#self.tokens = list(
#    itertools.chain.from_iterable(
#        tokenizer.batch_encode_plus(text_list, add_special_tokens=False)[
#            "input_ids"
#        ]
#    )
#)

minimaxir commented 4 years ago

I am working on this issue currently (tl;dr the batch of batches approach as commented above, which requires a slight refactor).

I'm aiming to finish by this weekend.

trisongz commented 4 years ago

A note from what I learned on my end: while I was able to batch and create the datacache with RAM consumption maxing out at around 10x the file size (with 5 splits), when it goes back to training and loads from the datacache, it's still pretty RAM intensive.

Original file size: 8.2 GB -> 5,712,986 lines
Datacache size: 1.2 GB -> 2,037,896,910 subsets
System RAM used during training: 152 GB
Vocab size: 350k

Building New Dataset
Building Dataset from /spell/datacache/training/lm_batch_v2_0_5_gpt2.txt
Using TF.IO
05/22/2020 19:09:49 — INFO — aitextgen.TokenDataset — 5,712,986 texts loaded.
Starting Batching with 1142598 per Batch
Total Batch Size will result in empty items. Reducing
New Batch Size is 1142597
Completed Batch 0: 407517743 Tokens. Remaining Items: 4570389/5712986
Completed Batch 1: 407556109 Tokens. Remaining Items: 3427792/5712986
Completed Batch 2: 407649159 Tokens. Remaining Items: 2285195/5712986
Completed Batch 3: 407574533 Tokens. Remaining Items: 1142598/5712986
Completed Batch 4: 407599878 Tokens. Remaining Items: 1/5712986
Loading Tokens from Batches
Loaded 407517743 Tokens. Total Tokens: 407517743
Loaded 407556109 Tokens. Total Tokens: 815073852
Loaded 407649159 Tokens. Total Tokens: 1222723011
Loaded 407574533 Tokens. Total Tokens: 1630297544
Loaded 407599878 Tokens. Total Tokens: 2037897422
Completed. Total Tokens: 2037897422
05/22/2020 19:28:32 — INFO — aitextgen.TokenDataset — Caching and compressing dataset to /spell/datacache/aitextgen/nsfgpt2_lm_v2/dataset_cache.tar.gz
CPU times: user 3h 58s, sys: 13min 49s, total: 3h 14min 48s
Wall time: 23min 38s

minimaxir commented 4 years ago

RAM utilization while training is a separate issue, and one I have less control over.

minimaxir commented 4 years ago

The new implementation in https://github.com/minimaxir/aitextgen/commit/de06b3f35a7fa0d5bf992734444b0ebaeb094489 theoretically uses O(1)/constant memory, but I need to test it.
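
As an illustration of the constant-memory idea (a sketch only, not the actual commit; the file paths and dtype are assumptions):

import numpy as np
from transformers import GPT2TokenizerFast

def encode_to_disk(text_path, out_path, tokenizer):
    # Encode one line at a time and append the token IDs to a binary file,
    # so RAM usage stays flat regardless of corpus size.
    # uint32 covers large vocabularies (e.g. 350k); uint16 suffices below 65,536.
    with open(text_path, encoding="utf-8") as src, open(out_path, "wb") as dst:
        for line in src:
            ids = tokenizer.encode(line.rstrip("\n"), add_special_tokens=False)
            np.asarray(ids, dtype=np.uint32).tofile(dst)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # stand-in tokenizer
encode_to_disk("input.txt", "tokens.bin", tokenizer)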

trisongz commented 4 years ago

The new implementation for building the Dataset works great. It uses about 8x the file size in RAM for a 1.2 GB file.

05/26/2020 06:28:26 — INFO — aitextgen.TokenDataset — Encoding 1,142,599 sets of tokens from /content/datasets/input.txt.
05/26/2020 06:35:28 — INFO — aitextgen.TokenDataset — Caching and compressing dataset to /content/datasets/dataset_cache.tar.gz

CPU times: user 21min 20s, sys: 17.3 s, total: 21min 37s
Wall time: 7min 36s
TokenDataset containing 349,712,618 subsets loaded via cache.

However, training still crashes with OOM. I'm currently looking into (and stuck on) DataLoaders and collate functions, which would sample and batch the dataset rather than load it all into memory, since in both GPU and TPU training the loading can be mapped to the CPU and subsequently offloaded from memory.

https://github.com/PetrochukM/PyTorch-NLP/blob/master/torchnlp/samplers/bptt_sampler.py https://medium.com/speechmatics/how-to-build-a-streaming-dataloader-with-pytorch-a66dd891d9dd

The trouble I'm having is tokenizing the batches to return them to the model at runtime with the proper block size. Any ideas on that?
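
Roughly what I have in mind, as a sketch only (the token file, dtype, and block size are assumptions):

import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset

class StreamingTokenDataset(IterableDataset):
    def __init__(self, token_file, block_size=512):
        # np.memmap keeps the token IDs on disk; pages are read on demand.
        self.tokens = np.memmap(token_file, dtype=np.uint32, mode="r")
        self.block_size = block_size

    def __iter__(self):
        # Yield contiguous blocks of block_size token IDs as LongTensors.
        for start in range(0, len(self.tokens) - self.block_size, self.block_size):
            block = self.tokens[start:start + self.block_size]
            yield torch.from_numpy(block.astype(np.int64))

loader = DataLoader(StreamingTokenDataset("tokens.bin"), batch_size=8)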

DenisSergeevitch commented 4 years ago

@trisongz I have the same problem; 64 GB of RAM is not enough with these settings:

config = build_gpt2_config(vocab_size=50000, max_length=1024, dropout=0.0, n_embd=512, n_layer=8, n_head=8)

ai.train(data, num_steps=500000, generate_every=500, save_every=1000, save_gdrive=False, line_by_line=True, num_workers=32, batch_size=8)

Have you found a solution for this? I can't even start GPU training; it freezes with OOM too :(

minimaxir commented 4 years ago

(It'll be fixed when the numpy-dataset branch is merged in a day or two; doing final tests soon.)

minimaxir commented 4 years ago

v0.2.0 is released; see if that helps! Check the release notes for more info.

It uses numpy as a backend, and it will be very difficult to use less memory than it already does now.
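
To illustrate why the numpy backend matters for memory (rough numbers, not an aitextgen benchmark):

import sys
import numpy as np

# ~10M fake token IDs drawn from a 50k vocab.
rng = np.random.default_rng(0)
arr = rng.integers(0, 50_000, size=10_000_000, dtype=np.uint16)
ids = arr.tolist()

print(f"numpy uint16 array: {arr.nbytes / 1e6:.0f} MB")          # ~20 MB total
print(f"python list:        {sys.getsizeof(ids) / 1e6:.0f} MB")  # ~80 MB of pointers, plus ~28 bytes per int object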

DenisSergeevitch commented 4 years ago

@minimaxir Work like a charm, thank you!

3bl3gamer commented 4 years ago

Problem is still here.

Data: 565 MB, 9,815,684 lines
Tokenizer: 10k vocab size
Merges: 149 KB
Vocab: 207 KB
Max length: 256
RAM: 32 GB + 64 GB zRAM

TokenDataset consumes all memory and crashes.

ckoshka commented 3 years ago

Hi everyone, I have a notebook with a temporary solution to this issue here: https://github.com/fastelectronicvegetable/aitextgen_notebooks/blob/main/Encoding_very_large_text_files%20(2).ipynb

It uses a much more efficient tokenisation and training process; I was able to convert a 20 GB Vietnamese file on a Google Colab P100 just fine with very minimal RAM usage. I also figured out how to turn the vocab file output into an aitextgen.tokeniser.json, and how to convert the YTTM-encoded texts into a format that GPT-2 can read.

Two potential issues:

  1. I added escape characters for " and \ when converting it to a JSON file, but it might throw errors. In that case, you would want to use Huggingface's BPE trainer and specify the vocab as special tokens, but that's limited to around 13,000 tokens before it throws a memory error.
  2. The way I'm reading and saving the encodings as an npy file at the end is inefficient. It could probably be sped up through parallel instead of sequential execution, but I don't have the know-how to implement it; I only started learning Python properly two days ago.