graykode / matorage

Matorage is a tensor (multidimensional matrix) object storage manager for deep learning frameworks (PyTorch, TensorFlow V2, Keras)
https://matorage.readthedocs.io

Support inference of large models such as GPT-3 through storage computation. #16

Open graykode opened 3 years ago

graykode commented 3 years ago

In deep learning, large models such as GPT-3, T5, and Megatron-LM are growing in popularity. As a result, however, the polarization of wealth in AI is intensifying: only well-funded organizations can afford to run them.

As a concrete example, take GPT-3, a recently very hot topic. GPT-2 had 1.5B parameters and took about 6 GB on disk. GPT-3 has 175B parameters, so its weights alone can be expected to occupy about 700 GB.
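For reference, the 700 GB figure follows directly from the parameter count, assuming 4-byte fp32 weights:

# Rough on-disk weight sizes, assuming 4 bytes (fp32) per parameter.
gpt2_params = 1.5e9
gpt3_params = 175e9
print(gpt2_params * 4 / 1e9)  # -> 6.0 (GB, matches GPT-2's ~6 GB on disk)
print(gpt3_params * 4 / 1e9)  # -> 700.0 (GB)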

To train or run inference with existing frameworks, all weights must be loaded into memory. In the case of GPT-3, however, 700 GB of memory is out of reach for an ordinary PC.

But matorage can solve this problem. The philosophy of matorage's model storage is not to store a model as a single file, but to store it layer-wise. Matorage can therefore fetch only the sub-model weights that fit into the PC's memory, load them, run the computation, and store the calculated values back in file storage, as sketched below. It has a similar philosophy to pydata/numexpr.
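A minimal sketch of this layer-wise execution, assuming the layers are exposed as hypothetical fetch callables that load one layer's weights from storage (this illustrates the idea, not matorage's actual API):

import torch

def layerwise_forward(x, layer_fetchers):
    # Run a deep model one layer at a time: load a layer's weights,
    # apply it, then release the weights before loading the next layer.
    # Only a single layer (~1.8B parameters for GPT-3) needs to be
    # resident in memory at any moment.
    for fetch in layer_fetchers:
        layer = fetch()            # load this layer's weights from storage
        with torch.no_grad():      # forward only, no backward pass
            x = layer(x)           # x could also be persisted to storage here
        del layer                  # free the weights from memory
    return x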

The implementation of this feature will be reflected in 0.3.0. We will implement forward (inference) operations first rather than backward (training), and release the PyTorch version first. Once again, I hope that the future of AI will not be centralized by wealth, but decentralized by collective intelligence.

If you want to know more, please refer to these issues:

openai/gpt-3/issues/1

huggingface/transformers/issues/4658

Note: this issue does not use the official GPT-3 weights. The test randomly initializes a model with the same configuration as shown in the image below.

[image]

graykode commented 3 years ago

The following code checks the inference time of a single transformer layer:

import torch
from transformers.configuration_gpt2 import GPT2Config
from transformers.modeling_gpt2 import Block, GPT2Model

def count_parameters(model):
    # Count trainable parameters only.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

if __name__ == '__main__':
    # GPT-3 175B dimensions: 2048-token context, 12288-dim hidden
    # states, 96 attention heads (and 96 such blocks in the full model).
    n_ctx = 2048
    n_embd = 12288
    config = GPT2Config(n_embd=n_embd, n_head=96)

    # Build a single transformer block with the GPT-3 dimensions.
    model = Block(n_ctx=n_ctx, config=config)
    print('count_parameters', count_parameters(model))
    # model = GPT2Model(config)  # the full model, for comparison
    model.eval()

    # CUDA events bracket the forward pass; since the computation runs
    # on CPU here, this effectively measures wall-clock time.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    # Forward one dummy batch through the single block.
    y = model(torch.ones([1, n_ctx, n_embd]))
    end.record()
    torch.cuda.synchronize()
    print(start.elapsed_time(end))  # elapsed time in milliseconds

However, it takes about 44 seconds for a single layer, which works out to roughly an hour for all 96 layers (96 x 44 s ≈ 70 minutes).
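As an aside, since this forward pass runs on CPU, plain wall-clock timing gives an equivalent measurement without CUDA events; a minimal alternative sketch:

import time

start = time.perf_counter()
y = model(torch.ones([1, n_ctx, n_embd]))
print(time.perf_counter() - start)  # elapsed time in seconds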