replit / ReplitLM

Inference code and configs for the ReplitLM model family
https://huggingface.co/replit
Apache License 2.0

cuda use and out of memory #12

Closed titoBouzout closed 1 year ago

titoBouzout commented 1 year ago

Hey! So, to use CUDA,

I had to go here: https://developer.nvidia.com/cuda-downloads

then uninstall torch: pip uninstall torch

then download torch with CUDA support from here: https://pytorch.org/get-started/locally/
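
After that, a quick check confirms the CUDA build of torch actually got picked up:

import torch

print(torch.__version__)              # CUDA wheels usually carry a +cuXXX suffix
print(torch.version.cuda)             # CUDA version the build targets
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # e.g. the RTX 3070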

but now I am getting

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 8.00 GiB total capacity; 7.30 GiB already allocated; 0 bytes free; 7.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I couldn't figure out how to fix that error. Any clues? I'm on a Windows 10 laptop with a 3070.

I'm also not sure whether the configuration is still correct when running with CUDA, since I have to change the device. I'm using the following code as a test.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda:0"
dtype = torch.int8

tokenizer = AutoTokenizer.from_pretrained(
    "replit/replit-code-v1-3b", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "replit/replit-code-v1-3b",
    trust_remote_code=True,
    # attn_impl="triton",
    # init_device="meta",
    init_device=device,
)

model.to(device=device, dtype=dtype)

x = tokenizer.encode("def fibonacci(n): ", return_tensors="pt")
x = x.to(device=device, dtype=dtype)
y = model.generate(
    x,
    max_length=100,
    do_sample=True,
    top_p=0.95,
    top_k=4,
    temperature=0.2,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(
    y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(generated_code)

The config seems to be the default from config.json

Thanks!

madhavatreplit commented 1 year ago

Thanks for your issue!

Quick questions:

  1. In your snippet above, I am seeing

    # mode.to(device, dtype=torch.bfloat16)
    mode.to(device)

     Assuming "mode" to be "model" here. Can you try model.to(device, dtype=torch.bfloat16), i.e. put the model in bfloat16 precision on the GPU? (See the sketch after this list.)

  2. What's the VRAM on your 3070? 8GB?
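
For reference, a minimal sketch of what I mean, keeping the rest of your snippet the same:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda:0"

tokenizer = AutoTokenizer.from_pretrained(
    "replit/replit-code-v1-3b", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "replit/replit-code-v1-3b", trust_remote_code=True
)

# cast the weights to bfloat16 and move them to the GPU;
# the input ids stay integer tensors, only their device changes
model = model.to(device, dtype=torch.bfloat16)

x = tokenizer.encode("def fibonacci(n): ", return_tensors="pt").to(device)
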
Symbolk commented 1 year ago

This is expected: the 3070 only has 8GB of memory, but replit-code-v1-3b (~2.7B parameters) loaded in the default float32 precision needs approximately 10.8GB (2.7B * 4 bytes). Using bf16 should work, since the weights then need approximately 5.4GB (2.7B * 2 bytes).

P.S. In a nutshell, when loading a model onto a GPU, each billion parameters costs roughly 4GB in float32 precision, 2GB in (b)float16, and 1GB in int8. See also: https://huggingface.co/blog/trl-peft
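
As a quick back-of-the-envelope check of those numbers (weights only; activations, the KV cache, and CUDA overhead come on top):

def weight_memory_gb(params_billion: float, bytes_per_param: int) -> float:
    # rough VRAM needed just to hold the weights:
    # 1e9 params * N bytes per param = N GB per billion parameters
    return params_billion * bytes_per_param

for name, bytes_per_param in [("float32", 4), ("bfloat16", 2), ("int8", 1)]:
    print(f"2.7B params in {name}: ~{weight_memory_gb(2.7, bytes_per_param):.1f} GB")
# float32 -> ~10.8 GB, bfloat16 -> ~5.4 GB, int8 -> ~2.7 GB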

titoBouzout commented 1 year ago

Thanks, I updated the script in my earlier comment to use int8 and I'm still getting the same error, which is weird.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 8.00 GiB total capacity; 7.30 GiB already allocated; 0 bytes free; 7.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

There's an earlier warning too:

UserWarning: Using attn_impl: torch. If your model does not use alibi or prefix_lm we recommend using attn_impl: flash otherwise we recommend using attn_impl: triton. warnings.warn(
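
For reference, the max_split_size_mb hint from the error message can be tried by setting the allocator config before CUDA is first used, though it only mitigates fragmentation, not a model that simply doesn't fit:

import os

# must be set before the first CUDA allocation (easiest: before importing torch);
# 128 is just an example split size in MiB
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch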

titoBouzout commented 1 year ago

I discovered the issue. I was playing with this model https://huggingface.co/4bit/Replit-v1-CodeInstruct-3B and it worked. So I gave the original another try with the same configuration as the other one, and it turns out torch_dtype=torch.bfloat16 was missing from the AutoModelForCausalLM.from_pretrained call. Now it works on the same machine. :)

The complete script:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

data = "replit/replit-code-v1-3b"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(data, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    data,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    init_device=device,
)

model.to(device)

def codegenerator(s):
    x = tokenizer.encode(s, return_tensors="pt")
    x = x.to(device)
    y = model.generate(
        x,
        do_sample=True,
        use_cache=True,
        max_new_tokens=768,
        temperature=0.2,
        top_p=0.9,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

    # decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
    return tokenizer.decode(
        y[0][x.shape[-1] :],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )

print(codegenerator("def fibonacci(n): "))
print(codegenerator(" function reverseString(s) "))
pirroh commented 1 year ago

Glad you found out how to fix your issue. Closing for now!