shansongliu / MU-LLaMA

MU-LLaMA: Music Understanding Large Language Model
GNU General Public License v3.0

VRAM for gradio demo #20

Open TATEXH opened 8 months ago

TATEXH commented 8 months ago

I would like to run your gradio demo. How much VRAM do I need? I tried to run it on an RTX 4090 (24 GB), but it did not have enough VRAM.

uu95 commented 8 months ago

In addition to the above question, I would like to know how to split the model across 2 GPUs in case it does not fit on a single 24 GB GPU.

shansongliu commented 8 months ago

I would like to run your gradio demo. How much VRAM do I need? I tried to run it on an RTX 4090 (24 GB), but it did not have enough VRAM.

About 28 GB during inference. A 32 GB V100 will be enough for our 7B MU-LLaMA model.
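
(Aside: if you want to confirm the peak usage on your own hardware, PyTorch exposes a peak-memory counter you can read after a generation call; a minimal sketch, assuming a single CUDA device:)

import torch

# Reset the peak-memory counter, run one generation, then read the peak.
torch.cuda.reset_peak_memory_stats()
# ... run model.generate(...) here ...
print(f"Peak allocated VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")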

shansongliu commented 8 months ago

In addition to the above question, I would like to know how to split the model across 2 GPUs in case it does not fit on a single 24 GB GPU.

I think you can refer to the FSDP strategy, but we haven't tested this, so I'm not sure whether it will work. The best way is to find a 32 GB GPU to do the inference.

uu95 commented 8 months ago

Thanks for the reply, but I'm using the accelerate library for model parallelism across multiple GPUs and I'm still running out of memory. Any idea why? Here is the short script I modified from your demo.py:

# @uu: helping libraries
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
import torch

import data.utils as data
import llama

# @uu: model splitting
from accelerate import infer_auto_device_map, dispatch_model

model = llama.load(
    model_path="/home/ubaid/.cache/huggingface/hub/MU-LLaMA/checkpoint.pth", 
    llama_dir="/home/ubaid/.cache/huggingface/hub/MU-LLaMA/LLaMA", 
    knn=True, 
    mert_path="/home/ubaid/.cache/huggingface/hub/models--m-a-p--MERT-v1-330M/snapshots/af10da70c94a0c849de9cc94b83e12769c4db499",
    knn_dir="/home/ubaid/Music_Image_crossModal/MuIm_model/Music2Image/M2I_demo/music_captioners/MU_LLaMA/MU_LLaMA/ckpts",
    device="cpu",
)
model.eval()

# @uu: get model device map
device_map = infer_auto_device_map(
    model, 
    # max_memory={0: "20GiB", 1: "20GiB"},
    no_split_module_classes=["TransformerBlock"],
    # dtype=torch.float16,
)

# @uu: use accelerate for model loading
model = dispatch_model(
    model, 
    device_map=device_map
)

inputs = {}
audio = data.load_and_transform_audio_data(
    ['/home/ubaid/Music_Image_crossModal/MuIm_model/Music2Image/M2I_demo/music_captioners/MU_LLaMA/MU_LLaMA/ckpts/musiccap2_test.wav'], 
)
inputs['Audio'] = [audio, 1]

results = model.generate(
    inputs,
    [llama.format_prompt("Describe the music")],
    max_gen_len=256
)
result = results[0].strip()
print(result)

But it throws this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 282.00 MiB (GPU 0; 23.70 GiB total capacity; 22.21 GiB already allocated; 164.31 MiB free; 22.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
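
(Aside: the fragmentation workaround the error message mentions can be tried by setting the allocator option before the first CUDA allocation, e.g. at the very top of the script. Note that it only helps when reserved memory is much larger than allocated memory, which is not really the case here, so the model most likely just does not fit in full precision on one 24 GB card.)

import os

# Must be set before the first CUDA tensor is allocated, i.e. before anything is moved to the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"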

uu95 commented 8 months ago

I solved the previous problem, thank you for sharing your insights. I used the float16 version to fit the model onto 2 RTX 3090s, but during the experiments I found that changing the text input makes the model give random results, which is not desired. Can you explain why?
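
(For reference, a minimal sketch of the half-precision change to the script above, assuming the whole model tolerates a straight cast to float16; only the changed lines are shown.)

# Cast to half precision before planning the split, so accelerate's memory estimate matches what gets dispatched.
model = model.half()

device_map = infer_auto_device_map(
    model,
    no_split_module_classes=["TransformerBlock"],
    dtype=torch.float16,
)
model = dispatch_model(model, device_map=device_map)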

shansongliu commented 8 months ago

I solved the previous problem, thank you for sharing your insights. I used the float16 version to fit the model onto 2 RTX 3090s, but during the experiments I found that changing the text input makes the model give random results, which is not desired. Can you explain why?

In general, it is difficult to precisely control the outputs of large language models, and our model is no exception. Additionally, our model is based on a low-parameter version (7B) of LLaMA-2, and we haven't experimented with larger-scale models. If you're interested, you can try our method with a larger open-source LLM. We also plan to further optimize our model in the future. Currently, it may not be able to accurately identify the specific types of instruments contained in the music.
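
(One thing worth checking on the inference side is the sampling configuration; a hedged sketch, assuming model.generate accepts temperature and top_p keyword arguments as the gradio demo's sliders suggest. Lowering the temperature makes decoding closer to greedy and reduces run-to-run variation, though it will not make the outputs fully insensitive to prompt wording.)

results = model.generate(
    inputs,
    [llama.format_prompt("Describe the music")],
    max_gen_len=256,
    temperature=0.0,  # greedy-like decoding, less random variation
    top_p=0.75,
)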