VeritasJoker opened 9 months ago
podcast or tfs? I believe podcast..please confirm @VeritasJoker
podcast : )
Note: had to symlink facebook/opt-6.7b, facebook/opt-13b, and facebook/opt-30b into .cache/hub. This is probably the case with all facebook/* models. It is not the case with meta/* models (i.e., llama).
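For reference, a minimal sketch of the symlinking workaround described above. The source directory and the exact hub-cache layout names here are illustrative assumptions, not the paths actually used:

```python
import os
from pathlib import Path

# Illustrative paths -- adjust to where the downloaded snapshots actually live.
SOURCE_DIR = Path("/scratch/models")                      # assumed staging area
HUB_CACHE = Path.home() / ".cache/huggingface/hub"        # default HF hub cache

def link_model(repo_id: str) -> None:
    """Symlink a locally stored model snapshot into the HF hub cache.

    The hub cache expects directories named models--<org>--<name>.
    """
    cache_name = "models--" + repo_id.replace("/", "--")
    src = SOURCE_DIR / cache_name
    dst = HUB_CACHE / cache_name
    HUB_CACHE.mkdir(parents=True, exist_ok=True)
    if not dst.exists() and not dst.is_symlink():
        os.symlink(src, dst)

for model in ("facebook/opt-6.7b", "facebook/opt-13b", "facebook/opt-30b"):
    link_model(model)
```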
Update:
Need quantized version (prioritize the biggest ones first):
Would be nice if we have:
Thanks Harsha!
Quantized models - Completion Status
Note: All embeddings are in /scratch/gpfs/hgazula/247-pickling/results/podcast/661/embeddings_quantized
Still needed but not as urgent: quantized models for all gpt2 and gpt-neo
EleutherAI/gpt-neox-20b is failing with: "Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher."
For more info, see /scratch/gpfs/hgazula/247-pickling/logs/quantized-661-EleutherAI-gpt-neox-20b-cnxt-2048.err as well as https://github.com/huggingface/accelerate/issues/2084
Hi Harsha,
Can you tell me which quantization scheme you used? 4-bit or 8-bit? And did you do it simply with BitsAndBytesConfig or with some specialized method?
Thanks, Daria
Simply, BitsAndBytesConfig. I can push my local changes in a couple of hours if you want to see exactly where and how.
Great :) So is it 4- or 8-bit? I'd appreciate it if you would push your changes when you have the time.
4 bit. Here's an example
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# set quantization configuration to load a large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

CONFIG.model = AutoModelForCausalLM.from_pretrained(
    model_name, token=token, device_map="auto",
    output_hidden_states=True,
    quantization_config=bnb_config,
)
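As a back-of-the-envelope check of what NF4 buys you for the biggest models: 4-bit weights take roughly 0.5 bytes per parameter versus 2 bytes in fp16. The ~5% allowance for quantization constants below is an assumption for illustration, not a measured figure:

```python
def approx_nf4_gib(n_params: float, overhead: float = 0.05) -> float:
    """Rough weight-memory estimate for an NF4-quantized model.

    4-bit weights take ~0.5 bytes/param; `overhead` is an assumed
    allowance for quantization constants (double quantization shrinks it).
    """
    return n_params * 0.5 * (1 + overhead) / 2**30

for name, params in [("opt-6.7b", 6.7e9), ("opt-30b", 30e9)]:
    fp16_gib = params * 2 / 2**30
    print(f"{name}: ~{approx_nf4_gib(params):.1f} GiB quantized vs ~{fp16_gib:.1f} GiB in fp16")
```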
@hvgazula I'm trying to run some code with this config on della cluster, and get this error:
ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`
even though I have accelerate (and bitsandbytes) library installed in my conda env
Have you encountered this issue? Could this be CPU/GPU problem? If so, do you know if MIG GPU should be enough for running LLama2-7B on the Podcast data?
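One quick way to narrow this down (a sketch; a common cause of this ImportError is that the job runs a different interpreter than the one pip/conda installed into):

```python
import importlib.util
import sys

def check_packages(pkgs):
    """Return the packages the *current* interpreter cannot see."""
    return [p for p in pkgs if importlib.util.find_spec(p) is None]

# Print which Python is actually running, and whether it sees the libraries.
print("interpreter:", sys.executable)
missing = check_packages(["accelerate", "bitsandbytes", "transformers"])
print("missing:", missing or "none")
```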
What's the path to your conda environment? (`which python`). Also, I never downloaded the data on my personal computer, so I cannot answer the second question about M1 for llama2.
This is my conda env: /home/dl3994/.conda/envs/247-env/lib/python3.10
Tried running on della MIG GPU (according to instructions here https://researchcomputing.princeton.edu/systems/della#gpus), and still get the same error.
The weird thing is the same code runs fine on google colab.
Why is your python pointing to the lib folder and not bin? That is weird, as executables sit in bin, not in lib.
Can you try activating my environment and see what's going on? conda activate /home/hgazula/.conda/envs/247-main. If this doesn't work, let's schedule a time to meet and resolve all issues at once.
Using your env solved the accelerate issue, but now I get this error when trying to load the llama model:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /meta-llama/Llama-2-7b-hf/resolve/main/tf_model.h5 (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x149f1dbcbb50>: Failed to resolve 'huggingface.co' ([Errno -2] Name or service not known)"))
Are you familiar with it? If you have the time for a Zoom meeting, that would be great :) My email is daria.lioubashevsky@mail.huji.ac.il
that's because there is no internet connection on the compute node. Before running your job on the compute node, you must first download/cache the model on the head node.
Done:
Needed:
Would be nice if we have (in order):
Thanks Harsha!