VeritasJoker opened 9 months ago
podcast or tfs? I believe podcast..please confirm @VeritasJoker
podcast : )
Note: had to symlink facebook/opt-6.7b, facebook/opt-13b, and facebook/opt-30b into .cache/hub. This is probably the case with all facebook/* models. It is not the case with meta/* models (i.e., llama).
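For reference, a minimal sketch of the symlinking workaround described above. The source directory and the exact hub-cache layout names here are illustrative assumptions, not the paths actually used:

```python
import os
from pathlib import Path

# Illustrative paths -- adjust to where the downloaded snapshots actually live.
SOURCE_DIR = Path("/scratch/models")                      # assumed staging area
HUB_CACHE = Path.home() / ".cache/huggingface/hub"        # default HF hub cache

def link_model(repo_id: str) -> None:
    """Symlink a locally stored model snapshot into the HF hub cache.

    The hub cache expects directories named models--<org>--<name>.
    """
    cache_name = "models--" + repo_id.replace("/", "--")
    src = SOURCE_DIR / cache_name
    dst = HUB_CACHE / cache_name
    HUB_CACHE.mkdir(parents=True, exist_ok=True)
    if not dst.exists() and not dst.is_symlink():
        os.symlink(src, dst)

for model in ("facebook/opt-6.7b", "facebook/opt-13b", "facebook/opt-30b"):
    link_model(model)
```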
Update:
Need quantized version (prioritize the biggest ones first):
Would be nice if we have:
Thanks Harsha!
Quantized models - Completion Status
Note: All embeddings are in /scratch/gpfs/hgazula/247-pickling/results/podcast/661/embeddings_quantized
Still needed but not as urgent: quantized models for all gpt2 and gpt-neo
EleutherAI/gpt-neox-20b is failing with: "Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher."
For more info, see /scratch/gpfs/hgazula/247-pickling/logs/quantized-661-EleutherAI-gpt-neox-20b-cnxt-2048.err as well as https://github.com/huggingface/accelerate/issues/2084
Hi Harsha,
Can you tell me which quantization scheme you used? 4-bit or 8-bit? And did you do it simply with BitsAndBytesConfig or with some specialized method?
Thanks, Daria
Simply, BitsAndBytesConfig. I can push my local changes in a couple of hours if you want to see exactly where and how.
Great :) So is it 4- or 8-bit? I'd appreciate it if you would push your changes when you have the time.
4 bit. Here's an example
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# set quantization configuration to load a large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

CONFIG.model = AutoModelForCausalLM.from_pretrained(
    model_name, token=token, device_map="auto",
    output_hidden_states=True,
    quantization_config=bnb_config,
)
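As a back-of-the-envelope check of what NF4 buys you for the biggest models: 4-bit weights take roughly 0.5 bytes per parameter versus 2 bytes in fp16. The ~5% allowance for quantization constants below is an assumption for illustration, not a measured figure:

```python
def approx_nf4_gib(n_params: float, overhead: float = 0.05) -> float:
    """Rough weight-memory estimate for an NF4-quantized model.

    4-bit weights take ~0.5 bytes/param; `overhead` is an assumed
    allowance for quantization constants (double quantization shrinks it).
    """
    return n_params * 0.5 * (1 + overhead) / 2**30

for name, params in [("opt-6.7b", 6.7e9), ("opt-30b", 30e9)]:
    fp16_gib = params * 2 / 2**30
    print(f"{name}: ~{approx_nf4_gib(params):.1f} GiB quantized vs ~{fp16_gib:.1f} GiB in fp16")
```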
@hvgazula I'm trying to run some code with this config on della cluster, and get this error:
ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`
even though I have accelerate (and bitsandbytes) library installed in my conda env
Have you encountered this issue? Could this be CPU/GPU problem? If so, do you know if MIG GPU should be enough for running LLama2-7B on the Podcast data?
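One quick way to narrow this down (a sketch; a common cause of this ImportError is that the job runs a different interpreter than the one pip/conda installed into):

```python
import importlib.util
import sys

def check_packages(pkgs):
    """Return the packages the *current* interpreter cannot see."""
    return [p for p in pkgs if importlib.util.find_spec(p) is None]

# Print which Python is actually running, and whether it sees the libraries.
print("interpreter:", sys.executable)
missing = check_packages(["accelerate", "bitsandbytes", "transformers"])
print("missing:", missing or "none")
```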
What's the path to your conda environment? (`which python`). Also, I never downloaded the data on my personal computer, so I cannot answer the second question about M1 for llama2.
This is my conda env: /home/dl3994/.conda/envs/247-env/lib/python3.10
Tried running on della MIG GPU (according to instructions here https://researchcomputing.princeton.edu/systems/della#gpus), and still get the same error.
The weird thing is the same code runs fine on google colab.
Why is your python pointing to the lib folder and not bin? That is weird, as executables sit in bin, not in lib.
Can you try activating my environment and see what's going on? conda activate /home/hgazula/.conda/envs/247-main. If this doesn't work, let's schedule a time to meet and resolve all issues at once.
Using your env solved the accelerate issue, but now I get this error when trying to load the llama model:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /meta-llama/Llama-2-7b-hf/resolve/main/tf_model.h5 (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x149f1dbcbb50>: Failed to resolve 'huggingface.co' ([Errno -2] Name or service not known)"))
Are you familiar with it? If you have the time for a Zoom meeting, that would be great :) My email is daria.lioubashevsky@mail.huji.ac.il
that's because there is no internet connection on the compute node. Before running your job on the compute node, you must first download/cache the model on the head node.
Done:
Needed:
Would be nice if we have (in order):
Thanks Harsha!