sgupta1007 opened 5 days ago
I believe it should be "model_type: PHI3_MINI"
The model_type change resolved that error, but led to a new one: FullModelHFCheckpointer.load_checkpoint() got an unexpected keyword argument 'weights_only'
@joecummings , have you seen this before?
@sgupta1007 , i am not too familiar with the generate recipe, however, we are working on a V2 of it (https://github.com/pytorch/torchtune/pull/1563). There are opportunities to improve the quantization experience in it.
To unblock you for now, are you able to use generate without the quantization?
I am not able to use generate without quantization.
I will try to explain my approach for generation:
1. Perform phi3 QLoRA finetuning for 1 epoch
2. Supply the adapter and model weights to the checkpointer files in the config file
3. Keep the model component as torchtune.models.phi3.qlora_phi3_mini.
4. Run the generation command: tune run generate --config custom_quantization.yaml prompt='Explain some topic'
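The steps above roughly correspond to a generation config like the sketch below. This is only an illustration: the paths and checkpoint filenames are placeholders, and the exact field names (e.g. adapter_checkpoint) are assumptions based on torchtune's checkpointer and phi3 recipe conventions, so check them against your torchtune version.

```yaml
# Hypothetical generation config sketch for a QLoRA-finetuned phi3-mini.
# All paths and checkpoint filenames are placeholders.
model:
  _component_: torchtune.models.phi3.qlora_phi3_mini

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /path/to/finetune/output/
  checkpoint_files: [
    hf_model_0001_0.pt,
    hf_model_0002_0.pt
  ]
  # Adapter weights from the QLoRA run (field name assumed)
  adapter_checkpoint: adapter_0.pt
  output_dir: /path/to/finetune/output/
  model_type: PHI3_MINI

tokenizer:
  _component_: torchtune.models.phi3.phi3_mini_tokenizer
  path: /path/to/phi3/tokenizer.model
```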
@sgupta1007 since the adapter is already merged, why do we need to supply both the adapter and the model weights?
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /path/output/
  checkpoint_files: [
    hf_model_0001_0.pt,
    hf_model_0002_0.pt,
    hf_model_0003_0.pt,
    hf_model_0004_0.pt
  ]
  output_dir: /path/output/
  model_type: LLAMA3

device: cuda
dtype: bf16
seed: 1234

# Tokenizer arguments
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /path/llama3.1-8b/original/tokenizer.model

# Generation arguments; defaults taken from gpt-fast
prompt: "Tell me a joke?"
instruct_template: null
chat_format: null
max_new_tokens: 300
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300

# It is recommended to set enable_kv_cache=False for long-context models like Llama3.1
enable_kv_cache: True

quantizer: null
I am getting CUDA out of memory with this on an A100 GPU for an 8B model ... strange!
Can you run 'nvidia-smi' and confirm that there isn't a dead process consuming your memory before you run generate.py?
However, there was a known issue where the KV cache was in FP32 and was initialized with max_seq_len=131k, consuming a lot of memory before generation even started. There were a couple of PRs up to fix this.
I will let @joecummings and @SalmanMohammadi reply, since they were working on this.
Thanks for sharing this info!
Yep, this is almost certainly due to the fact that the KV cache is being initialized for 131k context length, which OOMs. Once #1449 lands, we can set a max length on the cache itself so that it doesn't initialize for the whole context length. In the meantime, here are some mitigations:
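To see why a full-length FP32 cache OOMs, here is a back-of-the-envelope estimate, assuming Llama-3.1-8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dim 128) and the 131072-token default context:

```python
# Rough KV-cache memory estimate; the shape values are assumptions
# based on Llama-3.1-8B's published architecture.
num_layers = 32       # transformer layers
num_kv_heads = 8      # grouped-query attention KV heads
head_dim = 128        # per-head dimension
max_seq_len = 131072  # default Llama-3.1 context length
bytes_fp32 = 4        # cache dtype before the fix

# 2x for keys and values, batch size 1
cache_bytes = 2 * num_layers * num_kv_heads * head_dim * max_seq_len * bytes_fp32
print(f"{cache_bytes / 2**30:.0f} GiB")  # 32 GiB
```

32 GiB for the cache alone, on top of roughly 16 GiB of bf16 weights, easily exhausts a 40 GB A100 before a single token is generated.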
This should be addressed with #1603 now that #1449 is in.
Hey @apthagowda97 - give this a try on our latest nightly build, it should work for you : )
I used the command
tune run generate --config custom_quantization.yaml prompt='Explain some topic'
to generate inference from a finetuned phi3 model through torchtune.
Config: custom_quantization.yaml
Error flagged: KeyError: 'PHI3'