Open quic-agokhale opened 2 weeks ago
Linter checks and DCO are failing. Could you please do the following and re-push to format the code:
- pip install pre-commit
- pre-commit install
- git commit -m ...
Just added unit tests, with the results below:
(qeff_env) eplatero@aus121-r760-0:/prj/crd/austin/validation/scratch/users/eplatero/qefficient_spd/efficient-transformers$ pytest tests/spd/test_tlm_dlm_export_and_compile.py
================================================================================================================================= test session starts ==================================================================================================================================
platform linux -- Python 3.8.20, pytest-8.3.3, pluggy-1.5.0 -- /prj/crd/austin/validation/scratch/users/eplatero/qefficient_spd/efficient-transformers/qeff_env/bin/python3.8
cachedir: .pytest_cache
rootdir: /prj/crd/austin/validation/scratch/users/eplatero/qefficient_spd/efficient-transformers
configfile: pyproject.toml
collected 2 items
tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_tlm_logit_dims[llama] WARNING - QEfficient - Updating attn_implementation to be 'eager', got None
Fetching 7 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 96579.37it/s]
WARNING - QEfficient - Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
============== Diagnostic Run torch.onnx.export version 2.0.0+cpu ==============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
=============== PyTorch vs. fp32 ONNXRT (MAD) ===============
logits 1.33514404296875e-05
past_keys (mean) 6.141860715367577e-07
past_value (mean) 4.351139068603516e-06
=====================================================================
Running AI 100 compiler: /opt/qti-aic/exec/qaic-exec -m=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/TinyLlama_TinyLlama-1.1B-Chat-v1.0_kv.onnx -aic-hw -aic-hw-version=2.0 -network-specialization-config=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/specializations.json -convert-to-fp16 -retained-state -aic-num-cores=16 -custom-IO-list-file=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/custom_io_int8.yaml -compile-only -aic-binary-dir=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/qpcs -mxfp6-matmul
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
===================== Compilation Done! =====================
PASSED
tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_dlm_logit_dims[llama] WARNING - QEfficient - Updating attn_implementation to be 'eager', got None
Fetching 7 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 52996.62it/s]
WARNING - QEfficient - Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
============== Diagnostic Run torch.onnx.export version 2.0.0+cpu ==============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
=============== PyTorch vs. fp32 ONNXRT (MAD) ===============
logits 1.33514404296875e-05
past_keys (mean) 6.141860715367577e-07
past_value (mean) 4.351139068603516e-06
=====================================================================
Running AI 100 compiler: /opt/qti-aic/exec/qaic-exec -m=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/TinyLlama_TinyLlama-1.1B-Chat-v1.0_kv.onnx -aic-hw -aic-hw-version=2.0 -network-specialization-config=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/specializations.json -convert-to-fp16 -retained-state -aic-num-cores=16 -custom-IO-list-file=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/custom_io_int8.yaml -compile-only -aic-binary-dir=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/qpcs -mxfp6-matmul
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
===================== Compilation Done! =====================
PASSED
============================================================================================================================ 2 passed in 551.93s (0:09:11) =============================================================================================================================
To integrate the SpD changes, I changed the API slightly, from:
# tlm
tlm = QEFFAutoModelForCausalLM.from_pretrained()
tlm.transform(num_speculative_tokens=)
tlm.export_and_compile()
# dlm
dlm = QEFFAutoModelForCausalLM.from_pretrained()
dlm.transform(is_dlm=True)
dlm.export_and_compile()
to:
# tlm
tlm = QEFFAutoModelForCausalLM.from_pretrained(model_name, num_speculative_tokens=)
tlm.export_and_compile()
# dlm
dlm = QEFFAutoModelForCausalLM.from_pretrained(model_name, is_dlm=True)
dlm.export_and_compile()
I made this change because from_pretrained()
automatically calls the transform
function, which then sets the is_transformed
member variable to True. Thus, everything happens in one step.
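To illustrate the one-step flow, here is a minimal sketch of the pattern described above. This is an assumption about the control flow, not the actual QEfficient source: the class name is suffixed with "Sketch" to make clear it is a stand-in, and the model name and token count are hypothetical placeholders.

```python
# Sketch (not the real QEfficient implementation) of why the new one-step
# API works: from_pretrained() forwards the SpD kwargs to transform(),
# which flips the is_transformed flag, so no separate transform() call
# is needed by the caller.
class QEFFAutoModelForCausalLMSketch:
    def __init__(self):
        self.is_transformed = False
        self.num_speculative_tokens = None
        self.is_dlm = False

    @classmethod
    def from_pretrained(cls, model_name, num_speculative_tokens=None, is_dlm=False):
        model = cls()
        # from_pretrained() calls transform() internally, so the caller
        # receives a ready-to-export model in one step.
        model.transform(num_speculative_tokens=num_speculative_tokens, is_dlm=is_dlm)
        return model

    def transform(self, num_speculative_tokens=None, is_dlm=False):
        self.num_speculative_tokens = num_speculative_tokens
        self.is_dlm = is_dlm
        self.is_transformed = True

# Hypothetical usage mirroring the new API shape:
tlm = QEFFAutoModelForCausalLMSketch.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", num_speculative_tokens=4
)
```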
Once the llama changes have been approved, the plan is to make corresponding changes to the rest of the supported models, along with their unit tests.
Also, we are still discussing where best to put documentation for these SpD changes. One option is to update the transform
doc with the two parameters, num_speculative_tokens
and is_dlm;
another is to add a new document explaining this. We would appreciate your thoughts.
@ochougul @irajagop @quic-rishinr Could you all please review this PR.
@quic-rishinr, thank you for the feedback. I have updated the changes. Please let me know what you think.
I explicitly added num_speculative_tokens
and is_dlm
to the transform
method to provide some documentation on how to create an SpD model.
Validation showing that the CB SpD unit tests pass is below:
=============================================================================================================================================================================== test session starts ===============================================================================================================================================================================
platform linux -- Python 3.8.20, pytest-8.3.3, pluggy-1.5.0 -- /prj/crd/austin/validation/scratch/users/eplatero/qefficient_spd/efficient-transformers/qeff_env/bin/python3.8
cachedir: .pytest_cache
rootdir: /prj/crd/austin/validation/scratch/users/eplatero/qefficient_spd/efficient-transformers
configfile: pyproject.toml
collected 2 items
tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_tlm_logit_dims[llama] WARNING - QEfficient - Updating attn_implementation to be 'eager', got None
Fetching 7 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 126009.13it/s]
WARNING - QEfficient - Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
============== Diagnostic Run torch.onnx.export version 2.0.0+cpu ==============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
=============== PyTorch vs. fp32 ONNXRT (MAD) ===============
logits 1.33514404296875e-05
past_keys (mean) 6.141860715367577e-07
past_value (mean) 4.351139068603516e-06
=====================================================================
Running AI 100 compiler: /opt/qti-aic/exec/qaic-exec -m=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/TinyLlama_TinyLlama-1.1B-Chat-v1.0_kv.onnx -aic-hw -aic-hw-version=2.0 -network-specialization-config=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/specializations.json -convert-to-fp16 -retained-state -aic-num-cores=16 -custom-IO-list-file=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/custom_io_int8.yaml -compile-only -aic-binary-dir=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/qpcs -mxfp6-matmul
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
===================== Compilation Done! =====================
PASSED
tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_dlm_logit_dims[llama] WARNING - QEfficient - Updating attn_implementation to be 'eager', got None
Fetching 7 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 51964.83it/s]
WARNING - QEfficient - Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
============== Diagnostic Run torch.onnx.export version 2.0.0+cpu ==============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
=============== PyTorch vs. fp32 ONNXRT (MAD) ===============
logits 1.33514404296875e-05
past_keys (mean) 6.141860715367577e-07
past_value (mean) 4.351139068603516e-06
=====================================================================
Running AI 100 compiler: /opt/qti-aic/exec/qaic-exec -m=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/TinyLlama_TinyLlama-1.1B-Chat-v1.0_kv.onnx -aic-hw -aic-hw-version=2.0 -network-specialization-config=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/specializations.json -convert-to-fp16 -retained-state -aic-num-cores=16 -custom-IO-list-file=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/custom_io_int8.yaml -compile-only -aic-binary-dir=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/qpcs -mxfp6-matmul
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
===================== Compilation Done! =====================
PASSED
========================================================================================================================================================================== 2 passed in 575.63s (0:09:35) ==========================================================================================================================================================================
Tomorrow, I will add unit tests validating this functionality on non-CB models to make sure it works as well.
Added a unit test that also covers a non-CB model with SpD. All four tests pass, as shown below:
$ pytest -rA tests/spd/test_tlm_dlm_export_and_compile.py
==================================================================================================== PASSES ====================================================================================================
_______________________________________________________________________________________ test_llama_tlm_logit_dims[llama0] _______________________________________________________________________________________
----------------------------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------------------------
WARNING  QEfficient:modeling_auto.py:111 Updating attn_implementation to be 'eager', got None
WARNING  QEfficient:export_hf_to_cloud_ai_100.py:354 Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
_______________________________________________________________________________________ test_llama_tlm_logit_dims[llama1] _______________________________________________________________________________________
----------------------------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------------------------
WARNING  QEfficient:modeling_auto.py:111 Updating attn_implementation to be 'eager', got None
WARNING  QEfficient:export_hf_to_cloud_ai_100.py:354 Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
_______________________________________________________________________________________ test_llama_dlm_logit_dims[llama0] _______________________________________________________________________________________
----------------------------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------------------------
WARNING  QEfficient:modeling_auto.py:111 Updating attn_implementation to be 'eager', got None
WARNING  QEfficient:export_hf_to_cloud_ai_100.py:354 Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
_______________________________________________________________________________________ test_llama_dlm_logit_dims[llama1] _______________________________________________________________________________________
----------------------------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------------------------
WARNING  QEfficient:modeling_auto.py:111 Updating attn_implementation to be 'eager', got None
WARNING  QEfficient:export_hf_to_cloud_ai_100.py:354 Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
============================================================================================ short test summary info ============================================================================================
PASSED tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_tlm_logit_dims[llama0]
PASSED tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_tlm_logit_dims[llama1]
PASSED tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_dlm_logit_dims[llama0]
PASSED tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_dlm_logit_dims[llama1]
======================================================================================== 4 passed in 1123.57s (0:18:43) =========================================================================================
The non-CB DLM test essentially exercises the vanilla non-CB workflow, since the only change is an extra specialization.
Thus, together these unit tests cover the SpD changes while preserving backward compatibility.
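To make the "extra specialization" point concrete, here is a hypothetical sketch. The field names and values below are assumptions for illustration only, not the actual QEfficient specializations.json schema.

```python
# Hypothetical illustration: a non-CB model compiles with a prefill and a
# decode specialization, and the DLM flag simply appends one additional
# decode-style specialization. All keys and values here are assumed for
# illustration, not taken from the real QEfficient code.
def build_specializations(prefill_seq_len, ctx_len, is_dlm=False):
    specializations = [
        # prefill: process the whole prompt in one pass
        {"batch_size": 1, "seq_len": prefill_seq_len, "ctx_len": ctx_len},
        # decode: generate one token at a time
        {"batch_size": 1, "seq_len": 1, "ctx_len": ctx_len},
    ]
    if is_dlm:
        # The DLM only adds one extra specialization; otherwise the
        # workflow is identical to the vanilla non-CB path.
        specializations.append({"batch_size": 1, "seq_len": 2, "ctx_len": ctx_len})
    return specializations
```

With this framing, it is easy to see why the non-CB DLM test doubles as a regression test for the vanilla workflow: dropping the extra entry recovers the original specialization list exactly.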
Let me know if this is sufficient testing @quic-rishinr, @ochougul, @irajagop, @vbaddi.
Once approved, I can move on to implementing the SpD changes on the rest of the supported models.
Hi @eplatero97, SpD support hasn’t been added to the CLI APIs like infer. Could you please add support for SpD in the CLI API as well?
(https://jira-dc.qualcomm.com/jira/browse/CLOUDPERF-43) This change has been validated and posted on behalf of Erick Platero.
It adds support for generating a Target LM to run as a verifier model by outputting the logits of all positions of the input sequence instead of only the last.
It also allows compiling the Target and Draft LMs with specializations that support SpD.
Usage:
# TLM
tlm = QEFFAutoModelForCausalLM.from_pretrained()
tlm.transform(num_speculative_tokens=)
tlm.export_and_compile()
# DLM
dlm = QEFFAutoModelForCausalLM.from_pretrained()
dlm.transform(is_dlm=True)
dlm.export_and_compile()
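The "all logits" change in this summary can be sketched with plain array slicing. This is an illustration under assumed shapes (batch 1, sequence length 4, hypothetical vocab size 32000), not the actual QEfficient export code.

```python
# Sketch of the TLM verifier change: a vanilla causal-LM export keeps
# only the last position's logits, while the SpD Target LM keeps logits
# for every input position so all speculative draft tokens can be scored
# in one forward pass. Shapes and vocab size are assumptions.
import numpy as np

batch, seq_len, vocab = 1, 4, 32000
logits = np.zeros((batch, seq_len, vocab), dtype=np.float32)

# Vanilla export: logits for the final position only.
last_only = logits[:, -1:, :]          # shape (1, 1, 32000)

# SpD TLM (verifier) export: logits for all positions.
all_positions = logits                 # shape (1, 4, 32000)
```

The extra positions are what let the target model verify each speculative token against its own distribution instead of re-running decode once per token.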