Open quic-agokhale opened 2 weeks ago
Linter checks and DCO are failing. Could you please do the following and re-push to format the code:
- pip install pre-commit
- pre-commit install
- git commit -m ...
Just added unit tests, with the results below:
(qeff_env) eplatero@aus121-r760-0:/prj/crd/austin/validation/scratch/users/eplatero/qefficient_spd/efficient-transformers$ pytest tests/spd/test_tlm_dlm_export_and_compile.py
================================================================================================================================= test session starts ==================================================================================================================================
platform linux -- Python 3.8.20, pytest-8.3.3, pluggy-1.5.0 -- /prj/crd/austin/validation/scratch/users/eplatero/qefficient_spd/efficient-transformers/qeff_env/bin/python3.8
cachedir: .pytest_cache
rootdir: /prj/crd/austin/validation/scratch/users/eplatero/qefficient_spd/efficient-transformers
configfile: pyproject.toml
collected 2 items
tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_tlm_logit_dims[llama] WARNING - QEfficient - Updating attn_implementation to be 'eager', got None
Fetching 7 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 96579.37it/s]
WARNING - QEfficient - Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
============== Diagnostic Run torch.onnx.export version 2.0.0+cpu ==============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
=============== PyTorch vs. fp32 ONNXRT (MAD) ===============
logits 1.33514404296875e-05
past_keys (mean) 6.141860715367577e-07
past_value (mean) 4.351139068603516e-06
=====================================================================
Running AI 100 compiler: /opt/qti-aic/exec/qaic-exec -m=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/TinyLlama_TinyLlama-1.1B-Chat-v1.0_kv.onnx -aic-hw -aic-hw-version=2.0 -network-specialization-config=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/specializations.json -convert-to-fp16 -retained-state -aic-num-cores=16 -custom-IO-list-file=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/custom_io_int8.yaml -compile-only -aic-binary-dir=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/qpcs -mxfp6-matmul
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
===================== Compilation Done! =====================
PASSED
tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_dlm_logit_dims[llama] WARNING - QEfficient - Updating attn_implementation to be 'eager', got None
Fetching 7 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 52996.62it/s]
WARNING - QEfficient - Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
============== Diagnostic Run torch.onnx.export version 2.0.0+cpu ==============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
=============== PyTorch vs. fp32 ONNXRT (MAD) ===============
logits 1.33514404296875e-05
past_keys (mean) 6.141860715367577e-07
past_value (mean) 4.351139068603516e-06
=====================================================================
Running AI 100 compiler: /opt/qti-aic/exec/qaic-exec -m=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/TinyLlama_TinyLlama-1.1B-Chat-v1.0_kv.onnx -aic-hw -aic-hw-version=2.0 -network-specialization-config=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/specializations.json -convert-to-fp16 -retained-state -aic-num-cores=16 -custom-IO-list-file=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/custom_io_int8.yaml -compile-only -aic-binary-dir=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/qpcs -mxfp6-matmul
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
===================== Compilation Done! =====================
PASSED
============================================================================================================================ 2 passed in 551.93s (0:09:11) =============================================================================================================================
To integrate the SpD changes, I changed the API slightly, from:
# tlm
tlm = QEFFAutoModelForCausalLM.from_pretrained()
tlm.transform(num_speculative_tokens=)
tlm.export_and_compile()
# dlm
dlm = QEFFAutoModelForCausalLM.from_pretrained()
dlm.transform(is_dlm=True)
dlm.export_and_compile()
to:
# tlm
tlm = QEFFAutoModelForCausalLM.from_pretrained(model_name, num_speculative_tokens=)
tlm.export_and_compile()
# dlm
dlm = QEFFAutoModelForCausalLM.from_pretrained(model_name, is_dlm=True)
dlm.export_and_compile()
I made this change because from_pretrained()
automatically calls the transform
function, which then sets the is_transformed
member variable to True. Thus, everything happens in one step.
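To illustrate the one-step flow, here is a minimal sketch of the pattern described above. This is an assumption about the control flow, not the actual QEfficient source: the class name is suffixed with "Sketch" to make clear it is a stand-in, and the model name and token count are hypothetical placeholders.

```python
# Sketch (not the real QEfficient implementation) of why the new one-step
# API works: from_pretrained() forwards the SpD kwargs to transform(),
# which flips the is_transformed flag, so no separate transform() call
# is needed by the caller.
class QEFFAutoModelForCausalLMSketch:
    def __init__(self):
        self.is_transformed = False
        self.num_speculative_tokens = None
        self.is_dlm = False

    @classmethod
    def from_pretrained(cls, model_name, num_speculative_tokens=None, is_dlm=False):
        model = cls()
        # from_pretrained() calls transform() internally, so the caller
        # receives a ready-to-export model in one step.
        model.transform(num_speculative_tokens=num_speculative_tokens, is_dlm=is_dlm)
        return model

    def transform(self, num_speculative_tokens=None, is_dlm=False):
        self.num_speculative_tokens = num_speculative_tokens
        self.is_dlm = is_dlm
        self.is_transformed = True

# Hypothetical usage mirroring the new API shape:
tlm = QEFFAutoModelForCausalLMSketch.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", num_speculative_tokens=4
)
```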
Once the llama changes have been approved, the plan is to make corresponding changes to the rest of the supported models, along with their unit tests.
Also, we are still discussing where best to put documentation for these SpD changes. One option is to update the transform
doc with the two parameters, num_speculative_tokens
and is_dlm;
another is to add a new document explaining this. We would appreciate your thoughts.
@ochougul @irajagop @quic-rishinr Could you all please review this PR.
@quic-rishinr, thank you for the feedback. I have updated the changes. Please let me know what you think.
I explicitly added num_speculative_tokens
and is_dlm
to the transform
method to provide some documentation on how to create an SpD model.
Validation showing that the CB SpD unit tests pass is below:
=============================================================================================================================================================================== test session starts ===============================================================================================================================================================================
platform linux -- Python 3.8.20, pytest-8.3.3, pluggy-1.5.0 -- /prj/crd/austin/validation/scratch/users/eplatero/qefficient_spd/efficient-transformers/qeff_env/bin/python3.8
cachedir: .pytest_cache
rootdir: /prj/crd/austin/validation/scratch/users/eplatero/qefficient_spd/efficient-transformers
configfile: pyproject.toml
collected 2 items
tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_tlm_logit_dims[llama] WARNING - QEfficient - Updating attn_implementation to be 'eager', got None
Fetching 7 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 126009.13it/s]
WARNING - QEfficient - Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
============== Diagnostic Run torch.onnx.export version 2.0.0+cpu ==============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
=============== PyTorch vs. fp32 ONNXRT (MAD) ===============
logits 1.33514404296875e-05
past_keys (mean) 6.141860715367577e-07
past_value (mean) 4.351139068603516e-06
=====================================================================
Running AI 100 compiler: /opt/qti-aic/exec/qaic-exec -m=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/TinyLlama_TinyLlama-1.1B-Chat-v1.0_kv.onnx -aic-hw -aic-hw-version=2.0 -network-specialization-config=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/specializations.json -convert-to-fp16 -retained-state -aic-num-cores=16 -custom-IO-list-file=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/custom_io_int8.yaml -compile-only -aic-binary-dir=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/qpcs -mxfp6-matmul
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
===================== Compilation Done! =====================
PASSED
tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_dlm_logit_dims[llama] WARNING - QEfficient - Updating attn_implementation to be 'eager', got None
Fetching 7 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 51964.83it/s]
WARNING - QEfficient - Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
============== Diagnostic Run torch.onnx.export version 2.0.0+cpu ==============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
=============== PyTorch vs. fp32 ONNXRT (MAD) ===============
logits 1.33514404296875e-05
past_keys (mean) 6.141860715367577e-07
past_value (mean) 4.351139068603516e-06
=====================================================================
Running AI 100 compiler: /opt/qti-aic/exec/qaic-exec -m=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/TinyLlama_TinyLlama-1.1B-Chat-v1.0_kv.onnx -aic-hw -aic-hw-version=2.0 -network-specialization-config=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/specializations.json -convert-to-fp16 -retained-state -aic-num-cores=16 -custom-IO-list-file=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx/custom_io_int8.yaml -compile-only -aic-binary-dir=/local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/qpc_16cores_1bs_32pl_128cl_-1mos_8fbs_1devices_mxfp6_mxint8/qpcs/qpcs -mxfp6-matmul
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
===================== Compilation Done! =====================
PASSED
========================================================================================================================================================================== 2 passed in 575.63s (0:09:35) ==========================================================================================================================================================================
Tomorrow, I will add unit tests validating this functionality on non-CB models to make sure it works as well.
Added a unit test that also covers a non-CB model with SpD. All four tests pass, as shown below:
$ pytest -rA tests/spd/test_tlm_dlm_export_and_compile.py
==================================================================================================== PASSES ====================================================================================================
_______________________________________________________________________________________ test_llama_tlm_logit_dims[llama0] _______________________________________________________________________________________
----------------------------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------------------------
WARNING  QEfficient:modeling_auto.py:111 Updating attn_implementation to be 'eager', got None
WARNING  QEfficient:export_hf_to_cloud_ai_100.py:354 Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
_______________________________________________________________________________________ test_llama_tlm_logit_dims[llama1] _______________________________________________________________________________________
----------------------------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------------------------
WARNING  QEfficient:modeling_auto.py:111 Updating attn_implementation to be 'eager', got None
WARNING  QEfficient:export_hf_to_cloud_ai_100.py:354 Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
_______________________________________________________________________________________ test_llama_dlm_logit_dims[llama0] _______________________________________________________________________________________
----------------------------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------------------------
WARNING  QEfficient:modeling_auto.py:111 Updating attn_implementation to be 'eager', got None
WARNING  QEfficient:export_hf_to_cloud_ai_100.py:354 Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
_______________________________________________________________________________________ test_llama_dlm_logit_dims[llama1] _______________________________________________________________________________________
----------------------------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------------------------
WARNING  QEfficient:modeling_auto.py:111 Updating attn_implementation to be 'eager', got None
WARNING  QEfficient:export_hf_to_cloud_ai_100.py:354 Overriding /local/mnt/qt_drive/users/eplatero/qeff_cache/TinyLlama/TinyLlama-1.1B-Chat-v1.0/onnx
============================================================================================ short test summary info ============================================================================================
PASSED tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_tlm_logit_dims[llama0]
PASSED tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_tlm_logit_dims[llama1]
PASSED tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_dlm_logit_dims[llama0]
PASSED tests/spd/test_tlm_dlm_export_and_compile.py::test_llama_dlm_logit_dims[llama1]
======================================================================================== 4 passed in 1123.57s (0:18:43) =========================================================================================
The non-CB DLM test essentially exercises the vanilla non-CB workflow, since the only change is an extra specialization.
Thus, together these unit tests cover the SpD changes while preserving backward compatibility.
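To make the "extra specialization" point concrete, here is a hypothetical sketch. The field names and values below are assumptions for illustration only, not the actual QEfficient specializations.json schema.

```python
# Hypothetical illustration: a non-CB model compiles with a prefill and a
# decode specialization, and the DLM flag simply appends one additional
# decode-style specialization. All keys and values here are assumed for
# illustration, not taken from the real QEfficient code.
def build_specializations(prefill_seq_len, ctx_len, is_dlm=False):
    specializations = [
        # prefill: process the whole prompt in one pass
        {"batch_size": 1, "seq_len": prefill_seq_len, "ctx_len": ctx_len},
        # decode: generate one token at a time
        {"batch_size": 1, "seq_len": 1, "ctx_len": ctx_len},
    ]
    if is_dlm:
        # The DLM only adds one extra specialization; otherwise the
        # workflow is identical to the vanilla non-CB path.
        specializations.append({"batch_size": 1, "seq_len": 2, "ctx_len": ctx_len})
    return specializations
```

With this framing, it is easy to see why the non-CB DLM test doubles as a regression test for the vanilla workflow: dropping the extra entry recovers the original specialization list exactly.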
Let me know if this is sufficient testing @quic-rishinr, @ochougul, @irajagop, @vbaddi.
Once approved, I can move on to implementing the SpD changes on the rest of the supported models.
Hi @eplatero97, SpD support hasn’t been added to the CLI APIs like infer. Could you please add support for SpD in the CLI API as well?
(https://jira-dc.qualcomm.com/jira/browse/CLOUDPERF-43) This change has been validated and posted on behalf of Erick Platero.
It adds support for generating a Target LM to run as a verifier model by outputting the logits of all positions of the input sequence instead of only the last.
It also allows compiling the Target and Draft LMs with specializations that support SpD.
Usage:
# TLM
tlm = QEFFAutoModelForCausalLM.from_pretrained()
tlm.transform(num_speculative_tokens=)
tlm.export_and_compile()
# DLM
dlm = QEFFAutoModelForCausalLM.from_pretrained()
dlm.transform(is_dlm=True)
dlm.export_and_compile()
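The "all logits" change in this summary can be sketched with plain array slicing. This is an illustration under assumed shapes (batch 1, sequence length 4, hypothetical vocab size 32000), not the actual QEfficient export code.

```python
# Sketch of the TLM verifier change: a vanilla causal-LM export keeps
# only the last position's logits, while the SpD Target LM keeps logits
# for every input position so all speculative draft tokens can be scored
# in one forward pass. Shapes and vocab size are assumptions.
import numpy as np

batch, seq_len, vocab = 1, 4, 32000
logits = np.zeros((batch, seq_len, vocab), dtype=np.float32)

# Vanilla export: logits for the final position only.
last_only = logits[:, -1:, :]          # shape (1, 1, 32000)

# SpD TLM (verifier) export: logits for all positions.
all_positions = logits                 # shape (1, 4, 32000)
```

The extra positions are what let the target model verify each speculative token against its own distribution instead of re-running decode once per token.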