pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

add missing cmake flags to build llama runner for android #3611

salykova closed this pull request 1 month ago

salykova commented 1 month ago

This PR fixes the Android build failure for llama3-instruct reported in https://github.com/pytorch/executorch/issues/3601.

The cmake flags

-DEXECUTORCH_BUILD_XNNPACK=ON
-DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON
-DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON
-DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON

are missing from the Llama README instructions for building ExecuTorch and the llama runner for Android. -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON in particular is required; without it the model fails at runtime.
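
For reference, here is a sketch of a complete configure-and-build step with the flags folded in. The toolchain path, ABI, and output directory below are the usual Android NDK conventions, assumed here rather than taken from this PR; adapt them to your checkout:

# Configure the core ExecuTorch build for Android (ANDROID_NDK must point
# at an installed NDK; directory names are illustrative).
cmake . -Bcmake-out-android \
  -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK/build/cmake/android.toolchain.cmake" \
  -DANDROID_ABI=arm64-v8a \
  -DCMAKE_INSTALL_PREFIX=cmake-out-android \
  -DCMAKE_BUILD_TYPE=Release \
  -DEXECUTORCH_BUILD_XNNPACK=ON \
  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
  -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
  -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON

# Build and install the core libraries that the llama runner links against.
cmake --build cmake-out-android -j16 --target install --config Release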

Tested and verified on a Samsung S23.
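
The runner binary comes from a second CMake pass against the install above; a rough sketch under the same assumptions (the examples/models/llama2 path matches the repo layout, the remaining options are illustrative):

# Configure and build the llama runner example; it picks up the custom,
# optimized, and quantized kernels enabled here and in the core build.
cmake examples/models/llama2 -Bcmake-out-android/examples/models/llama2 \
  -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK/build/cmake/android.toolchain.cmake" \
  -DANDROID_ABI=arm64-v8a \
  -DCMAKE_BUILD_TYPE=Release \
  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
  -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
  -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON
cmake --build cmake-out-android/examples/models/llama2 -j16 --config Release

The resulting llama_main binary, pushed to /data/local/tmp/llama together with the .pte model and tokenizer, is what the on-device run below exercises: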

(executorch) salykova@xmachine ~/Projects/executorch (main) $ adb shell "cd /data/local/tmp/llama && ./llama_main --model_path /data/local/tmp/llama/llama3_kv_sdpa_xnn_qe_4_32.pte --tokenizer_path /data/local/tmp/llama/tokenizer.bin --prompt '<|begin_of_text|> <|start_header_id|>User<|end_header_id|>Why is Pytorch so fast?<|eot_id|>' --seq_len 256"
I 00:00:00.011654 executorch:cpuinfo_utils.cpp:61] Reading file /sys/devices/soc0/image_version
I 00:00:00.012361 executorch:main.cpp:65] Resetting threadpool with num threads = 4
I 00:00:00.021018 executorch:runner.cpp:54] Creating LLaMa runner: model_path=/data/local/tmp/llama/llama3_kv_sdpa_xnn_qe_4_32.pte, tokenizer_path=/data/local/tmp/llama/tokenizer.bin
I 00:00:05.708101 executorch:runner.cpp:69] Reading metadata from model
I 00:00:05.708208 executorch:runner.cpp:134] get_vocab_size: 128256
I 00:00:05.708228 executorch:runner.cpp:134] get_bos_id: 128000
I 00:00:05.708244 executorch:runner.cpp:134] get_eos_id: 128001
I 00:00:05.708259 executorch:runner.cpp:134] get_n_bos: 1
I 00:00:05.708274 executorch:runner.cpp:134] get_n_eos: 1
I 00:00:05.708293 executorch:runner.cpp:134] get_max_seq_len: 128
I 00:00:05.708308 executorch:runner.cpp:134] use_kv_cache: 1
I 00:00:05.708324 executorch:runner.cpp:134] use_sdpa_with_kv_cache: 1
I 00:00:05.708339 executorch:runner.cpp:134] append_eos_to_prompt: 0
<|begin_of_text|> <|start_header_id|>User<|end_header_id|>Why is Pytorch so fast?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

PyTorch is considered fast for several reasons:

1. **Just-In-Time (JIT) compilation**: PyTorch has a built-in JIT compiler that can compile PyTorch models into optimized machine code at runtime. This allows PyTorch to skip the overhead of interpreted Python code and execute the model directly on the CPU or GPU, making it faster and more efficient.
2. **Autograd**: PyTorch's autograd system automatically calculates gradients, which is a crucial component of backpropagation. This means that PyTorch can
I 00:00:22.615399 executorch:runner.cpp:419]         Prompt Tokens: 14    Generated Tokens: 113
I 00:00:22.615425 executorch:runner.cpp:425]    Model Load Time:                6.028000 (seconds)
I 00:00:22.615450 executorch:runner.cpp:435]    Total inference time:           16.564000 (seconds)              Rate:  6.822024 (tokens/second)
I 00:00:22.615465 executorch:runner.cpp:443]            Prompt evaluation:      1.420000 (seconds)               Rate:  9.859155 (tokens/second)
I 00:00:22.615485 executorch:runner.cpp:454]            Generated 113 tokens:   15.144000 (seconds)              Rate:  7.461701 (tokens/second)
I 00:00:22.615501 executorch:runner.cpp:462]    Time to first generated token:  1.522000 (seconds)
I 00:00:22.615514 executorch:runner.cpp:469]    Sampling time over 127 tokens:  0.281000 (seconds)

PyTorchObserver {"prompt_tokens":14,"generated_tokens":113,"model_load_start_ms":1715721953250,"model_load_end_ms":1715721959278,"inference_start_ms":1715721959278,"inference_end_ms":1715721975842,"prompt_eval_end_ms":1715721960698,"first_token_ms":1715721960800,"aggregate_sampling_time_ms":281,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
pytorch-bot[bot] commented 1 month ago

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3611

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit c5ec15a2cf0951fdb5af3a40ee8a904f19a7c1e0 with merge base e8a520c4f37faf378da708006f090530052fce29: :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot commented 1 month ago

@kimishpatel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

salykova commented 1 month ago

@kimishpatel you are very welcome! Could you please review the code so that the PR can be merged? I've also opened two additional PRs, https://github.com/pytorch/executorch/pull/3619 and https://github.com/pytorch/executorch/pull/3618, to fix the Llama Android demo. PR https://github.com/pytorch/executorch/pull/3619 is critical because it fixes a JNI build failure (at least on my side). Could you please review and approve those as well?

Maybe include https://github.com/pytorch/executorch/pull/3611 and https://github.com/pytorch/executorch/pull/3619 in the 0.2.1 release?

facebook-github-bot commented 1 month ago

@kimishpatel merged this pull request in pytorch/executorch@8ddf836c2db7a6ed4c645fb246be5e495cc45d97.