[NPU] dump prefill IR for further C++ solution

Description

1. Why the change?

https://github.com/analytics-zoo/nano/issues/1716#issue-2628191642 To support pure c++ NPU solution, we need to provide a "compile" tool for user to save all needed files (IR / bin / blob).

2. User API changes

Added two params:

compile_full_model: if set to True, we will save prefill related IR or bin files, default to False
save_directory: directory used to save all needed files (IR / bin / blob), default to None

If we just want to do inference at python side, usage is not changed

model = AutoModelForCausalLM.from_pretrained(model_path,
                                           optimize_model=True,
                                           pipeline=True,
                                           load_in_low_bit=args.load_in_low_bit,
                                           max_context_len=args.max_context_len,
                                           max_prompt_len=args.max_prompt_len,
                                           quantization_group_size=args.quantization_group_size,
                                           torch_dtype=torch.float16,
                                           attn_implementation="eager",
                                           transpose_value_cache=not args.disable_transpose_value_cache,
                                           mixed_precision=True,
                                           trust_remote_code=True)

If we want to dump files to do further C++ inference

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             pipeline=True,
                                             load_in_low_bit=args.load_in_low_bit,
                                             max_context_len=args.max_context_len,
                                             max_prompt_len=args.max_prompt_len,
                                             quantization_group_size=args.quantization_group_size,
                                             torch_dtype=torch.float16,
                                             attn_implementation="eager",
                                             transpose_value_cache=not args.disable_transpose_value_cache,
                                             mixed_precision=True,
                                             trust_remote_code=True,
                                             compile_full_model=True,
                                             save_directory=save_dir)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.save_pretrained(save_dir)

3. Summary of the change

Added two params compile_full_model / save_directory to dump all needed files for further c++ inference support
Sort out what files are necessary
Code refactor

4. Verify correctness

[x] Qwen2.5 7B pipeline python CW
[x] Qwen2.5 7B pipeline c++ CW
[x] Qwen2 1.5B pipeline python CW
[x] Qwen2 1.5B pipeline c++ CW

Only update qwen2 for now, can be extended to other models later.

intel-analytics / ipex-llm

[NPU] dump prefill IR for further C++ solution #12402

Description

1. Why the change?

2. User API changes

3. Summary of the change

4. Verify correctness