
Intel® NPU Acceleration Library
Apache License 2.0

Profiling and Allocation Control for Partially Offloaded Models #29

Open · BICHENG opened this issue 4 months ago

BICHENG commented 4 months ago

Problem Description

While testing my diffuser model with the Intel NPU Acceleration Library, I noticed that sometimes the model is not fully offloaded to the NPU: a significant portion stays on the CPU without any error being reported, which can lead to unexpectedly poor performance.
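To make the symptom concrete, here is roughly how I spot the layers that were left untouched after compilation. This is a minimal sketch on a toy model, assuming the library's documented compile() entry point; the "class still comes from torch.nn" test is just my own heuristic for guessing which layers stayed on the CPU, not an official API.

```python
import torch
import intel_npu_acceleration_library as npu_lib

# Toy stand-in for my diffuser model; the real one is much larger.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 128),
    torch.nn.GELU(),
    torch.nn.Linear(128, 128),
)

# compile() is the library's documented entry point.
compiled = npu_lib.compile(model, dtype=torch.float16)

# Heuristic: any leaf module whose class still comes from torch.nn was
# presumably not replaced by an NPU-backed implementation and stays on CPU.
for name, module in compiled.named_modules():
    if len(list(module.children())) == 0:
        if type(module).__module__.startswith("torch.nn"):
            print(f"likely still on CPU: {name} ({type(module).__name__})")
```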

Desired Solution

To address this, I suggest introducing a profiler that identifies the most time-consuming operations in the model, as well as the "endpoints/breakpoints" where the transformation fails. Based on the profiling results, provide a way for users to manually decompose the model.
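By "manually decompose" I mean something like the sketch below: after profiling, compile only the blocks that are worth offloading and leave everything else untouched. The compile() call is the library's documented entry point; the toy pipeline and block names are purely hypothetical.

```python
import torch
import torch.nn as nn
import intel_npu_acceleration_library as npu_lib

# Hypothetical stand-in for a diffusion pipeline with several blocks.
class TinyPipeline(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Sequential(nn.Linear(64, 64), nn.GELU())
        self.unet = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
        self.decoder = nn.Linear(64, 64)

    def forward(self, x):
        return self.decoder(self.unet(self.text_encoder(x)))

pipe = TinyPipeline()

# Offload only the block the profiler flagged as the hot spot (here: unet);
# the other blocks keep their original operators and stay on the CPU.
pipe.unet = npu_lib.compile(pipe.unet, dtype=torch.float16)

out = pipe(torch.randn(1, 64))  # use the pipeline as usual
```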

I've also noticed quite a few NumPy operations during the compilation process, which don't appear to be part of torch's internal compilation.

Additionally, I acknowledge that this request might further complicate the existing issue: https://github.com/intel/intel-npu-acceleration-library/issues/26

Alternative Approaches

An alternative would be to improve the existing offloading mechanism so that the model is progressively decomposed into complete blocks and each block is offloaded to the NPU whenever possible.

The remaining parts that cannot be offloaded should keep their original operators and stay on their original device, as in the sketch below.
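To be concrete, I imagine the offloading loop behaving roughly like this: try each top-level block, and on any failure keep the original block exactly as it was. This is a sketch of the behaviour I'm proposing, not how the library currently works.

```python
import torch
import intel_npu_acceleration_library as npu_lib

def offload_blockwise(model: torch.nn.Module, dtype=torch.float16):
    """Try to offload each direct child block to the NPU; keep the original
    block (same operators, same device) whenever compilation fails.
    Proposed behaviour only, not what the library does today."""
    for name, child in list(model.named_children()):
        try:
            setattr(model, name, npu_lib.compile(child, dtype=dtype))
            print(f"offloaded to NPU: {name}")
        except Exception as err:
            print(f"left on original device: {name} ({err})")
    return model
```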

Additional Context

This request might significantly increase your workload, but the feature would be valuable for large models or for scenarios where NPU offloading is not entirely successful. It allows fine-grained control over the execution of individual operations, which can be crucial for optimizing performance.

At the very least, users should be informed about cases where the transformation fails. That way they can either work around the fallback or rely on it deliberately, instead of unknowingly expecting the NPU to handle operators that actually run slowly.

As a wild idea, maybe you could achieve an iGPU+NPU combo by directly using IPEX (intel/intel-extension-for-pytorch) in some magical way?🤔
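Purely to illustrate that wild idea (I haven't checked whether the two runtimes coexist nicely in one process): IPEX normally exposes the iGPU as the "xpu" device, so a split might look roughly like this.

```python
import torch
import intel_extension_for_pytorch as ipex  # iGPU ("xpu") backend
import intel_npu_acceleration_library as npu_lib

# Hypothetical split: the heavy block goes to the NPU, the lighter one to the iGPU.
heavy = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.GELU())
light = torch.nn.Linear(256, 256)

heavy = npu_lib.compile(heavy, dtype=torch.float16)   # NPU via this library
light = ipex.optimize(light.to("xpu").eval())         # iGPU via IPEX

x = torch.randn(1, 256)
y = heavy(x)                          # hopefully runs on the NPU
y = light(y.to("xpu")).to("cpu")      # hop to the iGPU and back
```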

Also, I'm very curious about the connection between this work and OpenVINO.

alessandropalla commented 3 months ago

Profiling is already implemented: we support torch.profiler (see the profile_llm script for an example implementation). I agree we should give users more control, both in quantization (we now support the Neural Compressor API, which should let the user select the quantization scheme) and in model compilation in general.
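For instance, a minimal sketch along these lines (profile_llm in the repo is the reference implementation; the toy model here is just for illustration):

```python
import torch
from torch.profiler import profile, ProfilerActivity
import intel_npu_acceleration_library as npu_lib

# Toy model for illustration; compile it with the library as usual.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
model = npu_lib.compile(model, dtype=torch.float16)

x = torch.randn(8, 512)
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(x)

# The table shows which ops dominate and helps spot work that stayed on the CPU.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))
```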

alessandropalla commented 3 months ago

> Also, I'm very curious about the connection between this work and OpenVINO.

OpenVINO is used as the backend for NPU operations. For more info, please tune in to the webinar I'll be doing about this next Wednesday:

BICHENG commented 3 months ago

I will continue to look for ways to identify the parts of the model that need to be "cut out". Thank you for the recent updates, great work!