openvinotoolkit / openvino_notebooks

📚 Jupyter notebook tutorials for OpenVINO™
Apache License 2.0

iGPU memory usage problem #2241

Closed NNsauce closed 3 weeks ago

NNsauce commented 1 month ago

Describe the bug
My hardware: Nezha development kit board with an N97 processor and Intel UHD Graphics
My OS: Ubuntu 22.04
My software: Python 3.10

I am running llm-chatbot with the qwen1.5-0.5b model (INT8). There is no wrong behavior when the model is loaded on the CPU or the iGPU. The iGPU has no dedicated memory, so the model has to be loaded into RAM shared with the system. What I could NOT understand is this: memory usage is about 400 MB, which is similar to the size of the converted Intel IR model (.bin), when I load the model on the CPU, but it rises to about 2 GB when I load the exact same model on the iGPU.

I have kept searching everywhere for 3 days, and no satisfying answer was found. Have you encountered this? Or could anyone explain this problem to me?
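For reference, the "delta usage" the reporter describes can be measured around the model load itself. A minimal sketch (standard library only; the `compile_model` line is an illustrative placeholder, not taken from the notebook):

```python
# Sketch: measure the process memory delta around a model load.
# Works on Linux and macOS via the POSIX resource module.
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of this process, in megabytes."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes, macOS in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return rss / divisor

before = peak_rss_mb()
# ... load/compile the model here, e.g. core.compile_model(model, "GPU") ...
after = peak_rss_mb()
print(f"delta: {after - before:.1f} MB")
```

Comparing this delta for `"CPU"` versus `"GPU"` isolates the load itself from any background usage.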

Iffa-Intel commented 1 month ago

@NNsauce are you sure that there are no other programs or software using the iGPU running in the background?

brmarkus commented 1 month ago

@NNsauce do you see the increase in memory usage only when loading the model for the first time? The model first needs to get "compiled"... the compiled shaders (and/or blob) could be stored locally, persistently (or temporarily only). Starting the application the next time could then benefit from the pre-compiled shaders (and/or blob), which could be loaded into the iGPU directly.
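The caching described here is controlled by OpenVINO's `CACHE_DIR` property. A minimal sketch (the model path is a placeholder, and the OpenVINO import is deferred into the function so the helper itself has no dependencies):

```python
# Sketch: enable OpenVINO's persistent model cache so subsequent loads
# on the GPU can skip kernel compilation. Paths are placeholders.
from pathlib import Path

CACHE_DIR = Path("model_cache")

def gpu_config(cache_dir: Path) -> dict:
    # "CACHE_DIR" is OpenVINO's property for persistent model caching.
    return {"CACHE_DIR": str(cache_dir)}

def compile_cached(model_xml: str, device: str = "GPU"):
    import openvino as ov  # deferred import; requires openvino installed
    core = ov.Core()
    # First call compiles and writes cache files; later calls reuse them.
    return core.compile_model(model_xml, device, gpu_config(CACHE_DIR))
```

Whether the cached load changes steady-state memory usage (as opposed to load time) is exactly the question raised below.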

NNsauce commented 1 month ago

@brmarkus do you mean the cache property? Intel documents the cache only as reducing compile time when using the GPU, without mentioning any memory increase.

And every time I load the model on the iGPU, it reaches the same memory usage (qwen-0.5b: 2 GB).

NNsauce commented 1 month ago

@Iffa-Intel yeah, I am sure of that. Even if there were some usage in the background, what I count is the delta in usage, not the total usage.

brmarkus commented 1 month ago

> @brmarkus do you mean the cache property? Intel documents the cache only as reducing compile time when using the GPU, without mentioning any memory increase.
>
> And every time I load the model on the iGPU, it reaches the same memory usage (qwen-0.5b: 2 GB).

Yes, after compiling, the compiled model can be stored/cached; you could even export and import a blob. Do you see a difference in memory usage when, e.g. in a new life-cycle after restarting, you load the already-compiled model (or import the blob) into the iGPU?
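The export/import workflow mentioned here can be sketched as follows (file names are placeholders; `CompiledModel.export_model()` and `Core.import_model()` are the OpenVINO APIs for blob serialization, and the import is deferred so the helpers have no hard dependency):

```python
# Sketch: export a compiled model to a blob file and re-import it later,
# avoiding recompilation on the next application start.
def export_blob(compiled_model, path: str) -> None:
    # export_model() serializes the compiled model to bytes.
    with open(path, "wb") as f:
        f.write(compiled_model.export_model())

def import_blob(path: str, device: str = "GPU"):
    import openvino as ov  # deferred import; requires openvino installed
    core = ov.Core()
    with open(path, "rb") as f:
        # import_model() loads the pre-compiled blob onto the device.
        return core.import_model(f.read(), device)
```

Comparing memory usage of `import_blob` against a fresh `compile_model` would show whether compilation itself accounts for the extra footprint.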

NNsauce commented 1 month ago

@brmarkus en..uh, I've given it a try. No difference at all. But I found that the cache file is about 3 times the size of the original INT8 model. Yesterday I saw an unofficial article that presented some test conclusions; one of them was that Intel will reserve some system memory for the iGPU as dynamic memory. Maybe it's true...

brmarkus commented 1 month ago

CPUs can greatly benefit from INT8 (or INT4) quantization, but not the iGPU, I think (the iGPU doesn't have those special AVX-VNNI instructions... the compiled shaders might use (much) more "FakeQuantize" operations?)

Iffa-Intel commented 3 weeks ago

Closing issue, feel free to re-open or start a new issue if additional assistance is needed.