intel / intel-extension-for-pytorch

A Python package that extends the official PyTorch to unlock additional performance on Intel platforms
Apache License 2.0

CPU memory leak during inference on Arc A770 #476

Closed: plusbang closed this issue 2 months ago

plusbang commented 9 months ago

Describe the bug

I found that CPU memory usage keeps increasing when running inference repeatedly for a long time on an Intel Arc A770.

Reproduce code

memory trend:

(screenshot: plot of used memory rising steadily over inference count)

Related code:

import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer
import intel_extension_for_pytorch as ipex
import psutil
import matplotlib.pyplot as plt

memory_usage = []

model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf',
                                             trust_remote_code=True,
                                             torch_dtype='auto',
                                             low_cpu_mem_usage=True,
                                             use_cache=True)

tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf', trust_remote_code=True)

model = model.half().to('xpu')

with torch.inference_mode():
    for i in range(100):
        input_ids = tokenizer.encode('What is AI?', return_tensors="pt").to('xpu')
        output = model.generate(input_ids,
                                max_new_tokens=32)
        torch.xpu.synchronize()  # wait for the XPU queue to drain before measuring
        output = output.cpu()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print('-'*20, 'Output', '-'*20)
        print(output_str)
        # record system-wide used memory (MiB) after each generation
        memory_info = psutil.virtual_memory()
        memory_usage.append(memory_info.used/(1024**2))

x = [i for i in range(len(memory_usage))]
plt.plot(x, memory_usage, marker='o', linestyle='-')
plt.xlabel('Inference Count')
plt.ylabel('Used/MB')
plt.title('Used Memory Over Time')
plt.savefig('memory_usage.png')
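
Note that psutil.virtual_memory().used is system-wide, so activity from other processes adds noise to the plot. A per-process variant of the measurement (a minimal sketch using the same psutil API; not part of the original report) isolates the suspected leak more cleanly:

```
import os
import psutil

proc = psutil.Process(os.getpid())

def rss_mb():
    # resident set size of this process only, in MiB
    return proc.memory_info().rss / (1024 ** 2)
```

In the loop above, memory_usage.append(rss_mb()) would then track only the inference process.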

Versions

Collecting environment information...
PyTorch version: N/A
PyTorch CXX11 ABI: N/A
IPEX version: N/A
IPEX commit: N/A
Build type: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: N/A
IGC version: N/A
CMake version: N/A
Libc version: glibc-2.35

Python version: 3.9.18 (main, Sep 11 2023, 13:41:44)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.19.0-41-generic-x86_64-with-glibc2.35
Is XPU available: N/A
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration: 
N/A
Intel OpenCL ICD version: 23.17.26241.33-647~22.04
Level Zero version: 1.3.26241.33-647~22.04

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          32
On-line CPU(s) list:             0-31
Vendor ID:                       GenuineIntel
Model name:                      13th Gen Intel(R) Core(TM) i9-13900K
CPU family:                      6
Model:                           183
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       1
Stepping:                        1
CPU max MHz:                     5800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5990.40
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       896 KiB (24 instances)
L1i cache:                       1.3 MiB (24 instances)
L2 cache:                        32 MiB (12 instances)
L3 cache:                        36 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-31
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.0.110+xpu
[pip3] numpy==1.26.2
[pip3] torch==2.0.1a0+cxx11.abi
[pip3] torchvision==0.15.2a0+cxx11.abi
[conda] intel-extension-for-pytorch 2.0.110+xpu              pypi_0    pypi
[conda] numpy                     1.26.2                   pypi_0    pypi
[conda] torch                     2.0.1a0+cxx11.abi          pypi_0    pypi
[conda] torchvision               0.15.2a0+cxx11.abi          pypi_0    pypi

Other relevant libraries:

transformers==4.31.0

YuningQiu commented 9 months ago

Hello, many thanks for raising this issue.

We will try reproducing it and get back to you soon.

devpramod-intel commented 9 months ago

@jingxu10 @YuningQiu I am seeing a similar pattern in system memory usage; it increases over time as we continuously run inference. (screenshot attached: memory_usage plot)

Disty0 commented 8 months ago

Same issue here with Kohya SS Stable Diffusion training. I am training a 512 dim SD 1.5 Lora at 1024x1536, Batch Size 1.

Kohya SS: https://github.com/kohya-ss/sd-scripts/

IPEX specific hijacks to fix dtype errors and make it compatible with CUDA syntax: https://github.com/kohya-ss/sd-scripts/tree/dev/library/ipex

Torch 2.1.0a0+cxx11.abi                                                                                                                                                                                              
Torch backend: Intel IPEX 2.1.10+xpu                                                                                                                                                                                 
Torch detected GPU: Intel(R) Arc(TM) A770 Graphics VRAM 16288 Compute Units 512
Python 3.10.13

OS: Arch Linux x86_64 
Kernel: 6.6.8-arch1-1
CPU: AMD Ryzen 7 5800X3D

Memory usage keeps increasing by 50 MB with each step. Reserved memory looks somewhat fine and stays around 5-12 GB, but virtual memory inflates to well above 400 GB after 7500 steps. That virtual memory shouldn't actually be backed by allocations, yet it stays mapped even though Linux reports it as virtual. ipexrun with jemalloc helps a little, but a 400 GB bloat is still too much.
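
To separate resident from virtual growth from inside the training process, both counters can be logged per step (a minimal sketch reading /proc/self/status; Linux only, not part of the original report):

```
def read_vm_status(*fields):
    # parse counters such as VmRSS / VmSize from /proc/self/status (Linux only)
    found = {}
    with open("/proc/self/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in fields:
                found[key] = value.strip()
    return found

print(read_vm_status("VmRSS", "VmSize"))  # e.g. call once per training step
```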

This is what my memory usage looks like (with ipexrun, using jemalloc): (screenshot attached)

This is how htop sees it:

(screenshot attached)

This is how GNOME System Monitor sees it:

(screenshot attached)

I let it run a little longer and it died with <Signals.SIGKILL: 9>.

This is what my memory usage looks like after it died: (screenshot attached)

(venv) disty:~/Downloads $ python collect_env.py 
/home/disty/Apps/AI/kohya_ss/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Collecting environment information...
PyTorch version: 2.1.0a0+cxx11.abi
PyTorch CXX11 ABI: Yes
IPEX version: 2.1.10+xpu
IPEX commit: a12f9f650
Build type: Release

OS: Arch Linux (x86_64)
GCC version: (GCC) 13.2.1 20230801
Clang version: 16.0.6
IGC version: 2024.0.0 (2024.0.0.20231017)
CMake version: version 3.28.1
Libc version: glibc-2.38

Python version: 3.10.13 (main, Aug 26 2023, 15:11:40) [GCC 13.2.1 20230801] (64-bit runtime)
Python platform: Linux-6.6.8-arch1-1-x86_64-with-glibc2.38
Is XPU available: True
DPCPP runtime version: 2024.0
MKL version: 2024.0
GPU models and configuration: 
[0] _DeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=15473MB, max_compute_units=512, gpu_eu_count=512)
Intel OpenCL ICD version: N/A
Level Zero version: N/A

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             16
On-line CPU(s) list:                0-15
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen 7 5800X3D 8-Core Processor
CPU family:                         25
Model:                              33
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
Stepping:                           2
Frequency boost:                    enabled
CPU(s) scaling MHz:                 80%
CPU max MHz:                        4548.8281
CPU min MHz:                        2200.0000
BogoMIPS:                           6802.23
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Virtualization:                     AMD-V
L1d cache:                          256 KiB (8 instances)
L1i cache:                          256 KiB (8 instances)
L2 cache:                           4 MiB (8 instances)
L3 cache:                           96 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.1.10+xpu
[pip3] lion-pytorch==0.0.6
[pip3] numpy==1.24.4
[pip3] open-clip-torch==2.20.0
[pip3] pytorch-lightning==1.9.0
[pip3] torch==2.1.0a0+cxx11.abi
[pip3] torchmetrics==1.2.1
[pip3] torchvision==0.16.0a0+cxx11.abi
[conda] N/A

Disty0 commented 8 months ago

Using TCMalloc fixes it but Intel Compute Runtime can fail with TCMalloc or Jemalloc on some systems.

Edit: TCMalloc doesn't completely fix it but definitely helps to reduce it. Still over 300 GB with long runs.
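
If it is unclear whether the substitute allocator is actually active (a partial preload could also explain the compute runtime failures), the mapped allocator can be checked from inside Python (a small sketch; Linux only):

```
# check which allocator libraries are mapped into the current process
with open("/proc/self/maps") as f:
    maps = f.read()
for name in ("tcmalloc", "jemalloc"):
    print(f"{name} loaded: {name in maps}")
```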

sobomax commented 5 months ago

We are observing a similar issue here with long-running SpeechT5 TTS models on the A770, with some custom bells and whistles. The very same code running on a CUDA GPU is not a problem: stable as a rock. Verified with both heaptrack and memray. No custom allocators were of any use.

(screenshots attached: heaptrack and memray memory profiles on the A770)

For comparison, this is how the same code runs on CUDA (note the scale here is MB, not GB as in the previous ones), so those peaks correspond to the small blips in the Intel pictures. Which to me means that my code DTRT when releasing objects.

(screenshot attached: memory profile of the same code on CUDA)

P.S. I've just updated to the latest 2.1.20+xpu and all the 2024.1 as recommended and it did not help.

chao-camect commented 5 months ago

It has been more than 5 months since this was first reported.

It reminds me of an old issue: a crash / leak in Intel's media driver. I couldn't produce the minimal program Intel engineers required to reproduce it. It could have been reproduced easily if they had ever tried to use ffmpeg to decode a stream from security cameras. I provided an RTSP stream for them to test with; it didn't work for them. I eventually found the bugs myself, by reading the crash stack trace and the related code.

I don't want to be too harsh here. I like the hardware and the software stack built on top of it. My bad experiences are with how Intel engineers treat bugs. I don't think they ever test the software enough. That's probably fine for open source projects, since you have a community helping you. However, they don't take bugs seriously; they don't try hard to understand or fix the problems. Throughout my interactions with Intel engineers, it's always been me pushing them. We even found some friends to push them from the inside. No use. As a software engineer myself, I would be very sorry to hear about such bugs from users, and I would try my best to understand the issues and fix them. Some crashes and memory leaks are indeed difficult to find and fix, but I don't think that's the case here, since these can be reproduced so easily.

jingxu10 commented 5 months ago

Thanks for raising this issue and providing us additional data. Apologies for the limited status updates from us. We want to assure you that this issue is continually being worked on since it was first reported. Triaging has proven to be involved, taking a long time. We will share status as soon as we have a meaningful update. Thank you for your patience.

sobomax commented 5 months ago

@chao-camect I can feel your pain.

pujaltes commented 4 months ago

Is there an update on this issue, @jingxu10? We are also hitting this bug (the same models show stable CPU memory on Ampere GPUs), which makes it nearly impossible to train models of any significant size over extended periods with XPUs.

jingxu10 commented 4 months ago

Hi, we will update our findings soon.

jingxu10 commented 4 months ago

> Same issue here with Kohya SS Stable Diffusion training. I am training a 512 dim SD 1.5 Lora at 1024x1536, Batch Size 1. […] Memory usage keeps increasing by 50 MB with each step. […]

Hi @Disty0 , thanks for sharing this finding. Could you help us reproduce your run on our side? Any guidance or commands would be helpful.

jzhoulon commented 4 months ago

@sobomax thanks for the great info you provided. We see a similar picture when running Llama2 inference with heaptrack: the top memory consumer identified by the tool is also dnnl_memory_desc_create_with_strides. However, the tool's leak check appears to misattribute this function because of an irregular code pattern in oneDNN, where the "free" happens in a destructor outside the temporary object's liveness scope. We have root-caused some memory leak issues and are working on fixing and testing them. If you can share your model link and test scripts, it will help us pinpoint your exact pain.

Memory object `md` allocation side:

status_t dnnl_memory_desc_create_with_strides(memory_desc_t **memory_desc,
        int ndims, const dims_t dims, data_type_t data_type,
        const dims_t strides) {
    if (any_null(memory_desc)) return invalid_arguments;

    auto md = utils::make_unique<memory_desc_t>();
    if (!md) return out_of_memory;
    CHECK(memory_desc_init_by_strides(*md, ndims, dims, data_type, strides));
    (*memory_desc) = md.release();
    return success;
}

Memory object `md` deallocation side:

template <>
struct handle_traits<dnnl_memory_t> {
    static dnnl_status_t destructor(dnnl_memory_t p) {
        return dnnl_memory_destroy(p);
    }
};

template <>
struct handle_traits<dnnl_primitive_desc_t> {
    static dnnl_status_t destructor(dnnl_primitive_desc_t p) {
        return dnnl_primitive_desc_destroy(p);
    }
};

template <>
struct handle_traits<dnnl_primitive_t> {
    static dnnl_status_t destructor(dnnl_primitive_t p) {
        return dnnl_primitive_destroy(p);
    }
};

sobomax commented 4 months ago

> We see a similar picture when running Llama2 inference with heaptrack […] If you can share your model link and test scripts, it will help us pinpoint your exact pain.

@jzhoulon the following simple inference code reliably leaks memory on my A770 at a rate of about 1 GB per 5 minutes. Hope it helps. Thanks for your help!

ipex_bug.py.txt
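
For readers without the attachment, the reproducer is presumably shaped roughly like this (a hypothetical reconstruction of a SpeechT5 inference loop, not the actual ipex_bug.py; the model checkpoint and the random speaker embedding are placeholder assumptions):

```
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the xpu device)
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").to("xpu")
speaker = torch.randn(1, 512, device="xpu")  # placeholder x-vector

with torch.inference_mode():
    for _ in range(1000):  # watch host RSS while this runs
        inputs = processor(text="What is AI?", return_tensors="pt").to("xpu")
        speech = model.generate_speech(inputs["input_ids"], speaker)
        torch.xpu.synchronize()
```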

Disty0 commented 4 months ago

> Hi @Disty0 , thanks for sharing this finding. Could you help us reproduce your run on our side? Any guidance or commands would be helpful.

The code structure of that repo has changed, so the ipex code now sits here: https://github.com/kohya-ss/sd-scripts/tree/bfb352bc433326a77aca3124248331eb60c49e8c/library/ipex

Installation guide for IPEX: https://www.technopat.net/sosyal/konu/installing-kohya-ss-with-intel-arc-gpus.2869152/

Example training config (paths should point to a folder containing another folder, named something like 1_name, that has the images in it):

```
{ "LoRA_type": "Standard", "LyCORIS_preset": "full", "adaptive_noise_scale": 0, "additional_parameters": "", "block_alphas": "", "block_dims": "", "block_lr_zero_threshold": "", "bucket_no_upscale": true, "bucket_reso_steps": 64, "cache_latents": true, "cache_latents_to_disk": true, "caption_dropout_every_n_epochs": 0.0, "caption_dropout_rate": 0.24, "caption_extension": "", "clip_skip": "1", "color_aug": false, "constrain": 0.0, "conv_alpha": 1, "conv_block_alphas": "", "conv_block_dims": "", "conv_dim": 1, "debiased_estimation_loss": false, "decompose_both": false, "dim_from_weights": false, "down_lr_weight": "", "enable_bucket": true, "epoch": 100, "factor": -1, "flip_aug": false, "fp8_base": false, "full_bf16": true, "full_fp16": false, "gpu_ids": "", "gradient_accumulation_steps": "1", "gradient_checkpointing": true, "keep_tokens": 12, "learning_rate": 0.0001, "logging_dir": "/mnt/DataSSD/AI/train/raifu/log", "lora_network_weights": "", "lr_scheduler": "constant", "lr_scheduler_args": "", "lr_scheduler_num_cycles": "", "lr_scheduler_power": "", "lr_warmup": 0, "max_bucket_reso": 2048, "max_data_loader_n_workers": "0", "max_grad_norm": 1, "max_resolution": "1024,1536", "max_timestep": 1000, "max_token_length": "75", "max_train_epochs": "", "max_train_steps": "", "mem_eff_attn": false, "mid_lr_weight": "", "min_bucket_reso": 256, "min_snr_gamma": 0, "min_timestep": 0, "mixed_precision": "bf16", "model_list": "custom", "module_dropout": 0, "multi_gpu": false, "multires_noise_discount": 0, "multires_noise_iterations": 0, "network_alpha": 8, "network_dim": 64, "network_dropout": 0, "no_token_padding": false, "noise_offset": 0, "noise_offset_type": "Original", "num_cpu_threads_per_process": 2, "num_machines": 1, "num_processes": 1, "optimizer": "AdamW", "optimizer_args": "", "output_dir": "/mnt/DataSSD/AI/train/raifu/model", "output_name": "raifu", "persistent_data_loader_workers": false, "pretrained_model_name_or_path": "cagliostrolab/animagine-xl-3.0", "prior_loss_weight": 1.0, "random_crop": false, "rank_dropout": 0, "rank_dropout_scale": false, "reg_data_dir": "", "rescaled": false, "resume": "", "sample_every_n_epochs": 0, "sample_every_n_steps": 0, "sample_prompts": "", "sample_sampler": "euler_a", "save_every_n_epochs": 1, "save_every_n_steps": 250, "save_last_n_steps": 0, "save_last_n_steps_state": 0, "save_model_as": "safetensors", "save_precision": "fp16", "save_state": false, "scale_v_pred_loss_like_noise_pred": false, "scale_weight_norms": 0, "sdxl": true, "sdxl_cache_text_encoder_outputs": false, "sdxl_no_half_vae": false, "seed": "123456789", "shuffle_caption": false, "stop_text_encoder_training": 0, "text_encoder_lr": 0.0, "train_batch_size": 2, "train_data_dir": "/mnt/DataSSD/AI/train/raifu/img", "train_norm": false, "train_on_input": true, "training_comment": "", "unet_lr": 0.0001, "unit": 1, "up_lr_weight": "", "use_cp": false, "use_scalar": false, "use_tucker": false, "use_wandb": false, "v2": false, "v_parameterization": false, "v_pred_like_loss": 0, "vae": "", "vae_batch_size": 0, "wandb_api_key": "", "weighted_captions": false, "xformers": "sdpa" }
```

I checked if my GradScaler CPU offload is leaking memory but Kohya SS doesn't use GradScaler at all.

sobomax commented 4 months ago

@pujaltes @jzhoulon I've upgraded to the latest v2.1.30+xpu build with the recommended oneAPI version as prescribed here (https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.1.30%2bxpu&os=linux%2fwsl2&package=pip), and the xpu inference leak still exists. Running the ipex_bug.py provided in my earlier message STILL leaks memory at a rate of about 1 GB per 10 minutes. So whatever bug was fixed in #615, it's not the one I am seeing here. We are still waiting for a proper investigation and fix. Thanks!

(screenshots attached: memory usage still climbing on v2.1.30+xpu)

devpramod commented 4 months ago

@sobomax we will rerun the tests with v2.1.30+xpu on the script you provided and update our findings soon.

huiyan2021 commented 4 months ago

Hi @Disty0 ,

could you also try v2.1.30+xpu with oneAPI 2024.1 and see if the memory leak still exists? I cannot reproduce "Memory usage keeps increasing by 50 MB with each step" on my side using your script...

huiyan2021 commented 4 months ago

Hi @Disty0 , we cannot reproduce the memory increase during training using either 2.1.20+xpu or 2.1.30+xpu with oneAPI 2024.1. What I observed is that memory increases after I click "Start training" in the gradio web UI, but stays steady once training actually starts and the progress bar appears... I noticed you are using an AMD Ryzen 7 5800X3D CPU; I am not sure if it matters, and we don't have this CPU at hand...

Hi @sobomax, we can reproduce the memory increase for your case; the developer team is looking into it...

tye1 commented 3 months ago

@jzhoulon root-caused this host memory leak: it exists only for block-format tensors. We will provide a patch release based on IPEX v2.1.30+xpu soon.
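
Given that root cause, a tight loop over an op whose weights and activations go through oneDNN block-format tensors, such as a convolution on XPU, should make the host-side growth visible before the patch and stay flat after it (a verification sketch, assuming convolutions take the block-format path):

```
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

conv = torch.nn.Conv2d(3, 64, kernel_size=3).half().to("xpu")
x = torch.randn(8, 3, 224, 224, dtype=torch.half, device="xpu")

with torch.inference_mode():
    for _ in range(10000):  # monitor host RSS while this runs
        y = conv(x)
    torch.xpu.synchronize()
```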

tye1 commented 3 months ago

We have fixed the host memory leak issue in the IPEX v2.1.30+xpu patch release. The source code is available at https://github.com/intel/intel-extension-for-pytorch/commits/release/xpu/2.1.30/ (https://github.com/intel/intel-extension-for-pytorch/commit/57f2d8fcf265a1c6c29230cd9b864048fef25a54), and the binary wheel is available at https://pytorch-extension.intel.com/release-whl/stable/xpu/us/intel-extension-for-pytorch/. You may install it via:

python -m pip install torch==2.1.0.post2 torchvision==0.16.0.post2 torchaudio==2.1.0.post2 intel-extension-for-pytorch==2.1.30.post0 oneccl_bind_pt==2.1.300+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

Could you please help to verify in your scenarios? Thanks a lot. @Disty0 @sobomax
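
A quick way to confirm the patched wheels are the ones actually imported (expected version strings assumed from the pip command above):

```
import torch
import intel_extension_for_pytorch as ipex

print(torch.__version__)         # expected: 2.1.0.post2 (plus ABI suffix)
print(ipex.__version__)          # expected: 2.1.30.post0
print(torch.xpu.is_available())  # should be True on a working XPU setup
```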

sobomax commented 2 months ago

@tye1 thanks, I've tested the post2 build and it no longer seems to leak host memory. Thanks!

tye1 commented 2 months ago

@sobomax Great! Let me close this now. Feel free to reopen or create a new ticket if you still see the issue with the updated release. @Disty0