mlcommons / cm4mlops

A collection of portable, reusable and cross-platform automation recipes (CM scripts) with a human-friendly interface and minimal dependencies to make it easier to build, run, benchmark and optimize AI, ML and other applications and systems across diverse and continuously changing models, data sets, software and hardware (cloud/edge)
http://docs.mlcommons.org/cm4mlops/
Apache License 2.0

Feature request: Intel AI optimized ResNet50 v1.5 in CM/Docker #61

Open WarrenSchultz opened 1 month ago

WarrenSchultz commented 1 month ago

Please add Intel's implementation of ResNet50 v1.5 for CPU and XPU (desktop, mobile, Arc, Flex, Max, etc. optimizations) to the CM Docker-based workflow.

Thanks!

arjunsuresh commented 4 weeks ago

Sorry @WarrenSchultz, I couldn't update you on this. The Intel ResNet50 implementation was a bit difficult to add, but we are still working on it. I'll update you tomorrow.

WarrenSchultz commented 4 weeks ago

No worries, thanks for the update!

arjunsuresh commented 4 weeks ago

Hi @WarrenSchultz, unfortunately I couldn't complete it today. This Friday we are having a tutorial on CM for MLPerf inference, so I'll only be able to fix this next week. Please feel free to join the tutorial if you have time.

WarrenSchultz commented 1 week ago

Hi @arjunsuresh, I see that a PR was merged last week. Any update on getting this running (or, if it is implemented at some level, docs on how to do so)? Thanks!

arjunsuresh commented 1 week ago

Hi @WarrenSchultz, it is only partially done at the moment and we are working on it today. Hope to get it running by EOD.

arjunsuresh commented 1 week ago

It is still not working but we are nearly there.

WarrenSchultz commented 1 week ago

Great, thank you!

WarrenSchultz commented 1 week ago

I hate to ask before everything is checked in, but looking through the code in progress as it currently stands, I wanted to confirm that the Intel GPU (Arc, NPU, etc.) code is going to be supported in the CM implementation (e.g. https://www.intel.com/content/www/us/en/support/articles/000097597/processors.html ). Thanks!

arjunsuresh commented 1 week ago

No worries @WarrenSchultz. We are adding the official Intel code - AFAIK it only supports CPUs, and for most of the benchmarks only Intel 13th generation and above. If there is any Intel documentation for benchmarking on Intel GPUs, we can add it in CM.

We have finished adding the Intel ResNet50 implementation and are getting the expected performance - 5800 QPS on a 24-core machine. But the accuracy is only 1%, so we are debugging what is happening here.

WarrenSchultz commented 1 week ago

Thanks @arjunsuresh, this is the code I was referencing; I forgot to include it with the link above. We're working with mobile Arc GPUs, so I'm not positive it will work with the current hardware we're testing, but we have tested with datacenter-grade Intel GPUs in the past.

https://github.com/intel/models/tree/master/models_v2/pytorch/resnet50v1_5/inference/gpu

WarrenSchultz commented 1 week ago

Example hardware would be from the Core Ultra 9/7/etc series, which is supposed to support "OpenVINO™, WindowsML, DirectML, ONNX RT, WebGPU".

https://www.intel.com/content/www/us/en/products/sku/236849/intel-core-ultra-9-processor-185h-24m-cache-up-to-5-10-ghz/specifications.html

Thanks!

WarrenSchultz commented 1 week ago

> We have finished adding the Intel ResNet50 implementation and are getting the expected performance - 5800 QPS on a 24-core machine. But the accuracy is only 1%, so we are debugging what is happening here.

I missed this edit. Thanks for digging into it.

arjunsuresh commented 1 week ago

Thank you @WarrenSchultz for sharing the link. It should not be difficult to add that in CM, but I believe there is no MLPerf loadgen support for it from Intel - they might add it in the next round of inference submissions. You are looking to get MLPerf scores on Intel GPUs, right?

We are still checking the accuracy issue for Intel R50, but the workflow is now working in docker, so you can try it to get the performance numbers. Note that this needs PyTorch built from source.

cm run script --tags=run-mlperf,inference --implementation=intel --model=resnet50 --quiet --docker
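
The full set of supported options can be printed with --help; the --scenario flag in the second command follows the generic CM MLPerf wrapper and is shown only as an illustration - I haven't verified it against the Intel implementation yet:

cm run script --tags=run-mlperf,inference --help
cm run script --tags=run-mlperf,inference --implementation=intel --model=resnet50 --scenario=Offline --quiet --docker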

WarrenSchultz commented 1 week ago

@arjunsuresh Great, thanks, I'll give it a shot. Do you know offhand the memory requirements for the build?

arjunsuresh commented 1 week ago

Not exactly. We have tested on 64 GB systems, but we do have a lot of swap space. Most probably 32 GB should be sufficient. The accuracy issue is partially solved by using Intel's published quantization scales, so the remaining problem is in generating them locally.


Accuracy file: /home/cmuser/CM/repos/local/cache/f88c69500c934b67/test_results/9cfb69cca02c-intel-cpu-pytorch-vdefault-default_config/resnet50/offline/accuracy/accuracy.txt

accuracy=74.474%, good=37237, total=50000
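
On the memory question: available RAM and swap can be checked with standard Linux tools before kicking off the build (nothing CM-specific here), for example:

free -h          # total, used and available memory plus swap
swapon --show    # configured swap devices and their sizes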

WarrenSchultz commented 1 week ago

Oh, @arjunsuresh, just to confirm, is this from gateoverflow@cm4mlops, or is there a better repo?

arjunsuresh commented 1 week ago

Yes, that's the repo for the Intel MLPerf docker command, but this change is not pushed yet as it is not perfect. Today we have a presentation on CM scripts, so we won't be able to fix this issue. Hopefully it'll be resolved tomorrow.

Also, we now have pip install cm4mlops, which does cm pull repo mlcommons@cm4mlops --branch=mlperf-inference. That branch is a more stable version than gateoverflow@cm4mlops, which is where the MLPerf development happens. We normally sync the two every couple of days.
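
To spell out the two setup paths with the repo names mentioned above:

pip install cm4mlops                 # stable: pulls mlcommons@cm4mlops on the mlperf-inference branch
cm pull repo gateoverflow@cm4mlops   # development: where the MLPerf changes land first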

WarrenSchultz commented 1 week ago

Testing based on what is in the gateoverflow repo as of this moment: it worked fine on an older machine but failed on a Core Ultra 9. I teed the output to the attached buildlog.txt in case it helps.

arjunsuresh commented 1 week ago

@WarrenSchultz I believe the failure there is due to some microarchitecture-specific compilation. We are testing on Intel Sapphire Rapids - I will give it a try on an i9 now. The accuracy issue is now sorted as well.

arjunsuresh commented 1 week ago

Same issue here. It fails on i9 but works fine on Xeon.

(cm) arjun@intel-spr-i9:~/CM/repos/gateoverflow@cm4mlops/script$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   32
  On-line CPU(s) list:    0-31
Vendor ID:                GenuineIntel
  Model name:             13th Gen Intel(R) Core(TM) i9-13900K
    CPU family:           6
    Model:                183
    Thread(s) per core:   2
    Core(s) per socket:   24
    Socket(s):            1
    Stepping:             1
    CPU(s) scaling MHz:   17%
    CPU max MHz:          5800.0000
    CPU min MHz:          800.0000
    BogoMIPS:             5990.40
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    896 KiB (24 instances)
  L1i:                    1.3 MiB (24 instances)
  L2:                     32 MiB (12 instances)
  L3:                     36 MiB (1 instance)

arjun@arjun-spr:~/CM/repos/gateoverflow@cm4mlops/script/install-pytorch-from-src$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         52 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  48
  On-line CPU(s) list:   0-47
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) w7-2495X
    CPU family:          6
    Model:               143
    Thread(s) per core:  2
    Core(s) per socket:  24
    Socket(s):           1
    Stepping:            8
    CPU(s) scaling MHz:  62%
    CPU max MHz:         4800.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4992.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   1.1 MiB (24 instances)
  L1i:                   768 KiB (24 instances)
  L2:                    48 MiB (24 instances)
  L3:                    45 MiB (1 instance)

WarrenSchultz commented 1 week ago

Well, that's not awesome. Do you see a way around that on your side, or should I try to engage with Intel?

arjunsuresh commented 1 week ago

Our first priority is to reproduce the Intel submissions on Xeon, but I can try some compilation flags to see if it works on the i9. Do you have the CPU info of the old system where it ran fine?

WarrenSchultz commented 6 days ago

Sorry I missed this. The one that worked is an older Xeon.

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   64
  On-line CPU(s) list:    0-63
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
    CPU family:           6
    Model:                85
    Thread(s) per core:   2
    Core(s) per socket:   32
    Socket(s):            1
    Stepping:             7
    BogoMIPS:             5786.40
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_vnni md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:         VT-x
  Hypervisor vendor:      Microsoft
  Virtualization type:    full
Caches (sum of all):
  L1d:                    1 MiB (32 instances)
  L1i:                    1 MiB (32 instances)
  L2:                     32 MiB (32 instances)
  L3:                     22 MiB (1 instance)
Vulnerabilities:
  Gather data sampling:   Unknown: Dependent on hypervisor status
  Itlb multihit:          KVM: Mitigation: VMX disabled
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:               Mitigation; Enhanced IBRS
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                  Not affected
  Tsx async abort:        Mitigation; TSX disabled

arjunsuresh commented 6 days ago

Thank you @WarrenSchultz. It seems the Intel ResNet50 kernel code uses intrinsics that are written only for Xeon processors. I'm not familiar with the difference between the Xeon and i9 series. Passing a different "-march" flag didn't help because the intrinsic code is hardwired. Did Intel confirm that their code can run on desktop processors?

All the dependencies are working fine - only the final R50 kernel code is the problem.
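
A quick way to see the difference is to compare the CPU flags from the lscpu outputs above: the Xeon exposes avx512* and amx_* instructions while the i9-13900K lists neither, which is consistent with the kernel intrinsics being hardwired for those extensions. A plain-shell check (not CM-specific) would be:

lscpu | grep -oE 'avx512[a-z0-9_]*|amx_[a-z0-9_]*' | sort -u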

WarrenSchultz commented 6 days ago

Seeing if I can find an answer. Is the code you're sourcing the SPR-specific container code, or is the code the same? (Or is there an answer in their Docker Hub image configurations?)

arjunsuresh commented 6 days ago

This is the code. Looking at the official MLPerf results, Intel (including their server partners) has submitted results on only a couple of server configurations, so we can't easily get support from their MLPerf team for running on a different CPU.

WarrenSchultz commented 6 days ago

Got it, thanks. Looking into it. Also, the container did eventually run on an 11th gen i7 (but with 16 GB of RAM, it took an eternity to do so).

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  Model name:             11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
    CPU family:           6
    Model:                140
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             1
    BogoMIPS:             5990.42
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm avx512_vp2intersect md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:      Microsoft
  Virtualization type:    full
Caches (sum of all):
  L1d:                    192 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     5 MiB (4 instances)
  L3:                     12 MiB (1 instance)
Vulnerabilities:
  Gather data sampling:   Unknown: Dependent on hypervisor status
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Retbleed:               Mitigation; Enhanced IBRS
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                  Not affected
  Tsx async abort:        Not affected

arjunsuresh commented 6 days ago

Oh. So the R50 build worked fine on an 11th gen i7 but fails on a 13th gen i9? That's surprising...