intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization
https://github.com/intel/neural-speed
Apache License 2.0

Performance on Xeon Scalable #284

Open · regmibijay opened this issue 4 months ago

regmibijay commented 4 months ago

Hello everyone, we are seeing slower-than-expected inference times on one of our CPU nodes with an Intel(R) Xeon(R) Platinum 8362 CPU @ 2.80GHz, which exposes the following instruction-set flags:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg rdpid fsrm md_clear flush_l1d arch_capabilities
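
As an aside for anyone triaging this: the kernel path neural-speed selects depends on the ISA the node exposes, and the Platinum 8362 is Ice Lake-SP, so the amx_* flags are absent above and int8 compute runs on AVX512-VNNI rather than AMX. A generic sketch to check the relevant flags (not a neural-speed utility):

```python
# Generic sketch: list which ISA extensions relevant to low-bit LLM kernels
# this node exposes. Not part of neural-speed; Linux-only (/proc/cpuinfo).
relevant = ["avx2", "avx512f", "avx512bw", "avx512_vnni", "amx_bf16", "amx_int8"]

with open("/proc/cpuinfo") as f:
    # Take the flag list from the first "flags" line.
    flags = set(f.read().split("flags", 1)[1].split("\n", 1)[0].split())

for isa in relevant:
    print(f"{isa}: {'present' if isa in flags else 'absent'}")
```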

We are running the latest versions of neuralchat_server and neural-speed in combination with intel-extension-for-transformers, using the following config:

host: "0.0.0.0"
port: 8000
model_name_or_path: "/root/Intel/neural-chat-7b-v3-3"
device: cpu
tasks_list: ["textchat"]

optimization:
  use_neural_speed: true
  optimization_type: weight_only
  compute_dtype: fp32
  weight_dtype: int8

We are seeing an extremely slow time to first token with example prompts such as "Tell me about Intel Xeon Scalable Processors."

We measured the following response times:

| Weight Precision | Max Tokens | Response Time |
| ---------------- | ---------- | ------------- |
| Int8             | unset      | 73s           |
| Int8             | 128        | 69s           |
| Int4             | unset      | 73s           |
| Int4             | 128        | 65s           |

Without neural-speed quantization of this model, inference times are only around 20s.
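
For reference, the response times above are end-to-end wall-clock measurements against the running server. A minimal script along these lines reproduces them; the /v1/chat/completions route and payload shape are assumptions based on NeuralChat's OpenAI-compatible API, so adjust them to match the actual deployment:

```python
import time

import requests

# Route and payload are assumptions (OpenAI-compatible NeuralChat API);
# adjust to the deployment's actual endpoint.
URL = "http://127.0.0.1:8000/v1/chat/completions"
payload = {
    "model": "/root/Intel/neural-chat-7b-v3-3",
    "messages": [{"role": "user",
                  "content": "Tell me about Intel Xeon Scalable Processors."}],
    "max_tokens": 128,
}

start = time.perf_counter()
response = requests.post(URL, json=payload, timeout=300)
elapsed = time.perf_counter() - start
print(f"HTTP {response.status_code}, end-to-end response time: {elapsed:.1f}s")
```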

Is there any misconfiguration on our part?

I would love to hear your feedback and appreciate any help.

luoyu-intel commented 4 months ago

Could you try neural-speed alone with this model? It may not be an issue with neural-speed itself.
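
A minimal standalone run, adapted from the Python API example in the neural-speed repo; the model path and dtypes mirror the config above, and exact keyword names may vary between versions:

```python
from transformers import AutoTokenizer, TextStreamer

from neural_speed import Model

model_name = "/root/Intel/neural-chat-7b-v3-3"  # local path from the config above
prompt = "Tell me about Intel Xeon Scalable Processors."

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)  # prints tokens as they are generated

# weight_dtype/compute_dtype mirror the weight_only settings from the report.
model = Model()
model.init(model_name, weight_dtype="int8", compute_dtype="fp32")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=128)
```

If this run is also slow, the problem is inside neural-speed's kernels; if it is fast, the overhead is likely coming from the serving stack on top of it.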