mlcommons / inference_results_v3.0

This repository contains the results and code for the MLPerf™ Inference v3.0 benchmark.
https://mlcommons.org/en/inference-datacenter-30/
Apache License 2.0

Not able to regenerate R50 model on Intel Sapphire Rapids (using Intel code) #14

Open arjunsuresh opened 1 year ago

arjunsuresh commented 1 year ago

I'm following this README to run R50 (ResNet50) inference on an 8-core Intel Sapphire Rapids cloud instance. Unfortunately the accuracy is very poor (measured on the first 500 images) for both the offline and server scenarios. Can you please let me know what could have gone wrong here?

For bert-99, both performance and accuracy are as expected on an 8-core Sapphire Rapids instance.

arjunsuresh@instance-1:~/inference_results_v3.0/closed/Intel/code/resnet50/pytorch-cpu$ bash run_server_accuracy.sh
 [SUT] Creating instance 0
 [SUT] Instance 0 created on cores 0 1.
 [SUT] Creating instance 1
 [SUT] Instance 1 created on cores 2 3.
 [SUT] Creating instance 2
 [SUT] Instance 2 created on cores 4 5.
 [SUT] Creating instance 3
 [SUT] Instance 3 created on cores 6 7.
 Inference warmup for instance 0
 Inference warmup for instance 2
 Inference warmup for instance 3
 Inference warmup for instance 1
 STARTING TEST
 Exiting Batcher thread
 Exiting thread 140226764715584
 Exiting thread 140226781500992
 Exiting thread 140226773108288
 Exiting thread 140226756322880
 ===================================
         Evaluating Accuracy
 ===================================
accuracy=0.600%, good=3, total=500

arjunsuresh@instance-1:~/inference_results_v3.0/closed/Intel/code/resnet50/pytorch-cpu$ bash run_offline_accuracy.sh 8
Testing BATCH SIZE is  8
 [SUT] Creating instance 0
 [SUT] Instance 0 created on cores 0.
 Inference warmup for instance 0
 STARTING TEST
 Exiting thread 139999873193536
 ===================================
         Evaluating Accuracy
 ===================================
accuracy=1.000%, good=5, total=500
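
For reference, the "accuracy=..." line above is produced by an accuracy check equivalent to the reference checker in the MLPerf inference repo; a minimal sketch of invoking that checker by hand (file paths here are illustrative, not necessarily where the Intel scripts place them):

 python tools/accuracy-imagenet.py \
   --mlperf-accuracy-file mlperf_log_accuracy.json \
   --imagenet-val-file val_map.txt \
   --dtype int32
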
arjunsuresh commented 1 year ago

It seems the model generation was failing. By using the shared model file and skipping new calibration, we get the accuracy below:

arjunsuresh@instance-1:~/inference_results_v3.0/closed/Intel/code/resnet50/pytorch-cpu$ bash run_offline_accuracy.sh 256
Testing BATCH SIZE is  256
 [SUT] Creating instance 0
 [SUT] Instance 0 created on cores 0.
 Inference warmup for instance 0
 STARTING TEST
 Exiting thread 139979049121344
 ===================================
         Evaluating Accuracy
 ===================================
accuracy=74.568%, good=37284, total=50000
arjunsuresh commented 1 year ago

This is the error when generating the model:

bash generate_torch_model.sh
terminate called after throwing an instance of 'c10::Error'
  what():
Mismatch in kernel C++ signatures
  operator: torchvision::roi_align
    no debug info
  kernel 1: at::Tensor (at::Tensor const&, at::Tensor const&, double, c10::SymInt, c10::SymInt, long, bool)
    dispatch key: Autograd
    registered at /home/arjunsuresh/intel-mlperf-inference-code/closed/Intel/code/resnet50/pytorch-cpu/rn50-mlperf/vision/torchvision/csrc/ops/autograd/roi_align_kernel.cpp:157
  kernel 2: at::Tensor (at::Tensor const&, at::Tensor const&, double, long, long, long, bool)
    dispatch key: CPU
    registered at /home/arjunsuresh/intel-mlperf-inference-code/closed/Intel/code/resnet50/pytorch-cpu/rn50-mlperf/vision/torchvision/csrc/ops/cpu/roi_align_kernel.cpp:390

Exception raised from registerKernel at /home/arjunsuresh/intel-mlperf-inference-code/closed/Intel/code/resnet50/pytorch-cpu/rn50-mlperf/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:97 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7f48bc7c516e in /home/arjunsuresh/anaconda3/envs/rn50-mlperf/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f48bc79b0bd in /home/arjunsuresh/anaconda3/envs/rn50-mlperf/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::impl::OperatorEntry::registerKernel(c10::Dispatcher const&, c10::optional<c10::DispatchKey>, c10::KernelFunction, c10::optional<c10::impl::CppSignature>, std::unique_ptr<c10::FunctionSchema, std::default_delete<c10::FunctionSchema> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x253 (0x7f48abde6e63 in /home/arjunsuresh/anaconda3/envs/rn50-mlperf/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10::Dispatcher::registerImpl(c10::OperatorName, c10::optional<c10::DispatchKey>, c10::KernelFunction, c10::optional<c10::impl::CppSignature>, std::unique_ptr<c10::FunctionSchema, std::default_delete<c10::FunctionSchema> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x17e (0x7f48abddc9ce in /home/arjunsuresh/anaconda3/envs/rn50-mlperf/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::Library::_impl(char const*, torch::CppFunction&&) & + 0x886 (0x7f48abe158f6 in /home/arjunsuresh/anaconda3/envs/rn50-mlperf/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x17d6e (0x7f48948d8d6e in /home/arjunsuresh/anaconda3/envs/rn50-mlperf/lib/python3.9/site-packages/torchvision-0.16.0a0+9b82df4-py3.9-linux-x86_64.egg/torchvision/_C.so)
frame #6: <unknown function> + 0x647e (0x7f48bd6ab47e in /lib64/ld-linux-x86-64.so.2)
frame #7: <unknown function> + 0x6568 (0x7f48bd6ab568 in /lib64/ld-linux-x86-64.so.2)
frame #8: _dl_catch_exception + 0xe5 (0x7f48bd374c85 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0xdff6 (0x7f48bd6b2ff6 in /lib64/ld-linux-x86-64.so.2)
frame #10: _dl_catch_exception + 0x88 (0x7f48bd374c28 in /lib/x86_64-linux-gnu/libc.so.6)
frame #11: <unknown function> + 0xe34e (0x7f48bd6b334e in /lib64/ld-linux-x86-64.so.2)
frame #12: <unknown function> + 0x906bc (0x7f48bd2906bc in /lib/x86_64-linux-gnu/libc.so.6)
frame #13: _dl_catch_exception + 0x88 (0x7f48bd374c28 in /lib/x86_64-linux-gnu/libc.so.6)
frame #14: _dl_catch_error + 0x33 (0x7f48bd374cf3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #15: <unknown function> + 0x901ae (0x7f48bd2901ae in /lib/x86_64-linux-gnu/libc.so.6)
frame #16: dlopen + 0x48 (0x7f48bd290748 in /lib/x86_64-linux-gnu/libc.so.6)
frame #17: <unknown function> + 0x1623a (0x7f48bd1b423a in /home/arjunsuresh/anaconda3/envs/rn50-mlperf/lib/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so)
<omitting python frames>
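
For context, this c10::Error typically indicates that torchvision's compiled kernels were registered with a C++ signature that does not match the torch build in the environment, i.e. the torchvision in the env was built against a different torch. A quick hedged sanity check before regenerating the model (the one-liner is illustrative, not part of the Intel recipe):

 # If the builds are mismatched, the import itself typically fails with the
 # same dispatcher error shown above (torchvision/_C.so is loaded at import).
 python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"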
Madhumitha-MCW commented 1 year ago

Hi @arjunsuresh, I am trying to run the int8 model that I generated after fixing the "terminate called after throwing an instance of 'c10::Error'" error by building with a compatible version of torchvision, but I am facing issues reproducing the results: "run_server_accuracy.sh" gives 0.1% accuracy, while "run_offline_accuracy.sh" does not run at all and says the "mlperf_log_accuracy.json" is empty. Could you share the method you followed to get/generate the model file?
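
For reference, a hedged sketch of the kind of rebuild described above: check out a torchvision tag that matches the torch in the rn50-mlperf env and rebuild it from source. The tag below is only an example pairing for torch 1.12, not a version taken from the Intel recipe.

 cd rn50-mlperf/vision      # the torchvision source tree from the Intel setup
 git checkout v0.13.1       # example tag; pick the one matching your torch build
 python setup.py install    # rebuilds torchvision/_C.so against the active torch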

arjunsuresh commented 1 year ago

@Madhumitha-MCW Thank you for fixing the torchvision issue. I'll give that a try.

"run_offline_accuracy.sh" does not run at all and says the "mlperf_log_accuracy.json" is empty.

Can you please share the exact output here? IIRC, I didn't do anything special to run the model while following the Intel instructions -- only some run-config changes to scale down to a 4-core system.

Madhumitha-MCW commented 1 year ago

@arjunsuresh, thanks for responding. Below is a snapshot of the result of "run_offline_accuracy.sh" with BS=8: it just sets up the instances and then runs into an error. [screenshot omitted] This issue does not occur for BS=256 (which gives 0.1% accuracy).

Also, can you please tell me how you obtained the model, if not generated?

arjunsuresh commented 1 year ago

It says core dumped. Are you running on a 112-core system?

arjunsuresh commented 1 year ago

"Also can you please tell me how you obtained the model if not generated?"

It seems I was generating the model but without calibration.

Madhumitha-MCW commented 1 year ago

It says core dumped. Are you running on a 112-core system?

Yes. Should I make any other changes?

Also, regarding "It seems I was generating the model but without calibration": I find a calibration step only for the dataset and not for the model. Which calibration are you referring to?

arjunsuresh commented 1 year ago

Can you try without jemalloc? I believe this is the part I had to skip.

We'll redo these runs in CM automation this week so that they are easily reproducible without having to recall what exactly happened :)
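
A minimal sketch of what running "without jemalloc" could look like, assuming the run scripts pick jemalloc up through LD_PRELOAD (the exact mechanism in the Intel scripts may differ):

 # Clear any preloaded allocator before invoking the run script.
 unset LD_PRELOAD
 bash run_offline_accuracy.sh 8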

Madhumitha-MCW commented 1 year ago

Thank you. I tried without jemalloc, and the offline accuracy script with BS=8 now works without errors, though I still see low accuracy.

kamal-nabhan commented 1 year ago

Hi @arjunsuresh, apologies for taking the thread off-topic, but you mentioned that you were able to run bert-99 on Sapphire Rapids without any issues. I'm following this README and am facing the issue below when I run the prepare_env.sh script.

error: always_inline function '_tile_release' requires target feature 'amx-tile', but would be inlined into function 'set_config' that is compiled without support for 'amx-tile'
    _tile_release();

The error comes up when the script triggers the final ninja build command, which then fails to build mlperf_plugins.so.

Could you please let me know if you encountered any such issues while running the bert-99 model? Could you also share the dependency versions you used (GCC version and such)?
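
For context, _tile_release() is an AMX intrinsic, so the translation unit calling it must be compiled with the amx-tile target feature, which requires a recent compiler (GCC >= 11 or Clang >= 12). A hedged workaround sketch, assuming the ninja build picks up extra flags from CXXFLAGS (not verified against the Intel build scripts):

 # -mamx-tile/-mamx-int8/-mamx-bf16 are the standard GCC/Clang AMX flags;
 # whether CXXFLAGS reaches the failing build target is an assumption.
 export CXXFLAGS="-mamx-tile -mamx-int8 -mamx-bf16 ${CXXFLAGS}"
 bash prepare_env.sh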

arjunsuresh commented 12 months ago

@Madhumitha-MCW You're welcome. By low accuracy you mean <1%?

@kamal-nabhan yes, I was following that README and IIRC there were no such surprises for the bert run. We have in fact submitted that result to MLPerf inference v3.1, and the details should be public by next week, so we should have an easier README and automation by then. The system details we used are as follows:

"framework": "Intel inference implementation with CM API, Pytorch v1.12",
  "host_memory_capacity": "32G",
  "host_memory_configuration": "Error Correction Type: Multi-bit ECC; Type: RAM,   DIMM 0; Size: 16 GB,   DIMM 1; Size: 16 GB",
  "host_networking": "Gig Ethernet",
  "host_network_card_count": "1",
  "host_networking_topology": "N/A",
  "host_processor_caches": "L1d cache: 192 KiB (4 instances), L1i cache: 128 KiB (4 instances), L2 cache: 8 MiB (4 instances), L3 cache: 105 MiB (1 instance)",
  "host_processor_core_count": "4",
  "host_processor_frequency": "2.70GHz",
  "host_processor_interconnect": "",
  "host_processor_model_name": "Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz",
  "host_processors_per_node": "1",
  "host_storage_capacity": "316G",
  "host_storage_type": "SSD",
  "hw_notes": "",
  "number_of_nodes": "1",
  "operating_system": "Ubuntu 22.04 (linux-5.19.0-1030-gcp-glibc2.35)",
  "other_software_stack": "Python: 3.10.12, GCC-11.4.0",
  "status": "available",
  "submitter": "CTuning",
  "sw_notes": "Powered by MLCommons CM automation language and CK playground. ",
  "system_name": "Google Cloud Platform (c3.standard.8)",
Madhumitha-MCW commented 12 months ago

Yes @arjunsuresh. Accuracy of about 0.08%

Madhumitha-MCW commented 12 months ago

@arjunsuresh, can you please let me know how long "run_offline_accuracy.sh" is expected to take to finish for resnet50?

arjunsuresh commented 11 months ago

@Madhumitha-MCW sorry, I missed replying to you.

The offline accuracy run for resnet50 goes over all 50,000 inputs; at 20,000 QPS that is 50,000 / 20,000 = 2.5 seconds, so it should finish in a few seconds.

Meanwhile, the inference_results_v3.1 code is now public, so we can try that instead.

Madhumitha-MCW commented 11 months ago

Hi, thanks. Sure, I will try it out.