intel / caffe

This fork of BVLC/Caffe is dedicated to improving performance of this deep learning framework when running on CPU, in particular Intel® Xeon processors.

Segmentation Fault in Classification on different networks in parallel #230

Open MinhazPalasara opened 6 years ago

MinhazPalasara commented 6 years ago

Hi,

I have been trying to use Intel Caffe for a project that serves a deep model to multiple users in parallel. We have multiple copies of the model, each loaded from the same .caffemodel file, in case it matters; each copy is used by a single thread at a time. I have been getting a segmentation fault with this setup.

To reproduce this issue and rule out problems in my own project, I modified examples/cpp_classification/classification.cpp so that two threads create separate network instances and run classification on an image in parallel. Intel Caffe is compiled for a single node with the MKLDNN engine.

void callClassify(Classifier& classifier, string& file, int count) {
  std::cout << "---------- Prediction for " << file << " ----------" << std::endl;
  for (int j = 0; j < 100; j++) {
    cv::Mat img = cv::imread(file, -1);
    CHECK(!img.empty()) << "Unable to decode image " << file;
    std::vector<Prediction> predictions = classifier.Classify(img);
    /* Print the top N predictions. */
    for (size_t i = 0; i < predictions.size(); ++i) {
      Prediction p = predictions[i];
      std::cout << "thread: " << count << "  " << std::fixed
                << std::setprecision(4) << p.second << " - \"" << p.first
                << "\"" << std::endl;
    }
  }
}
int main(int argc, char** argv) {
  if (argc < 6) {
    std::cerr << "Usage: " << argv[0]
              << " deploy.prototxt network.caffemodel"
              << " mean.binaryproto labels.txt img.jpg [CAFFE|MKL2017|MKLDNN]" << std::endl;
    return 1;
  }

  ::google::InitGoogleLogging(argv[0]);

  string model_file   = argv[1];
  string trained_file = argv[2];
  string mean_file    = argv[3];
  string label_file   = argv[4];
  string file         = argv[5];
  string engine = "";
  if (argc > 6) {
    engine = argv[6];
  }

#ifdef USE_MLSL
  caffe::mn::init(&argc,&argv);
#endif

  Classifier classifier(model_file, trained_file, mean_file, label_file, engine);
  Classifier classifier1(model_file, trained_file, mean_file, label_file, engine);

  // Spawn two threads, each running callClassify() on its own Classifier.
  std::thread first(callClassify, std::ref(classifier), std::ref(file), 1);
  std::thread second(callClassify, std::ref(classifier1), std::ref(file), 2);

  first.join();
  second.join();
}
#else
int main(int argc, char** argv) {
  LOG(FATAL) << "This example requires OpenCV; compile with USE_OPENCV.";
}
#endif  // USE_OPENCV

On running classification I get:

[New Thread 0x7fffd4ff0700 (LWP 17143)]
---------- Prediction for examples/images/cat.jpg ----------
[New Thread 0x7fffcfbfe700 (LWP 17144)]
---------- Prediction for examples/images/cat.jpg ----------
[New Thread 0x7fffcf3fca00 (LWP 17145)]
[New Thread 0x7fffceffba80 (LWP 17146)]
[New Thread 0x7fffcebfab00 (LWP 17147)]
[New Thread 0x7fffce7f9b80 (LWP 17148)]
[New Thread 0x7fffce3f8c00 (LWP 17149)]
[New Thread 0x7fffcdff7c80 (LWP 17150)]
thread: 1  0.3134 - "n02123045 tabby, tabby cat"
thread: 1  0.2380 - "n02123159 tiger cat"
thread: 1  0.1235 - "n02124075 Egyptian cat"
thread: 1  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 1  0.0715 - "n02127052 lynx, catamount"
thread: 2  0.3134 - "n02123045 tabby, tabby cat"
thread: 2  0.2380 - "n02123159 tiger cat"
thread: 2  0.1235 - "n02124075 Egyptian cat"
thread: 2  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 2  0.0715 - "n02127052 lynx, catamount"
thread: 1  0.3134 - "n02123045 tabby, tabby cat"
thread: 1  0.2380 - "n02123159 tiger cat"
thread: 1  0.1235 - "n02124075 Egyptian cat"
thread: 1  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 1  0.0715 - "n02127052 lynx, catamount"
thread: 2  0.3134 - "n02123045 tabby, tabby cat"
thread: 2  0.2380 - "n02123159 tiger cat"
thread: 2  0.1235 - "n02124075 Egyptian cat"
thread: 2  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 2  0.0715 - "n02127052 lynx, catamount"
thread: 1  0.3134 - "n02123045 tabby, tabby cat"
thread: 1  0.2380 - "n02123159 tiger cat"
thread: 1  0.1235 - "n02124075 Egyptian cat"
thread: 1  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 1  0.0715 - "n02127052 lynx, catamount"
thread: 2  0.3134 - "n02123045 tabby, tabby cat"
thread: 2  0.2380 - "n02123159 tiger cat"
thread: 2  0.1235 - "n02124075 Egyptian cat"
thread: 2  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 2  0.0715 - "n02127052 lynx, catamount"
thread: 1  0.3134 - "n02123045 tabby, tabby cat"
thread: 1  0.2380 - "n02123159 tiger cat"
thread: 1  0.1235 - "n02124075 Egyptian cat"
thread: 1  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 1  0.0715 - "n02127052 lynx, catamount"
thread: 1  0.3134 - "n02123045 tabby, tabby cat"
thread: 1  0.2380 - "n02123159 tiger cat"
thread: 1  0.1235 - "n02124075 Egyptian cat"
thread: 1  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 1  0.0715 - "n02127052 lynx, catamount"
thread: 2  0.3134 - "n02123045 tabby, tabby cat"
thread: 2  0.2380 - "n02123159 tiger cat"
thread: 2  0.1235 - "n02124075 Egyptian cat"
thread: 2  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 2  0.0715 - "n02127052 lynx, catamount"
thread: 1  0.3134 - "n02123045 tabby, tabby cat"
thread: 1  0.2380 - "n02123159 tiger cat"
thread: 1  0.1235 - "n02124075 Egyptian cat"
thread: 1  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 1  0.0715 - "n02127052 lynx, catamount"
thread: 2  0.3134 - "n02123045 tabby, tabby cat"
thread: 2  0.2380 - "n02123159 tiger cat"
thread: 2  0.1235 - "n02124075 Egyptian cat"
thread: 2  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 2  0.0715 - "n02127052 lynx, catamount"
thread: 1  0.3134 - "n02123045 tabby, tabby cat"
thread: 1  0.2380 - "n02123159 tiger cat"
thread: 1  0.1235 - "n02124075 Egyptian cat"
thread: 1  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 1  0.0715 - "n02127052 lynx, catamount"
thread: 2  0.3134 - "n02123045 tabby, tabby cat"
thread: 2  0.2380 - "n02123159 tiger cat"
thread: 2  0.1235 - "n02124075 Egyptian cat"
thread: 2  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 2  0.0715 - "n02127052 lynx, catamount"
thread: 1  0.3134 - "n02123045 tabby, tabby cat"
thread: 1  0.2380 - "n02123159 tiger cat"
thread: 1  0.1235 - "n02124075 Egyptian cat"
thread: 1  0.1003 - "n02119022 red fox, Vulpes vulpes"
thread: 1  0.0715 - "n02127052 lynx, catamount"

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd4ff0700 (LWP 17143)]
0x00007fffec2f8a53 in mkldnn::impl::stream_eager_t::submit_impl(unsigned long, unsigned long, mkldnn_primitive**) ()
   from /home/ibmadmin/Repositories/caffe-1/external/mkldnn/install/lib/libmkldnn.so.0

Maybe I am missing some design details; any help would be appreciated.

matt-ny commented 6 years ago

@jgong5 @hshen14 any ideas on this issue?

Is it supposed to be safe to use Intel Caffe with MKLDNN in a multi-threaded application? We did not find any docs stating this definitively one way or the other. Any pointers you can give us would be helpful.

jgong5 commented 6 years ago

@matt-ny The problem should lie in the global stream handler, which is currently a singleton.

matt-ny commented 6 years ago

Thank you @jgong5 for your answer; however, I'm not sure I understand. Are you saying that with the latest MKLDNN code we should not invoke Classifier.Classify(img) in multiple threads, even on different classifiers?

Could you share a link to the definition in the code of the global stream handler singleton you mentioned?

Finally, is this also true of Intel Caffe + MKL2017? In our experience, BVLC Caffe w/ MKL has not had any problem with re-entrancy as long as the Net objects were different. Are there design docs which specify the differences between BVLC / Intel Caffe + MKL / Intel Caffe + MKLDNN?

thanks so much!

jgong5 commented 6 years ago

@matt-ny Please check "src/caffe/mkldnn_base.cpp". I am not sure about MKL2017 though.

coolbei commented 5 years ago

@jgong5 I ran into what may be the same problem, and I compiled MKLDNN with MKLDNN_ENABLE_CONCURRENT_EXEC=ON.

======= Backtrace: =========
/lib64/libc.so.6(+0x75366)[0x7f6fe2f2d366]
/home/xxx/intelcaffe/caffe/external/mkldnn/install/lib/libmkldnn.so.0(+0x62d9a)[0x7f6fe691bd9a]
/home/xxx/intelcaffe/caffe/external/mkldnn/install/lib/libmkldnn.so.0(+0x619ef)[0x7f6fe691a9ef]
/home/xxx/intelcaffe/caffe/external/mkldnn/install/lib/libmkldnn.so.0(mkldnn_stream_submit+0xe0)[0x7f6fe691ab30]
/home/xxx/intelcaffe/caffe/.build_release/examples/cpp_classification/../../lib/libcaffe.so.1.1.2(_ZN5caffe15MKLDNNPrimitiveIfE6submitEv+0x4f7)[0x7f6fe65011d7]

So is this caused by the same underlying issue? Can you tell me how to fix it? Thanks.