stanford-futuredata / noscope

Accelerating network inference over video
http://dawn.cs.stanford.edu/2017/06/22/noscope/

GTX1070:CUDA Error: out of memory #17

Open Megatron2032 opened 7 years ago

Megatron2032 commented 7 years ago

GTX 1070 with 7.9 GB of memory: when I run run_optimizerset.sh, train_9180_18360.log shows the following errors.

train_9180_18360.log:
2017-08-28 17:31:50.229531: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-28 17:31:50.351945: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-08-28 17:31:50.352224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.8225
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.31GiB
2017-08-28 17:31:50.352235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-08-28 17:31:50.352240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-08-28 17:31:50.352249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
layer     filters    size              input                output
    0 CUDA Error: out of memory: File exists
CUDA Error: out of memory

Arsey commented 7 years ago

I'm getting the same error even with a 1000-frame video.

ddkang commented 7 years ago

The system is optimized for a P100 GPU with 16 GB of memory. This diff is confirmed to work on a K80; on your card you may need to change 0.8 to something much lower:

diff --git a/tensorflow/noscope/noscope.cc b/tensorflow/noscope/noscope.cc
index 4cd6a14..98b80e2 100644
--- a/tensorflow/noscope/noscope.cc
+++ b/tensorflow/noscope/noscope.cc
@@ -60,7 +60,7 @@ static tensorflow::Session* InitSession(const std::string& graph_fname) {
   tensorflow::SessionOptions opts;
   tensorflow::GraphDef graph_def;
   // YOLO needs some memory
-  opts.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.9);
+  opts.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.8);
   // opts.config.mutable_gpu_options()->set_allow_growth(true);
   tensorflow::Status status = NewSession(opts, &session);
   TF_CHECK_OK(status);

I'd be happy to merge a pull request that automatically detects the amount of memory necessary for YOLOv2 as a fraction of the available GPU memory.
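
A rough sketch of what that auto-detection might look like (untested, and not code from this repo; GuessMemoryFraction and kYoloMemBytes are placeholder names, and the YOLOv2 memory bound would need to be measured): it queries free GPU memory with cudaMemGetInfo and derives the fraction from an assumed upper bound on YOLOv2's working set.

#include <algorithm>
#include <cuda_runtime.h>

// Sketch: pick the GPU memory fraction at runtime instead of hard-coding 0.8/0.9.
// kYoloMemBytes is an assumed upper bound on YOLOv2's working set; tune it per model.
static double GuessMemoryFraction() {
  const size_t kYoloMemBytes = 6ULL << 30;     // assume roughly 6 GiB for YOLOv2 at 608x608
  size_t free_bytes = 0, total_bytes = 0;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess || free_bytes == 0)
    return 0.8;                                // fall back to the value from the diff above
  const double frac = static_cast<double>(kYoloMemBytes) / static_cast<double>(free_bytes);
  return std::min(0.95, std::max(0.3, frac));  // clamp to a sane range
}

// In InitSession():
//   opts.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(GuessMemoryFraction());

Another option is to enable the set_allow_growth(true) line that is commented out in the diff above, which makes TensorFlow allocate GPU memory on demand instead of reserving a fixed fraction up front.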

Arsey commented 7 years ago

0.8 works fine for the GTX 1070 and the memory error is gone, but now I'm getting Segmentation fault (core dumped). What could it be?

Update: the same issue occurs with yolo9000 and tiny-yolo.

ddkang commented 7 years ago

Please paste the full output log from the run.

Arsey commented 7 years ago

(noscope) arsey@ml-machine:~/noscope/data/experiments/jackson-town-square/train/jackson-town-square_convnet_128_32_2.pb-non_blocked_mse.src$ ./run_optimizerset.sh 1
2017-08-29 20:17:53.261228: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-29 20:17:53.390616: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-08-29 20:17:53.391159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties: 
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.7465
pciBusID 0000:05:00.0
Total memory: 7.92GiB
Free memory: 7.83GiB
2017-08-29 20:17:53.391171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 
2017-08-29 20:17:53.391175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y 
2017-08-29 20:17:53.391180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:05:00.0)
layer     filters    size              input                output
    0 conv     32  3 x 3 / 1   608 x 608 x   3   ->   608 x 608 x  32
    1 max          2 x 2 / 2   608 x 608 x  32   ->   304 x 304 x  32
    2 conv     64  3 x 3 / 1   304 x 304 x  32   ->   304 x 304 x  64
    3 max          2 x 2 / 2   304 x 304 x  64   ->   152 x 152 x  64
    4 conv    128  3 x 3 / 1   152 x 152 x  64   ->   152 x 152 x 128
    5 conv     64  1 x 1 / 1   152 x 152 x 128   ->   152 x 152 x  64
    6 conv    128  3 x 3 / 1   152 x 152 x  64   ->   152 x 152 x 128
    7 max          2 x 2 / 2   152 x 152 x 128   ->    76 x  76 x 128
    8 conv    256  3 x 3 / 1    76 x  76 x 128   ->    76 x  76 x 256
    9 conv    128  1 x 1 / 1    76 x  76 x 256   ->    76 x  76 x 128
   10 conv    256  3 x 3 / 1    76 x  76 x 128   ->    76 x  76 x 256
   11 max          2 x 2 / 2    76 x  76 x 256   ->    38 x  38 x 256
   12 conv    512  3 x 3 / 1    38 x  38 x 256   ->    38 x  38 x 512
   13 conv    256  1 x 1 / 1    38 x  38 x 512   ->    38 x  38 x 256
   14 conv    512  3 x 3 / 1    38 x  38 x 256   ->    38 x  38 x 512
   15 conv    256  1 x 1 / 1    38 x  38 x 512   ->    38 x  38 x 256
   16 conv    512  3 x 3 / 1    38 x  38 x 256   ->    38 x  38 x 512
   17 max          2 x 2 / 2    38 x  38 x 512   ->    19 x  19 x 512
   18 conv   1024  3 x 3 / 1    19 x  19 x 512   ->    19 x  19 x1024
   19 conv    512  1 x 1 / 1    19 x  19 x1024   ->    19 x  19 x 512
   20 conv   1024  3 x 3 / 1    19 x  19 x 512   ->    19 x  19 x1024
   21 conv    512  1 x 1 / 1    19 x  19 x1024   ->    19 x  19 x 512
   22 conv   1024  3 x 3 / 1    19 x  19 x 512   ->    19 x  19 x1024
   23 conv   1024  3 x 3 / 1    19 x  19 x1024   ->    19 x  19 x1024
   24 conv   1024  3 x 3 / 1    19 x  19 x1024   ->    19 x  19 x1024
   25 route  16
   26 conv     64  1 x 1 / 1    38 x  38 x 512   ->    38 x  38 x  64
   27 reorg              / 2    38 x  38 x  64   ->    19 x  19 x 256
   28 route  27 24
   29 conv   1024  3 x 3 / 1    19 x  19 x1280   ->    19 x  19 x1024
   30 conv    425  1 x 1 / 1    19 x  19 x1024   ->    19 x  19 x 425
   31 detection
Loading weights from /home/arsey/projects/darknet/yolo.weights...Done!
Dumping video
./run_optimizerset.sh: line 36: 12270 Segmentation fault      (core dumped) /home/arsey/noscope/tensorflow-noscope/bazel-bin/tensorflow/noscope/noscope --diff_thresh=0 --distill_thresh_lower=0 --distill_thresh_upper=0 --skip_small_cnn=0 --skip_diff_detection=0 --skip=30 --avg_fname=/home/arsey/noscope/data/cnn-avg/jackson-town-square.txt --graph=/home/arsey/noscope/data/cnn-models/jackson-town-square_convnet_128_32_2.pb --video=/home/arsey/noscope/data/videos/jackson-town-square.mp4 --yolo_cfg=/home/arsey/projects/darknet/cfg/yolo.cfg --yolo_weights=/home/arsey/projects/darknet/yolo.weights --yolo_class=2 --confidence_csv=/home/arsey/noscope/data/experiments/jackson-town-square/train/jackson-town-square_convnet_128_32_2.pb-non_blocked_mse.src/train_${START_FRAME}_${END_FRAME}.csv --start_from=${START_FRAME} --nb_frames=$LEN --dumped_videos=/home/arsey/noscope/data/video-cache/jackson-town-square_0_250_1.bin --diff_detection_weights=/dev/null --use_blocked=0 --ref_image=0

real    0m2.665s
user    0m2.176s
sys     0m0.620s

Arsey commented 7 years ago

Any thoughts?

Megatron2032 commented 7 years ago

Thanks, 0.8 works for the 1070. But there is still a problem: an out-of-memory error. My computer has 8 GB of memory, and when I run motherdog.py with a high number of frames or a low target_fp, the problem appears. I want to run motherdog with 918000 frames and a low target_fp; how should I change the code?

Arsey commented 7 years ago

The segmentation fault was related to the wrong number of frames being set for training (250) in noscope_motherdog.py, while the video has 30 frames per second. The error appeared in noscope_data.cc inside this for loop:

  for (size_t i = 0; i < kNbFrames; i++) {
    // Only every kSkip_-th frame is resized and copied into the preallocated buffers.
    cap >> frame;
    if (i % kSkip_ == 0) {
      std::cout << "frame: " << i << "\n";
      const size_t ind = i / kSkip_;
      cv::resize(frame, yolo_frame, NoscopeData::kYOLOResol_, 0, 0, cv::INTER_NEAREST);
      cv::resize(frame, diff_frame, NoscopeData::kDiffResol_, 0, 0, cv::INTER_NEAREST);
      cv::resize(frame, dist_frame, NoscopeData::kDistResol_, 0, 0, cv::INTER_NEAREST);
      dist_frame.convertTo(dist_frame_f, CV_32FC3);

      if (!yolo_frame.isContinuous()) {
        throw std::runtime_error("yolo frame is not continuous");
      }
      if (!diff_frame.isContinuous()) {
        throw std::runtime_error("diff frame is not continuous");
      }
      if (!dist_frame.isContinuous()) {
        throw std::runtime_error("dist frame is not conintuous");
      }
      if (!dist_frame_f.isContinuous()) {
        throw std::runtime_error("dist frame f is not continuous");
      }

      memcpy(&yolo_data_[ind * kYOLOFrameSize_], yolo_frame.data, kYOLOFrameSize_);
      memcpy(&diff_data_[ind * kDiffFrameSize_], diff_frame.data, kDiffFrameSize_);
      memcpy(&dist_data_[ind * kDistFrameSize_], dist_frame_f.data, kDistFrameSize_ * sizeof(float));
    }
  }
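
When kNbFrames is larger than the number of frames actually in the video, cap >> frame starts returning empty frames, and the cv::resize/memcpy calls above then crash. A minimal guard (not in the upstream code) placed right after cap >> frame; would fail with a clear message instead of a segfault:

    // Hypothetical guard, not in noscope_data.cc: abort cleanly instead of
    // segfaulting when the requested frame range runs past the end of the video.
    if (frame.empty()) {
      throw std::runtime_error("ran out of frames at index " + std::to_string(i) +
                               "; check the training frame count against the video length");
    }
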
ddkang commented 7 years ago

Unfortunately, the codebase currently assumes videos are 30 FPS.
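
If you are not sure whether a video matches that assumption, a quick standalone check with OpenCV (the same library noscope_data.cc already uses; the property names below are from OpenCV 3) looks roughly like this:

// Sanity-check tool (not part of noscope): print a video's FPS and frame count so the
// training range in noscope_motherdog.py can be sized to fit the actual video length.
#include <iostream>
#include <opencv2/opencv.hpp>

int main(int argc, char** argv) {
  if (argc < 2) { std::cerr << "usage: " << argv[0] << " <video>\n"; return 1; }
  cv::VideoCapture cap(argv[1]);
  if (!cap.isOpened()) { std::cerr << "cannot open " << argv[1] << "\n"; return 1; }
  const double fps    = cap.get(cv::CAP_PROP_FPS);
  const double frames = cap.get(cv::CAP_PROP_FRAME_COUNT);
  std::cout << "fps: " << fps << ", frames: " << frames << "\n";
  // The pipeline assumes 30 FPS; if fps differs, or the requested frame count exceeds
  // frames, the dump loop in noscope_data.cc will read empty frames and crash.
  return 0;
}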

Megatron2032 commented 7 years ago

I have 8 GB of memory, so I use 270000 frames and run run_optimizerset.sh separately in four steps. In the end, it works.