takuya-takeuchi / FaceRecognitionDotNet

The world's simplest facial recognition api for .NET on Windows, MacOS and Linux
MIT License

CudaException 77 #24

Closed turowicz closed 5 years ago

turowicz commented 5 years ago

Hey @takuya-takeuchi

Unfortunately, the error still exists. It happens when multiple threads try to do face recognition at the same time.

Stack trace:

info: People.Service[0]
      CUDA Error Lib:libDlibDotNet.Native.Dnn.so Code:77 D:10000 R:10000 M:Exception of type 'DlibDotNet.CudaException' was thrown..
fail: People.Service[0]
      Exception of type 'DlibDotNet.CudaException' was thrown.
DlibDotNet.CudaException: Exception of type 'DlibDotNet.CudaException' was thrown.
   at DlibDotNet.Dnn.Cuda.ThrowCudaException(ErrorType error)
   at DlibDotNet.Dnn.LossMmod.Operator[T](IEnumerable`1 images, UInt64 batchSize)
   at FaceRecognitionDotNet.Dlib.Python.CnnFaceDetectionModelV1.Detect(LossMmod net, Image image, Int32 upsampleNumTimes)
   at FaceRecognitionDotNet.FaceRecognition.RawFaceLocations(Image faceImage, Int32 numberOfTimesToUpsample, Model model)
   at FaceRecognitionDotNet.FaceRecognition.FaceLocations(Image image, Int32 numberOfTimesToUpsample, Model model)+MoveNext()
   at System.Collections.Generic.List`1.AddEnumerable(IEnumerable`1 enumerable)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
turowicz commented 5 years ago
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0 Off |                  N/A |
| 51%   66C    P2   182W / 200W |   2288MiB /  8119MiB |     42%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23907      C   dotnet                                      1138MiB |
|    0     24173      C   dotnet                                      1138MiB |
+-----------------------------------------------------------------------------+
turowicz commented 5 years ago

I think this is caused by two threads using two instances of FaceRecognition. I could make them share a single instance and lock access, but that would be much slower than parallel work.

takuya-takeuchi commented 5 years ago

Do you see the issue when using 2 processes? I don't think you would.

The Python face_recognition package uses multiprocessing. https://github.com/ageitgey/face_recognition/blob/master/face_recognition/face_recognition_cli.py

Python's multiprocessing module creates subprocesses to avoid the GIL performance issue. https://docs.python.org/3.6/library/multiprocessing.html#module-multiprocessing

Because each subprocess gets its own CUDA context, Python can access CUDA concurrently in a way that multiple threads in a single .NET process cannot.

turowicz commented 5 years ago

@takuya-takeuchi thanks for your response.

Multiprocess works fine. I'm having trouble with concurrency in a single process, though.

So far I've introduced an asynchronous lock (SemaphoreSlim.WaitAsync) and I no longer get the errors. This does impact performance, though, so I will need to either move to a multiprocess approach or look into solving the threading issues.
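For the multiprocess route, a minimal sketch under stated assumptions: `MultiProcessRunner`, the `Shard` helper, and the `Worker.dll` entry point are all illustrative names, not part of FaceRecognitionDotNet, and `Process.WaitForExitAsync` requires .NET 5 or later. The parent splits the image list into shards and launches one `dotnet` worker process per shard, so each worker creates its own FaceRecognition instance and its own CUDA context, and no intra-process lock is needed.

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

static class MultiProcessRunner
{
    // Split the image paths round-robin into one shard per worker process.
    public static List<List<string>> Shard(IReadOnlyList<string> paths, int workers)
    {
        var shards = Enumerable.Range(0, workers).Select(_ => new List<string>()).ToList();
        for (var i = 0; i < paths.Count; i++)
            shards[i % workers].Add(paths[i]);
        return shards;
    }

    // "Worker.dll" is a hypothetical per-shard worker; because each shard runs
    // in a separate process, each gets an isolated CUDA context.
    public static Task RunAsync(IEnumerable<List<string>> shards)
    {
        var waits = shards.Select(shard =>
        {
            var psi = new ProcessStartInfo("dotnet", $"Worker.dll {string.Join(' ', shard)}")
            {
                UseShellExecute = false
            };
            return Process.Start(psi).WaitForExitAsync();
        });
        return Task.WhenAll(waits);
    }
}
```

The round-robin split keeps shard sizes within one item of each other, which matters when worker runtime is dominated by image count.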

This answers the question; closing the issue.

atabora1 commented 4 years ago

@turowicz do you mind sharing how you got it to work with the locks?

turowicz commented 4 years ago

@atabora1

I've moved away from consuming the models in .NET and started using the NVIDIA Triton Inference Server. I now only make gRPC calls for inference.

If you want to keep hosting the model in .NET memory though, you can lock the process in the following way:

static readonly SemaphoreSlim Lock = new SemaphoreSlim(1, 1);

async Task Work(...)
{
    await Lock.WaitAsync();

    try
    {
        // inference code here
    }
    finally
    {
        Lock.Release();
    }
}
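To show the pattern above in action, here is a self-contained sketch; `GpuGate`, `DetectFaces`, and the `Task.Delay` stand-in for GPU work are illustrative names, not part of FaceRecognitionDotNet. The counter demonstrates that the semaphore serializes the inference section even when many tasks are started concurrently:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class GpuGate
{
    // One permit: only one task may enter the inference section at a time,
    // which keeps concurrent callers from racing on the CUDA context.
    static readonly SemaphoreSlim Lock = new SemaphoreSlim(1, 1);
    static int _inFlight;
    public static int MaxObserved;

    // DetectFaces stands in for the real face-detection call.
    public static async Task DetectFaces(int imageId)
    {
        await Lock.WaitAsync();
        try
        {
            var now = Interlocked.Increment(ref _inFlight);
            MaxObserved = Math.Max(MaxObserved, now);
            await Task.Delay(10); // stand-in for GPU inference
        }
        finally
        {
            Interlocked.Decrement(ref _inFlight);
            Lock.Release();
        }
    }
}
```

If you run, say, eight calls through `Task.WhenAll`, `MaxObserved` ends up at 1: the GPU section never executes concurrently, which is exactly why throughput drops compared with true parallel work.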
atabora1 commented 4 years ago

Thanks @turowicz for the new ideas. I've been using 4 threads instead of Tasks and I'm noticing a major performance improvement.