livepeer / go-livepeer

Official Go implementation of the Livepeer protocol
http://livepeer.org
MIT License

CUDA device mapping TensorFlow error when loading DNN model #1980

Open yondonfu opened 3 years ago

yondonfu commented 3 years ago
E tensorflow/core/common_runtime/session.cc:91] Failed to create session: Already exists: TensorFlow device (GPU:0) is being mapped to multiple CUDA devices (1 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not  currently supported, see https://github.com/tensorflow/tensorflow/issues/19083 
E tensorflow/c/c_api.cc:2184] Already exists: TensorFlow device (GPU:0) is being mapped to multiple CUDA devices (1 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not  currently supported, see https://github.com/tensorflow/tensorflow/issues/19083 
[dnn_tensorflow @ 0x7f57a4052500] Failed to create new session with model graph 
[dnn_tensorflow @ 0x7f57a4052500] Failed to load native model 
[livepeer_dnn @ 0x7f57a4058400] could not load DNN model 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator fatal error: unexpected signal during runtime execution 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x476085] 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator runtime stack: 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator runtime.throw(0x1f1d68a, 0x2a) 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator  /usr/local/go/src/runtime/panic.go:1116 +0x72 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator runtime.sigpanic() 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator  /usr/local/go/src/runtime/signal_unix.go:726 +0x4ac 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator goroutine 19245 [syscall]: 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator runtime.cgocall(0x11b4810, 0xc00091d810, 0x1d69e20) 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator  /usr/local/go/src/runtime/cgocall.go:133 +0x5b fp=0xc00091d7e0 sp=0xc00091d7a8 pc=0x4a64bb 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator github.com/livepeer/lpms/ffmpeg._Cfunc_lpms_transcode_new_with_dnn(0xc000a4f240, 0x0) 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator  _cgo_gotypes.go:370 +0x4a fp=0xc00091d810 sp=0xc00091d7e0 pc=0xd14b6a 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator github.com/livepeer/lpms/ffmpeg.NewTranscoderWithDetector.func5(0xc000a4f240, 0xc0007ad480) 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator  /go/pkg/mod/github.com/livepeer/lpms@v0.0.0-20210806125031-9fdbf80c8575/ffmpeg/ffmpeg.go:541 +0x4d fp=0xc00091d840 sp=0xc00091d810 pc=0xd1a2ed 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator github.com/livepeer/lpms/ffmpeg.NewTranscoderWithDetector(0x211dfc0, 0xc00012d720, 0x7ffda59a3db6, 0x1, 0x0, 0x0, 0x0) 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator  /go/pkg/mod/github.com/livepeer/lpms@v0.0.0-20210806125031-9fdbf80c8575/ffmpeg/ffmpeg.go:541 +0x245 fp=0xc00091d8e0 sp=0xc00091d840 pc=0xd17d25 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator github.com/livepeer/go-livepeer/core.NewNvidiaTranscoderWithDetector(0x211dfc0, 0xc00012d720, 0x7ffda59a3db6, 0x1, 0x1, 0x7ffda59a3db6, 0x1, 0xc000aae7b0) 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator  /build/core/transcoder.go:149 +0x4d fp=0xc00091d940 sp=0xc00091d8e0 pc=0x1103d6d 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator github.com/livepeer/go-livepeer/core.(*LoadBalancingTranscoder).createSession(0xc0002ca000, 0xc00060d860, 0x0, 0x0, 0x0) 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator  /build/core/lb.go:95 +0x389 fp=0xc00091da78 sp=0xc00091d940 pc=0x10f5789 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator github.com/livepeer/go-livepeer/core.(*LoadBalancingTranscoder).Transcode(0xc0002ca000, 0xc00060d860, 0x2e89680, 0x16ee98, 0x1ffe00) 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator  /build/core/lb.go:67 +0xd4 fp=0xc00091dae8 sp=0xc00091da78 pc=0x10f5254 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSeg(0xc0005cc420, 0x214a480, 0xc000e704c0, 0x214a480, 0xc000e704c0, 0xc000e70480, 0xc00060d860, 0x781ba5) 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator  /build/core/orchestrator.go:550 +0x3dc fp=0xc00091de80 sp=0xc00091dae8 pc=0x10fb8bc 
Aug 06 08:33:11 ewr-prod-livepeer-orchestrator-10 orchestrator github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSegmentLoop.func1(0x214a480, 0xc000e704c0, 0x214a480, 0xc000e704c0, 0xc0005cc420, 0xc00060d860, 0xc013210600, 0x214a480, 0xc000e704c0, 

This occurred when using the branch for https://github.com/livepeer/go-livepeer/pull/1979.

jailuthra commented 3 years ago

The commit https://github.com/livepeer/go-livepeer/pull/1979/commits/c17d16e113775f0957bcaefb41a65b66dd756ea6 should fix this in the short term by hardcoding TensorFlow models to always run on whatever the system calls GPU 0. This is consistent with what we did prior to #1979.
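
To illustrate the failure in the log and why pinning every model to one GPU sidesteps it, here is a toy Go model of the constraint. It is not TensorFlow code: `deviceRegistry` and `createSession` are hypothetical names, and the assumption (taken from the error message) is that one process keeps a single table from virtual TensorFlow GPU ids to CUDA devices.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// deviceRegistry is a hypothetical stand-in for the process-wide table that
// records which CUDA device backs each virtual TensorFlow GPU id.
type deviceRegistry struct {
	mapping map[int]int // virtual GPU id -> CUDA device ordinal
}

func newDeviceRegistry() *deviceRegistry {
	return &deviceRegistry{mapping: make(map[int]int)}
}

// createSession mimics creating a session with the given visible_device_list.
// The i-th entry of the list becomes virtual device GPU:i. A request that
// would remap an existing virtual id to a different CUDA device is rejected
// and leaves the registry unchanged.
func (r *deviceRegistry) createSession(visibleDeviceList string) error {
	fields := strings.Split(visibleDeviceList, ",")
	cudaIDs := make([]int, len(fields))
	for virtualID, field := range fields {
		cudaID, err := strconv.Atoi(strings.TrimSpace(field))
		if err != nil {
			return fmt.Errorf("invalid visible_device_list %q: %w", visibleDeviceList, err)
		}
		if prev, ok := r.mapping[virtualID]; ok && prev != cudaID {
			return fmt.Errorf("GPU:%d is being mapped to multiple CUDA devices (%d now, and %d previously)",
				virtualID, cudaID, prev)
		}
		cudaIDs[virtualID] = cudaID
	}
	// Commit only after the whole list has been validated.
	for virtualID, cudaID := range cudaIDs {
		r.mapping[virtualID] = cudaID
	}
	return nil
}

func main() {
	r := newDeviceRegistry()
	fmt.Println(r.createSession("0")) // first session: GPU:0 -> CUDA 0
	fmt.Println(r.createSession("1")) // second session asks for GPU:0 -> CUDA 1
	fmt.Println(r.createSession("0")) // pinned to device 0 again: no conflict
}
```

In this model, sessions that all request device 0 never conflict, which is the effect of the hardcoding in the commit.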

With the way we currently configure TensorFlow devices in FFmpeg, whatever device we pass through visible_device_list should be mapped to the virtual TensorFlow device "0".

According to the issue linked in the logs (https://github.com/tensorflow/tensorflow/issues/19083), this is now supported, so we should keep this issue open to investigate why it failed for us here.

yondonfu commented 3 years ago

TODOs