livepeer / go-livepeer

Official Go implementation of the Livepeer protocol
http://livepeer.org
MIT License
546 stars 171 forks source link

Malicious Stream Causing GPU Crashes and Node Disruptions #2970

Open FranckUltima opened 8 months ago

FranckUltima commented 8 months ago

Several orchestrators have noticed, in recent weeks, the existence of one or more streams causing GPU crashes (cuda error) which leads to the loss of all running streams, and even to the crash of the transcoder/node. The stream then switches to another orchestrator, causing the same results.

I am concerned about the impact of this kind of event on the quality of the network, and the possibility that such streams could also be voluntarily injected into the network.

Following the discussion yesterday evening on the watercooler chat, Doug considered the possibility that the Broadcaster node could identify the stream and delete it automatically to prevent it from spreading, but if the stream was injected voluntarily, it might be better to be able to detect it and block it at the orchestrator level.

I have not been able to identify the responsible stream, and this event remains rather rare, but when it happens, several orchestrators usually report it on Discord.

thomshutt commented 8 months ago

@FranckUltima I'd assume for now that it's a problem with the stream rather than something malicious but let's try to get to the bottom of it as soon as w e can. Please update the ticket once you have more details

eliteprox commented 8 months ago

Here's logs for a stream that crashed both transcoder processes behind my orchestrator

Transcoder

[AVHWDeviceContext @ 0x7f4ae3d9c600] cu->cuCtxCreate(&hwctx->cuda_ctx, desired_flags, hwctx->internal->cuda_device) failed -> CUDA_ERROR_NOT_PERMITTED: operation not permitted
ERROR: decoder.c:313] Unable to open hardware context for decoding : Unknown error occurred
ERROR: decoder.c:348] Unable to open video decoder : Error number -1448234581 occurred
E0306 06:53:47.127451       1 ffmpeg.go:986] Transcoder Return : Unrecoverable state, restart process
E0306 06:53:47.127523       1 ot_rpc.go:197] manifestID=c632474d-50b3-4717-bbfc-03d7994f9fc4 seqNo=13738 orchSessionID=dffd916c taskId=560581 Transcoding done for taskId=560581 url=https://100.4.105.177:8933/stream/dffd916c/13738.tempfile dur=45.193225ms err="Unrecoverable state, restart process"
E0306 06:53:47.127537       1 ot_rpc.go:252] manifestID=c632474d-50b3-4717-bbfc-03d7994f9fc4 seqNo=13738 orchSessionID=dffd916c taskId=560581 Unable to transcode err="Unrecoverable state, restart process"
panic: Unrecoverable state, restart process

goroutine 554202 [running]:
github.com/livepeer/go-livepeer/server.runTranscode.func2()
        /src/server/ot_rpc.go:200 +0x2a
github.com/livepeer/go-livepeer/server.runTranscode(0xc000254820, {0xc00060e008, 0x19}, 0xc00081c000?, 0xc000730540)
        /src/server/ot_rpc.go:203 +0x1002
github.com/livepeer/go-livepeer/server.runTranscoder.func2()
        /src/server/ot_rpc.go:135 +0x36
created by github.com/livepeer/go-livepeer/server.runTranscoder
        /src/server/ot_rpc.go:134 +0x7ab

Orchestrator

E0306 06:53:46.476627       1 ot_rpc.go:361] Transcoding error for taskId=560580 err="Unrecoverable state, restart process"
E0306 06:53:46.476665       1 orchestrator.go:846] manifestID=manifest-chlrgw18 seqNo=0 orchSessionID=8aa951b6 clientIP=79.127.219.2 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Successfully received results from remote transcoder=172.18.0.1:49808 segments=0 taskId=560580 fname=https://100.4.105.177:8933/stream/8aa951b6/0.tempfile dur=29.021356ms err="Unrecoverable state, restart process"
E0306 06:53:46.476676       1 orchestrator.go:592] manifestID=manifest-chlrgw18 seqNo=0 orchSessionID=8aa951b6 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E clientIP=79.127.219.2 Error transcoding segName=https://100.4.105.177:8933/stream/8aa951b6/0.tempfile err="Unrecoverable state, restart process"
E0306 06:53:46.476818       1 segment_rpc.go:230] manifestID=manifest-chlrgw18 seqNo=0 orchSessionID=8aa951b6 clientIP=79.127.219.2 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Could not transcode err="Unrecoverable state, restart process"
E0306 06:53:46.612836       1 orchestrator.go:927] Stream closed for transcoder=172.18.0.1:49808, err="context canceled"
I0306 06:53:46.612858       1 orchestrator.go:945] Got transcoder=172.18.0.1:49808 eof, removing from live transcoders map
E0306 06:53:47.127783       1 ot_rpc.go:361] Transcoding error for taskId=560581 err="Unrecoverable state, restart process"
E0306 06:53:47.127835       1 orchestrator.go:846] manifestID=c632474d-50b3-4717-bbfc-03d7994f9fc4 seqNo=13738 orchSessionID=dffd916c clientIP=79.127.199.49 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Successfully received results from remote transcoder=172.18.0.1:49792 segments=0 taskId=560581 fname=https://100.4.105.177:8933/stream/dffd916c/13738.tempfile dur=53.993053ms err="Unrecoverable state, restart process"
E0306 06:53:47.127848       1 orchestrator.go:592] manifestID=c632474d-50b3-4717-bbfc-03d7994f9fc4 seqNo=13738 orchSessionID=dffd916c sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E clientIP=79.127.199.49 Error transcoding segName=https://100.4.105.177:8933/stream/dffd916c/13738.tempfile err="Unrecoverable state, restart process"
E0306 06:53:47.128030       1 segment_rpc.go:230] manifestID=c632474d-50b3-4717-bbfc-03d7994f9fc4 seqNo=13738 orchSessionID=dffd916c clientIP=79.127.199.49 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Could not transcode err="Unrecoverable state, restart process"
E0306 06:53:47.276797       1 orchestrator.go:927] Stream closed for transcoder=172.18.0.1:49792, err="context canceled"
papabear99 commented 8 months ago

Scrolling back through my logs, I found this on my NYC combined OT (running in docker) that caused the container to restart

I0305 02:46:39.839962       1 orchestrator.go:612] manifestID=ac901a60-bab6-4dd9-9a59-15626f3ca54c seqNo=28159 orchSessionID=26e483d7 clientIP=79.127.221.98 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Transcoded segment profile=720p bytes=147392
[AVHWDeviceContext @ 0x7f8d0d453c00] cu->cuCtxCreate(&hwctx->cuda_ctx, desired_flags, hwctx->internal->cuda_device) failed -> CUDA_ERROR_NOT_PERMITTED: operation not permitted
ERROR: decoder.c:313] Unable to open hardware context for decoding : Unknown error occurred
ERROR: decoder.c:348] Unable to open video decoder : Error number -1448234581 occurred
E0305 02:46:39.860655       1 ffmpeg.go:990] Transcoder Return : Unrecoverable state, restart process
I0305 02:46:39.862226       1 lb.go:216] manifestID=manifest-ob5vqh6p seqNo=979 orchSessionID=f471acc1 clientIP=79.127.219.9 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E LB: Stopping transcoder due to error for key=f471acc1_0
I0305 02:46:39.862265       1 lb.go:139] manifestID=manifest-ob5vqh6p seqNo=979 orchSessionID=f471acc1 clientIP=79.127.219.9 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E LB: Deleted transcode session for key=f471acc1_0
panic: Unrecoverable state, restart process
goroutine 15397716 [running]:
github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSeg(0xc000176000, {0x89d2d0, 0xc00ba64840}, {{0x8a39d8?, 0xc00165f340?}, {0x8a39d8?, 0xc00165f340?}}, 0xc00165f300, 0xc00f5809a0)
    /src/core/orchestrator.go:579 +0xb50
github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSegmentLoop.func1()
    /src/core/orchestrator.go:671 +0x9b
created by github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSegmentLoop
    /src/core/orchestrator.go:656 +0x47a
Tibia2000 commented 8 months ago

Here are mine from some time ago:


Feb 20 17:05:56  livepeer[2796953]: I0220 17:05:56.699107 2796953 lb.go:123] manifestID=369082d3-0c72-4a77-8169-bb462bbbd794 seqNo=12248 orchSessionID=6d5f998a clientIP=143.244.60.233 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E LB: Deleted transcode session for key=6d5f998a_0
Feb 20 17:05:56  livepeer[2796953]: panic: Unrecoverable state, restart process
Feb 20 17:05:56  livepeer[2796953]: goroutine 1925973 [running]:
Feb 20 17:05:56  livepeer[2796953]: github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSeg(0xc000192680, {0x8c26b0, 0xc000d493b0}, {{0x8c8e10?, 0xc0009bed00?}, {0x8c8e10?, 0xc0009bed00?}}, 0xc0009becc0, 0xc001664480)
Feb 20 17:05:56 livepeer[2796953]:         /__w/go-livepeer/go-livepeer/core/orchestrator.go:590 +0xb45
Feb 20 17:05:56  livepeer[2796953]: github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSegmentLoop.func1()
Feb 20 17:05:56  livepeer[2796953]:         /__w/go-livepeer/go-livepeer/core/orchestrator.go:682 +0x9b
Feb 20 17:05:56 livepeer[2796953]: created by github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSegmentLoop
Feb 20 17:05:56  livepeer[2796953]:         /__w/go-livepeer/go-livepeer/core/orchestrator.go:667 +0x41f````
Tibia2000 commented 8 months ago

USA again:


Mar 12 18:39:46  livepeer[1406531]: [AVHWDeviceContext @ 0x7f44f77524c0] cu->cuCtxCreate(&hwctx->cuda_ctx, desired_flags, hwctx->internal->cuda_device) failed -> CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Mar 12 18:39:46  livepeer[1406531]: ERROR: decoder.c:313] Unable to open hardware context for decoding : Unknown error occurred
Mar 12 18:39:46  livepeer[1406531]: ERROR: decoder.c:348] Unable to open video decoder : Error number -1448234581 occurred
Mar 12 18:39:46  livepeer[1406531]: E0312 18:39:46.694127 1406531 ffmpeg.go:986] Transcoder Return : Unrecoverable state, restart process
Mar 12 18:39:46  livepeer[1406531]: I0312 18:39:46.694208 1406531 lb.go:200] manifestID=ecd66igl6cduxdga seqNo=6 orchSessionID=c0ff46a0 clientIP=143.244.61.193 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E LB: Stopping transcoder due to error for key=c0ff46a0_0```
Tibia2000 commented 8 months ago
Mar 12 18:31:18  livepeer[1057068]: [AVHWDeviceContext @ 0x7fc6dd293100] cu->cuCtxCreate(&hwctx->cuda_ctx, desired_flags, hwctx->internal->cuda_device) failed -> CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Mar 12 18:31:18  livepeer[1057068]: ERROR: decoder.c:313] Unable to open hardware context for decoding : Unknown error occurred
Mar 12 18:31:18  livepeer[1057068]: ERROR: decoder.c:348] Unable to open video decoder : Error number -1448234581 occurred
Mar 12 18:31:18  livepeer[1057068]: E0312 18:31:18.408003 1057068 ffmpeg.go:986] Transcoder Return : Unrecoverable state, restart process
Mar 12 18:31:18  livepeer[1057068]: I0312 18:31:18.408473 1057068 lb.go:200] manifestID=53dcd30f-d785-460e-8382-b0accc7e0db4 seqNo=3811 orchSessionID=3c2de895 clientIP=79.127.199.50 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E LB: Stopping transcoder due to error for key=3c2de895_0
eliteprox commented 5 months ago

I lost 8 streams in EU region when transcoder hit an error. It looks like other orchestrators were affected within a similar timeframe. Some of the data in the logs looks suspicious. The transcoder was at 31% utilization and transcode times were nominal so it should not have run out of memory.

Time of impact: 15:33 GMT-0 June 13, 2024

Attached are logs from both orchestrator and transcoder. transcoder logs..txt orchestrator_logs.txt

Transcoder logs show what appears to be malicious input going to EncodePicture Note: The characters are not all visible in markdown, but can be seen in the attachments

I0613 15:33:22.373218       1 ot_rpc.go:144] Transcoding taskId=1974134 url=https://85.10.207.72:8933/stream/533f41a9/764.tempfile
[h264_nvenc @ 0x7fa236a96380] EncodePicture failed!: generic error (20):    ey? [ ?  ? v  :?  ??  E    p7[ߪ? * :
   "] J? ?  ?u  @   E?  ? 9 ?U?
           E "([
ERROR: encoder.c:353] Error sending frame to encoder : Unknown error occurred
[h264_cuvid @ 0x7fa2353e83c0] ctx->cvdl->cuvidMapVideoFrame(ctx->cudecoder, parsed_frame.dispinfo.picture_index, &mapped_frame, &pitch, &params) failed -> CUDA_ERROR_OUT_OF_MEMORY: out of memory
ERROR: decoder.c:78] Error sending packet to decoder : Cannot allocate memory
ERROR: transcoder.c:708] Could not decode; stopping : Cannot allocate memory
E0613 15:33:22.453031       1 ffmpeg.go:986] Transcoder Return : Cannot allocate memory
E0613 15:33:22.453110       1 ot_rpc.go:197] manifestID=2f0et6t020t2v99q seqNo=764 orchSessionID=533f41a9 taskId=1974134 Transcoding done for taskId=1974134 url=https://85.10.207.72:8933/stream/533f41a9/764.tempfile dur=74.4882ms err="Cannot allocate memory"
E0613 15:33:22.453136       1 ot_rpc.go:252] manifestID=2f0et6t020t2v99q seqNo=764 orchSessionID=533f41a9 taskId=1974134 Unable to transcode err="Cannot allocate memory"
I0613 15:33:22.465741       1 ot_rpc.go:144] Transcoding taskId=1974135 url=https://85.10.207.72:8933/stream/a46e38a6/6142.tempfile
[AVHWDeviceContext @ 0x7fa236d67e00] cu->cuMemFree((CUdeviceptr)data) failed -> CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
[AVHWDeviceContext @ 0x7fa236d67e00] cu->cuMemFree((CUdeviceptr)data) failed -> CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
[AVHWDeviceContext @ 0x7fa236d67e00] cu->cuMemFree((CUdeviceptr)data) failed -> CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

The problematic manifest appears to be: manifestID=6622kjcn94p6fejw seqNo=3749 orchSessionID=b541d756 clientIP=138.199.36.76

Public broadcaster logs for manifest 6622kjcn94p6fejw

I0613 15:40:51.704359       1 broadcast.go:348] [PublicLogs] manifestID=6622kjcn94p6fejw seqNo=3974 orchSessionID=34521060 orchestrator=https://null.anlivepeer.com:8935 clientIP=195.181.174.186 Removing orch=https://null.anlivepeer.com:8935 from manifestID=6622kjcn94p6fejw session list
I0613 15:33:23.680666       1 broadcast.go:348] [PublicLogs] manifestID=6622kjcn94p6fejw seqNo=3750 orchSessionID=b541d756 orchestrator=https://85.10.207.72:8933 clientIP=195.181.174.186 Removing orch=https://85.10.207.72:8933 from manifestID=6622kjcn94p6fejw session list
I0613 15:22:31.784002       1 broadcast.go:348] [PublicLogs] manifestID=6622kjcn94p6fejw seqNo=3424 orchSessionID=44b5832c orchestrator=https://162.244.81.240:8953 clientIP=195.181.174.186 Removing orch=https://162.244.81.240:8953 from manifestID=6622kjcn94p6fejw session list
I0613 15:20:03.636773       1 broadcast.go:348] [PublicLogs] manifestID=6622kjcn94p6fejw seqNo=3350 orchSessionID=19356694 orchestrator=https://85.4.180.49:8935 clientIP=195.181.174.186 Removing orch=https://85.4.180.49:8935 from manifestID=6622kjcn94p6fejw session list
I0613 14:31:31.490458       1 broadcast.go:348] [PublicLogs] manifestID=6622kjcn94p6fejw seqNo=1894 orchSessionID=1fa54e06 orchestrator=https://eu-fra.grant-node.xyz:18935 clientIP=195.181.174.186 Removing orch=https://eu-fra.grant-node.xyz:18935 from manifestID=6622kjcn94p6fejw session list

It is also notable that these issues occurred during the ramp up of a gas price spike