livepeer / go-livepeer

Official Go implementation of the Livepeer protocol
http://livepeer.org
MIT License

Specifying -nvidia 0 results in GPU 1 being selected; -nvidia 1 selects GPU 0 #2255

Open papabear99 opened 2 years ago

papabear99 commented 2 years ago

Describe the bug: Specifying -nvidia 0 results in GPU 1 being selected; -nvidia 1 selects GPU 0.

To Reproduce: Steps to reproduce the behavior:

  1. Add -nvidia 0 to start command
  2. Start Livepeer
  3. View console output and observe "Transcoding on these Nvidia GPUs: [1]". I also confirmed in actual use that the incorrect GPU is selected and used for transcoding.
  4. This behavior also happens in reverse: -nvidia 1 selects GPU 0.

Expected behavior: The GPU specified by the -nvidia flag should be the one actually used for transcoding.

Additional context: The behavior is present in both combined O+T and standalone T configurations. I have not tried this on machines with more than 2 GPUs, so I don't know what happens if -nvidia 2 is specified.

RyanC92 commented 2 years ago

I did notice this and thought it was just my system doing that. Glad I'm not the only one.

thomshutt commented 2 years ago

I haven't dug into which GPU actually gets used yet, but I haven't been able to reproduce Step [3] of the reported behaviour:

$ livepeer -transcoder -orchSecret=123 -orchAddr=123 -nvidia 0
I0303 11:32:54.434922   47524 livepeer.go:274] ***Livepeer is running on the offchain network***
I0303 11:32:54.436043   47524 livepeer.go:319] Transcoding on these Nvidia GPUs: [0]
$ livepeer -transcoder -orchSecret=123 -orchAddr=123 -nvidia 1
I0303 11:34:06.017142   47564 livepeer.go:274] ***Livepeer is running on the offchain network***
I0303 11:34:06.020906   47564 livepeer.go:319] Transcoding on these Nvidia GPUs: [1]

and looking at the code, this part more or less just parses the command line arguments and logs them, so I'm not sure how the log line could end up being wrong:

https://github.com/livepeer/go-livepeer/blob/bd5e3427ebc178c7bbf683f18bbd3ab52be59011/cmd/livepeer/livepeer.go#L315-L319

https://github.com/livepeer/go-livepeer/blob/bd5e3427ebc178c7bbf683f18bbd3ab52be59011/common/util.go#L444-L449
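
For illustration, the device-selection parsing amounts to something like the sketch below (function and variable names here are my own, not copied from the linked go-livepeer source): it just splits the comma-separated flag value, so there's no obvious place in this step for indices to get swapped.

// Minimal sketch of comma-separated device-list parsing, assuming the
// linked helper behaves roughly like this; names are illustrative only.
package main

import (
	"fmt"
	"strings"
)

// parseDeviceList splits a flag value like "0,1,3" into individual
// device ID strings.
func parseDeviceList(flagVal string) []string {
	var devices []string
	for _, d := range strings.Split(flagVal, ",") {
		if d = strings.TrimSpace(d); d != "" {
			devices = append(devices, d)
		}
	}
	return devices
}

func main() {
	fmt.Println(parseDeviceList("0")) // [0]
	fmt.Println(parseDeviceList("1")) // [1]
}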

I'll carry on checking whether the GPU selected is/isn't the one that actually gets used, but could you double check what you're seeing logged, please @papabear99? If you're still able to reproduce, then a full log of the startup might be useful.

papabear99 commented 2 years ago

Yes, the issue is still present. If GPU 0 is specified in the start command as -nvidia 0, the log (as yours above) shows GPU 0 is selected, but when I look at GPU activity using nvidia-smi or the Task Manager I can see that GPU 1 is actually the GPU being used. It works the other way if -nvidia 1 is specified.

On a single-GPU node, -nvidia all shows and uses GPU 0.
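
For anyone checking the same thing, a one-shot nvidia-smi query that pairs each GPU index with its name, bus ID, and utilization (standard nvidia-smi query flags, nothing Livepeer-specific):

nvidia-smi --query-gpu=index,name,pci.bus_id,utilization.gpu --format=csv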

Note: I don't know if this is Windows-specific, as I don't have any Transcoders running on Linux.

@NightWolf92 Does it happen for you on Linux?

thomshutt commented 2 years ago

Thanks @papabear99, if you're seeing the correct GPU in the log line now then that's good - I'll carry on with investigating why that differs from the GPU that actually gets used

RyanC92 commented 2 years ago

@papabear99 @thomshutt I am only seeing this on Windows, it seems to be fine on Linux.

My config:

transcoder
orchSecret <secret>
reward false
maxSessions 20
orchAddr <ip>
cliAddr 127.0.0.1:<port>
nvidia 1 #As you can see nvidia 1 is selected
monitor 
v 6

[screenshot: Task Manager showing GPU 0 (the GTX 1070) being utilized]

GPU 0 is the 1070, and that is what is being utilized with this config file. GPU 1 is my 3080 TI. As you can see, although I put nvidia 1 as the option in the config, it's utilizing GPU 0. If I were to put nvidia 0, it would use GPU 1.

thomshutt commented 2 years ago

I've spun up a Windows Server 2022 machine in Google Cloud, with 4 Nvidia Tesla T4s attached.

I ran the following commands to stand up a local Broadcaster / Orchestrator / Transcoder environment:

Broadcaster

./livepeer -datadir ~/temp/livepeer/TEST_B -broadcaster -cliAddr :7937 -httpAddr :8936 -orchAddr 127.0.0.1:8935 -v 6

Orchestrator

./livepeer -datadir ~/temp/livepeer/TEST -orchestrator -orchSecret foo -serviceAddr 127.0.0.1:8935 -v 6

Transcoder

./livepeer -datadir ~/temp/livepeer/TEST_T1 -transcoder -orchSecret foo -orchAddr 127.0.0.1:8935 -cliAddr :7936 -nvidia 0 -v 6

I see the expected log line on startup

I0308 15:53:50.331431    1468 livepeer.go:319] Transcoding on these Nvidia GPUs: [0]

and the correct GPU being used when I send a job in to be transcoded (also works correctly when specifying the other GPUs):

[screenshot: nvidia-smi showing the specified Tesla T4 in use]

RyanC92 commented 2 years ago

> I've spun up a Windows Server 2022 machine in Google Cloud, with 4 Nvidia Tesla T4s attached. [...] I see the expected log line on startup and the correct GPU being used when I send a job in to be transcoded (also works correctly when specifying the other GPUs).

Hm, not sure why yours is working properly whereas ours isn't. See below.

[screenshot: nvidia-smi showing GPU 0 assigned to livepeer, next to a config window specifying nvidia 1]

papabear99 commented 2 years ago

I wonder if it's related to the keylase patch, since you don't have to use it with the Tesla cards. Even though I have a Quadro, which is unlocked by default, I have it in a machine with a GTX, so it's running with patched drivers.

I can certainly work around it, and it's probably not worth spending time on, as I assume most people use all their cards.

RyanC92 commented 2 years ago

It's fine not to fix this because it's not a huge issue, but for example, I had my main 3080 TI running for a few weeks when I thought my 1070 was running before I realized. For other people it would just have unintended consequences until they realize.

thomshutt commented 2 years ago

[screenshot: nvidia-smi reporting GPU 0 for the livepeer process, with GPU 0 at 0% utilization and the other GPU at 18%]

@papabear99 The only thing that stands out as unusual from that screenshot is that even though the Nvidia util is reporting GPU 0, I see that its utilization is at 0% versus 18% for the other one - is that just a result of something else running on that GPU?

The keylase patch could potentially be responsible, are you also running the patched drivers @NightWolf92?

Other than that, we might have to put this on the backburner until I can figure out how to reproduce it.

papabear99 commented 2 years ago

It's @NightWolf92's screenshot but I'm going to take a guess that he just happened to capture it at the bottom of the sawtooth pattern (fairly typical encoding pattern for Livepeer segments) so Task Manager is showing 0%. Regarding nvidia-smi, that looks like a static view and is just to show that GPU 0 is assigned to Livepeer even though nvidia 1 is specified to the right.
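
A static snapshot can easily catch the bottom of that sawtooth. If anyone wants to confirm, nvidia-smi can sample continuously instead: its dmon mode prints per-GPU utilization once per second, including dedicated encoder (enc) and decoder (dec) columns, which is where transcoding load shows up:

nvidia-smi dmon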

papabear99 commented 2 years ago

I don't know how Google Cloud works, but if you can install the keylase patch (it's compatible with GPUs that don't require it) and see if the issue pops up afterward, then we can confirm whether it's related to the patch.

RyanC92 commented 2 years ago

> It's @NightWolf92's screenshot but I'm going to take a guess that he just happened to capture it at the bottom of the sawtooth pattern (fairly typical encoding pattern for Livepeer segments) so Task Manager is showing 0%. Regarding nvidia-smi, that looks like a static view and is just to show that GPU 0 is assigned to Livepeer even though nvidia 1 is specified to the right.

@thomshutt Correct, the screenshot timeframes are off, so they aren't updating live. As you can see, nvidia-smi shows 11:29 but the livepeer window shows 11:34; they were taken a few minutes apart. The GPUs are patched, since they are 1070/3080s, which are normally session-capped.

thomshutt commented 2 years ago

Same result (working as expected) after patching the driver with keylase unfortunately

RyanC92 commented 2 years ago

> Same result (working as expected) after patching the driver with keylase unfortunately

Ah, well that solves it then! Thank you for looking into it. If there's any reference to the patch in the Livepeer documentation, maybe note this?

thomshutt commented 2 years ago

I could've probably phrased that better, sorry 🤦 I meant that I still can't reproduce this issue even after patching the driver, and am getting the same correct results as I was before.

papabear99 commented 2 years ago

> Same result (working as expected) after patching the driver with keylase unfortunately

Alright, I'm out of ideas, other than maybe it's fixed in Windows Server 2022, or maybe it's only present when running different model GPUs. Even though I've heard from another O that he has the same issue (Win 10), I don't think it's worth spending any more time looking into.

Maybe a note in the docs (that nobody will read lol)?
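
One thing that might be worth ruling out on mixed-GPU machines (a guess on my part, not something confirmed in this thread): CUDA's default device enumeration is fastest-first, while nvidia-smi numbers GPUs in PCI bus order, so on a 1070 + 3080 TI box the two tools can disagree about which card is index 0, whereas identical T4s would line up either way. The documented CUDA environment variable below forces PCI bus ordering; whether it changes livepeer's selection is untested:

rem Windows: force CUDA to number GPUs in PCI bus order, matching nvidia-smi
set CUDA_DEVICE_ORDER=PCI_BUS_ID
livepeer -transcoder -orchSecret foo -orchAddr 127.0.0.1:8935 -nvidia 1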