Open papabear99 opened 2 years ago
I did notice this and I thought it was my system doing that. Glad I'm not the only one.
I haven't dug into which GPU actually gets used yet, but I haven't been able to reproduce Step [3] of the reported behaviour:
$ livepeer -transcoder -orchSecret=123 -orchAddr=123 -nvidia 0
I0303 11:32:54.434922 47524 livepeer.go:274] ***Livepeer is running on the offchain network***
I0303 11:32:54.436043 47524 livepeer.go:319] Transcoding on these Nvidia GPUs: [0]
$ livepeer -transcoder -orchSecret=123 -orchAddr=123 -nvidia 1
I0303 11:34:06.017142 47564 livepeer.go:274] ***Livepeer is running on the offchain network***
I0303 11:34:06.020906 47564 livepeer.go:319] Transcoding on these Nvidia GPUs: [1]
and looking at the code, this part more or less just parses the command line arguments and logs them, so I'm not sure how the log line could end up being wrong.
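For context, here is a minimal sketch (not the actual Livepeer source) of what parsing and logging a -nvidia flag value might look like — the function name and behavior are illustrative assumptions:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNvidiaDevices is a simplified sketch of turning a "-nvidia" flag value
// like "0", "1", or "0,1" into a list of device IDs. ("all" is handled
// separately in the real code; this sketch only covers explicit indices.)
func parseNvidiaDevices(flagVal string) ([]int, error) {
	var devices []int
	for _, part := range strings.Split(flagVal, ",") {
		id, err := strconv.Atoi(strings.TrimSpace(part))
		if err != nil {
			return nil, fmt.Errorf("invalid device %q: %w", part, err)
		}
		devices = append(devices, id)
	}
	return devices, nil
}

func main() {
	devices, err := parseNvidiaDevices("1")
	if err != nil {
		panic(err)
	}
	// Mirrors the startup log line seen in this thread.
	fmt.Printf("Transcoding on these Nvidia GPUs: %v\n", devices)
	// → Transcoding on these Nvidia GPUs: [1]
}
```

If the log line is produced straight from the parsed flag like this, it can be correct even while a later layer (e.g. the CUDA session setup) maps the index to a different physical card.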
I'll carry on checking whether the GPU selected is/isn't the one that actually gets used, but could you double check what you're seeing logged out please @papabear99? If you're still able to reproduce then a full log of the startup might be useful
Yes, the issue is still present. If GPU 0 is specified in the start command as -nvidia 0, the log (as yours above) shows GPU 0 is selected, but when I look at GPU activity using nvidia-smi or Task Manager I can see that GPU 1 is actually the GPU being used. It works the other way around if -nvidia 1 is specified.
On a single GPU node, -nvidia all shows and uses GPU 0.
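For anyone else checking this, nvidia-smi can list compute processes per GPU, which is more direct than watching utilization graphs. The sketch below parses its CSV output; the exact field list (gpu_bus_id, pid, process_name) is an assumption you may need to adjust to your nvidia-smi version:

```go
package main

import (
	"fmt"
	"strings"
)

// parseComputeApps parses CSV lines as produced by something like:
//   nvidia-smi --query-compute-apps=gpu_bus_id,pid,process_name --format=csv,noheader
// into (busID, pid, processName) triples, so the livepeer PID can be matched
// to a physical GPU. Sketch only; error handling is minimal.
func parseComputeApps(csv string) [][3]string {
	var rows [][3]string
	for _, line := range strings.Split(strings.TrimSpace(csv), "\n") {
		f := strings.Split(line, ", ")
		if len(f) >= 3 {
			rows = append(rows, [3]string{f[0], f[1], f[2]})
		}
	}
	return rows
}

func main() {
	// Hypothetical sample output line for illustration.
	sample := "00000000:01:00.0, 47524, livepeer.exe\n"
	for _, r := range parseComputeApps(sample) {
		fmt.Printf("GPU %s is running %s (pid %s)\n", r[0], r[2], r[1])
	}
}
```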
Note: I don't know if this is Windows-specific, as I don't have any Transcoders running on Linux.
@NightWolf92 Does it happen for you on Linux?
Thanks @papabear99, if you're seeing the correct GPU in the log line now then that's good - I'll carry on with investigating why that differs from the GPU that actually gets used
@papabear99 @thomshutt I am only seeing this on Windows, it seems to be fine on Linux.
my config
transcoder
orchSecret <secret>
reward false
maxSessions 20
orchAddr <ip>
cliAddr 127.0.0.1:<port>
nvidia 1 #As you can see nvidia 1 is selected
monitor
v 6
GPU 0 is the 1070, and that's what is actually being utilized. GPU 1 is my 3080 TI. However, you can see that although I put nvidia 1 as the option in the config, it's utilizing GPU 0. If I were to put nvidia 0, it would use GPU 1.
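One unconfirmed hypothesis for this swap: CUDA's default device enumeration orders GPUs "fastest first", while nvidia-smi orders them by PCI bus ID, so on a machine mixing a 1070 and a 3080 TI the two numberings can disagree. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID before launch makes CUDA follow the nvidia-smi ordering. The wrapper below is a sketch of that workaround, not a fix confirmed in this thread:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// buildEnv returns the current environment plus the flag that forces CUDA to
// enumerate devices in nvidia-smi's PCI bus order.
func buildEnv() []string {
	return append(os.Environ(), "CUDA_DEVICE_ORDER=PCI_BUS_ID")
}

func main() {
	if _, err := exec.LookPath("livepeer"); err != nil {
		// No livepeer binary on PATH; just show the env var we would add.
		fmt.Println("livepeer not on PATH; would set: CUDA_DEVICE_ORDER=PCI_BUS_ID")
		return
	}
	// Launch livepeer with all flags passed straight through.
	cmd := exec.Command("livepeer", os.Args[1:]...)
	cmd.Env = buildEnv()
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```

On a box where all GPUs are the same model (like the 4x Tesla T4 repro attempt below), the two orderings would coincide anyway, which could explain why the swap only shows up with mixed models.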
I've spun up a Windows Server 2022 machine in Google Cloud, with 4 Nvidia Tesla T4s attached.
I ran the following commands to stand up a local Broadcaster / Orchestrator / Transcoder environment:
Broadcaster
./livepeer -datadir ~/temp/livepeer/TEST_B -broadcaster -cliAddr :7937 -httpAddr :8936 -orchAddr 127.0.0.1:8935 -v 6
Orchestrator
./livepeer -datadir ~/temp/livepeer/TEST -orchestrator -orchSecret foo -serviceAddr 127.0.0.1:8935 -v 6
Transcoder
./livepeer -datadir ~/temp/livepeer/TEST_T1 -transcoder -orchSecret foo -orchAddr 127.0.0.1:8935 -cliAddr :7936 -nvidia 0 -v 6
I see the expected log line on startup
I0308 15:53:50.331431 1468 livepeer.go:319] Transcoding on these Nvidia GPUs: [0]
and the correct GPU being used when I send a job in to be transcoded (also works correctly when specifying the other GPUs):
Hm, not sure why yours works properly whereas ours doesn't. See below.
I wonder if it's related to the keylase patch, since you don't have to use it with the Tesla cards. Even though I have a Quadro, which is unlocked by default, I have it in a machine with a GTX, so it's running with patched drivers.
I can certainly work around it, and it's probably not worth spending time on, as I assume most people use all their cards.
It's fine not to fix this, because it's not a huge issue, but for example, I had my main 3080 TI running for a few weeks when I thought my 1070 was running before I realized. For other people it would just have unintended consequences until they realize.
@papabear99 The only thing that stands out as unusual from that screenshot is that even though the Nvidia util is reporting GPU 0, I see that its utilization is at 0% versus 18% for the other one - is that just a result of something else running on that GPU?
The keylase patch could potentially be responsible, are you also running the patched drivers @NightWolf92?
Other than that, we might have to put this on the backburner until I can figure out how to reproduce it.
It's @NightWolf92's screenshot but I'm going to take a guess that he just happened to capture it at the bottom of the sawtooth pattern (fairly typical encoding pattern for Livepeer segments) so Task Manager is showing 0%. Regarding nvidia-smi, that looks like a static view and is just to show that GPU 0 is assigned to Livepeer even though nvidia 1 is specified to the right.
I don't know how Google Cloud works, but if you can install the keylase patch (since it's compatible with GPUs that don't require it) and see if the issue pops up afterward, then we can confirm whether it's related to the patch.
@thomshutt
Correct, the screenshot timeframes are off, so they aren't updating live. As you can see, nvidia-smi shows 11:29 but the livepeer window shows 11:34; they were taken a few minutes apart. The GPUs are patched since they are a 1070/3080, which are normally session-capped.
Same result (working as expected) after patching the driver with keylase unfortunately
Ah, well that solves it then! Thank you for looking into it. If there's any reference to the patch in the Livepeer documentation, maybe note this?
I could've probably phrased that better, sorry 🤦. I meant that I still can't reproduce this issue even after patching the driver, and am getting the same correct results as I was before.
Alright, I'm out of ideas, other than maybe it's fixed in Windows Server 2022, or maybe it's only present when running different GPU models. Even though I have heard from another O that he has the same issue (Win 10), I don't think it's worth spending any more time looking into.
Maybe a note in the docs (that nobody will read lol)?
Describe the bug: Specifying -nvidia 0 results in GPU 1 being selected; -nvidia 1 selects GPU 0.
To Reproduce Steps to reproduce the behavior:
Expected behavior: The GPU specified by the -nvidia flag should be the GPU that actually gets used.
Desktop (please complete the following information):
Additional context: The behavior is present using both combined O+T and standalone T configurations. I have not tried this on machines with more than 2 GPUs, so I don't know what happens if -nvidia 2 is specified.