livepeer / test-harness


External GPU plan #75

Open j0sh opened 5 years ago

j0sh commented 5 years ago

We should determine how external GPUs should work within the test harness, and the steps / tasks required to achieve that.

ya7ya commented 5 years ago

I wrote a quick document https://hackmd.io/@aDTLI7NSTau8FrWglfCb5g/rkx2bVLzS explaining the options we have for running GPU-enabled machines within the test-harness. Please take a look and let me know which option you think is best; feedback is very welcome. cc @j0sh @mk-livepeer @darkdarkdragon

mk-livepeer commented 5 years ago

I think the ansible option is too much of a divergence. I would stick with docker. I don't use ansible. Maybe it's awesome, but... If an ansible configuration file scares me, I wonder how it would make regular Joe Miner feel? Miners don't use ansible.

Where did the assumption come from that node operators don't like to use docker? This is simply not the case.

Masternode operators are perfectly happy with installations on bare metal, and when there is a simple docker installation they can follow, they tend to be very happy with that, too.

Why do you describe docker as an extra layer of abstraction? It's not. It's a means of determinism. Docker eliminates the ambiguity of a system. It brings a known constant.

> more realistic gpu memory performance when testing concurrency since there will be no memory management like in docker.

Docker does not manage GPU memory. It manages the resource as a whole.

VPN shenanigans... I don't like it either, but I don't know that there is much choice. VPN should slow down our throughput, and that's actually a good thing. It should cripple the transfer speeds a bit and help us aim to a lower common denominator when it comes to data transfer.


The reasons given for not running Docker amount to asking for a bare-metal install, and I don't think that's reasonable for reproducibility at all. I vote docker.

j0sh commented 5 years ago

@ya7ya Thanks for the writeup and detailed breakdown - this was very helpful!

I'll leave most of the commentary to the devops experts, but just one note about this:

> miners (target user)

The target user is really Livepeer developers - we want to be able to use this for our internal testing for non-GCP boxes that we have under our control (eg, office rig, Genesis, Bison Trails, etc). If this setup turns out to be attractive to regular miners / transcoders within a production environment, then that's a bonus - but we should tailor our approach towards meeting our own needs first.

Beyond that, the only comment I have is that various folks have been working on-and-off to get GPU transcoding working via Docker, and AFAIK we're not quite there yet, eg https://github.com/livepeer/go-livepeer/issues/1009. If we commit to the Docker path, we should ensure that we have these types of unknowns under control.
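One quick way to pin down this unknown is to verify that a container on a given rig can see the GPU at all before debugging transcoding itself. A minimal sketch, assuming the NVIDIA container toolkit is installed (the CUDA image tag is illustrative):

```shell
# Docker >= 19.03 with the NVIDIA container toolkit: expose all GPUs
# to the container and run nvidia-smi; it should list the host's GPUs.
docker run --rm --gpus all nvidia/cuda:10.0-base nvidia-smi

# Older nvidia-docker2 setups select the NVIDIA runtime instead:
docker run --rm --runtime=nvidia nvidia/cuda:10.0-base nvidia-smi
```

If `nvidia-smi` succeeds inside the container, the remaining failures are in the transcoding pipeline rather than in GPU passthrough.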

iameli commented 5 years ago

> Beyond that, the only comment I have is that various folks have been working on-and-off to get GPU transcoding working via Docker, and AFAIK we're not quite there yet, eg livepeer/go-livepeer#1009.

I'm writing a longer response, but worth noting that GPU transcoding actually works great in that setting — it's just GPU transcoding of segments from Wowza that're failing. Somehow.

darkdarkdragon commented 5 years ago

@j0sh Thanks for the clarification about the target user - it was unclear from Yahya's document. Small question - do we have a separate task somewhere to integrate GCP GPU transcoding into the test harness?

About plan itself:

  1. VPN only matters at provisioning time - Ts connect to Os by themselves, so they can connect over the open internet (if they have access to it) to the public IPs of the Os (I assume only Ts will be running on the rig machines; all other nodes will run on Google's infrastructure).
  2. If GPU transcoding in docker actually works, then of course we should go with docker - it will be easier/faster.
  3. There is an open question around test scheduling - if I want to run tests, how do I check that I won't interfere with a test someone else is already running?

About the work track: I think the "create nvidia docker installation script" task suggests that docker will be installed every time the test harness is run, but I don't think that should be the case - the docker daemon should be installed once, manually, and the test harness should just use it to connect to its own swarm and run containers.

Or we could go another route - use the same Docker Swarm for everyone. This would require maintaining a permanent docker manager machine, but it would totally eliminate the need for the test harness to connect to the machines with GPUs - they would be connected to the Swarm permanently, and the test harness would just use them as resources. That would require bigger changes to the test harness, though, and some way for everyone to get credentials for that Docker Manager machine (but that should be the easy part).
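The "permanent swarm" variant could look roughly like this. A sketch only - the manager address, join token, node label, and image name are all placeholders, and the one-time install step assumes the stock Docker convenience installer:

```shell
# One-time, manual setup on each GPU rig:
# install docker, then join the shared swarm as a worker.
curl -fsSL https://get.docker.com | sh
docker swarm join --token <WORKER_TOKEN> <MANAGER_IP>:2377

# On the manager, label GPU nodes once so work can be targeted at them:
docker node update --label-add gpu=true <RIG_NODE_ID>

# The test harness then only ever talks to the manager, e.g.:
docker service create --name transcoder \
  --constraint 'node.labels.gpu==true' <TRANSCODER_IMAGE>
```

The join happens once per rig; after that the harness schedules work through the manager without ever reaching the rigs directly.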

ya7ya commented 5 years ago

> more realistic gpu memory performance when testing concurrency since there will be no memory management like in docker. Docker does not manage GPU memory. It manages the resource as a whole.

@mk-livepeer This is correct - it turns out nvidia-docker doesn't manage GPU memory, only system resources like CPU and RAM.

> I think that create nvidia docker installation script suggests that docker will be installed every time test harness is run, but I think this shouldn't be the case

@darkdarkdragon Yes, I mean nvidia-docker won't be installed by default on these machines. We could, as you suggested, run a permanent docker manager machine, but that adds other issues as well.

> VPN only matters at the time of provisioning - because Ts connect to Os by themselves, so they can connect through open internet (if they have access to it)

These machines don't have access to the open internet.

> (I assume only Ts will be running on rigs machines, all other nodes will be running on Google's infrastructure).

Yes - remember these machines don't have public IPs, so we're going to need to work around that so the transcoding machines can access the other parts hosted on GCP.
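Since connections only ever originate from the rigs, one lighter-weight alternative to a full VPN would be a reverse SSH tunnel through a small bastion on GCP. A sketch with hypothetical host names, ports, and script name:

```shell
# On the rig (no public IP, outbound access only): hold open a reverse
# tunnel so port 2222 on the bastion forwards back to the rig's sshd.
ssh -N -R 2222:localhost:22 tunnel@bastion.example.com

# From the test harness: provision the rig by SSHing through the bastion.
ssh -p 2222 rig-user@bastion.example.com 'bash install-nvidia-docker.sh'
```

This only covers provisioning; at run time the transcoders would still dial out to the public IPs of the GCP-hosted nodes as described above.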

> The target user is really Livepeer developers - we want to be able to use this for our internal testing for non-GCP boxes that we have under our control (eg, office rig, Genesis, Bison Trails, etc).

@j0sh Agreed, I'll edit that in the document.

> Beyond that, the only comment I have is that various folks have been working on-and-off to get GPU transcoding working via Docker, and AFAIK we're not quite there yet, eg livepeer/go-livepeer#1009. If we commit to the Docker path, we should ensure that we have these types of unknowns under control

Good point 🤔 - I'll see whether we can confirm this is possible.

ya7ya commented 5 years ago

main tasks to tackle: