selkies-project / docker-nvidia-glx-desktop

KDE Plasma Desktop container designed for Kubernetes, supporting OpenGL EGL and GLX, Vulkan, and Wine/Proton for NVIDIA GPUs through WebRTC and HTML5, providing an open-source remote cloud/HPC graphics or game streaming platform.
https://github.com/selkies-project/docker-nvidia-glx-desktop/pkgs/container/nvidia-glx-desktop
Mozilla Public License 2.0

Can't connect to hardly any Hosts #46

Closed: theShiZa closed this issue 8 months ago

theShiZa commented 11 months ago

Hi.

Please help. I used Vast.ai a few months ago & it was hit & miss guessing which Host would run the GLX Desktop, but at least back then I was having more success than now. I'm using the exact same GLX Desktop settings now as back then, and now it's the worst I've ever seen: roughly an 80% failure rate when trying Hosts. It's just a lottery. I was told to try Hosts running driver 535.129.03, which so far has been the only version that's worked, but even then there have been Hosts on that same 535.129.03 driver that don't work. It's just a mixed bag.

I just use the default GLX Desktop template at Vast. If the driver & CUDA versions of 2 different Hosts are the same, what else could cause one to work & the other to fail?
My internet speed itself is not the factor. This is a quick run-down of what I got today; I've also attached 2 log files (Status Log & Debug Log).

m-8914.txt m-11864.txt

SUCCESS ::

** m:88435 - 8x 4090, driver 535.129.03, CUDA 12.2 = YES, working.

** m:14018 - 8x 4090, driver 535.129.03, CUDA 12.2 = YES, working.

** m:14791 - 1x 4090, driver 535.129.03, CUDA 12.2 = YES, working.

FAIL ::

** m:5150 - 8x 4090, driver 525.105.17, CUDA 12.0 = NO, entered Login info, "Connection Failed" with a RELOAD button.

** m:8914 DataCentre:18 - 8x 4090, driver 525.105.17, CUDA 12.0 = NO, entered Login info, "Connection Failed" with a RELOAD button. I have the Status Log & Debug Log for this one.

** m:8874 Host 54858 - 12x 4090, driver 535.104.05, CUDA 12.2 = NO, error "Site Can't Be Reached".

** m:13471 Host 61247 - 12x 4090, driver 535.113.01, CUDA 12.2 = NO, error "Site Can't Be Reached".

** m:11773 Host 61247 - 12x 4090, driver 535.129.03, CUDA 12.2 = NO, error "Error response from daemon: --storage-opt is supported only for overlay over xfs with 'pquota' mount option".

** m:8965 DataCentre:18 - 8x 4090, driver 535.129.03, CUDA 12.2 = NO, entered Login info, "Connection Failed" with a RELOAD button. (535.129.03 with CUDA 12.2 & it still didn't work? Something extra odd there, as I've succeeded with those driver & CUDA versions on other machines.) I have the Status Log & Debug Log for this one.

** m:9111 - 8x 4090, driver 535.129.03, CUDA 12.2 = NO, entered Login info, "Connection Failed" with a RELOAD button. I have the Status Log & Debug Log for this one. (535.129.03 with CUDA 12.2 & it still didn't work? Something extra odd there, as I've succeeded with those driver & CUDA versions on other machines.)

** m:10921 Host 36803 - 8x 4090, driver 535.54.03, CUDA 12.2 = NO, error "Site Can't Be Reached".

** m:8335 - 8x 4090, driver 545.23.08, CUDA 12.3 = NO, error "Site Can't Be Reached".

Is this something we're going to deal with forever? Just a lottery, guessing which Hosts are compatible, while they constantly change or get broken because of driver incompatibility? Why is it worse right now than ever before? It also doesn't seem like the driver version is the only thing at play here; how can it be, if 2 Hosts have identical driver & CUDA versions yet one works & the other doesn't?

I've spoken with Vast Support on their site & with Hosts on the Vast Discord, but it seems no one can get this working in a more reliable way. What else can we try?

Thanks for your time.

ehfd commented 11 months ago

This is a combined issue of two different things: 535-series drivers older than 535.129.03 do not work, and that is an NVIDIA driver fault. Use or allocate Hosts on <= 530.xx or >= 535.129.03. https://github.com/selkies-project/docker-nvidia-glx-desktop/issues/41
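If you can get a shell on an instance (for example over SSH), you can confirm which driver the Host is actually running before spending any more time on it; a quick check along these lines:

```bash
# Print the NVIDIA driver version visible inside the instance:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# 535-series versions older than 535.129.03 are the broken ones;
# <= 530.xx or >= 535.129.03 should be fine.
```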

And you also need a TURN server (this is self-service unlike Parsec or Reemo!). https://github.com/selkies-project/docker-nvidia-glx-desktop/issues/17

theShiZa commented 11 months ago

Hi ehfd. Thanks for your quick reply.

Out of 14 Hosts & counting, I've only managed to get into 2. They were both 535.129.03 with CUDA 12.2, but at the same time I've had Hosts that are also 535.129.03 with CUDA 12.2 that I can't connect to, as you can see from my post above. So what's happening there? Another factor must be at play. Can you be more specific? What do you mean by an NVIDIA driver fault? As in missing drivers? Wrong drivers? Or buggy drivers?

And you also need a TURN server (this is self-service unlike Parsec or Reemo!). https://github.com/selkies-project/docker-nvidia-glx-desktop/issues/17

Can you explain this to me in the simplest possible terms, please? We're just artists trying to render our work in a reliable way. It seems GLX Desktop users are the minority, as Hosts couldn't care less about GLX Desktop compatibility; it's broken half the time, so I don't blame them. All we know is that it's gotten WAY worse than it was a few months ago with the same GLX Desktop templates. Will it ever be at a stage where it just works? Or is this something that we'll be plagued by forever?

I mean, it's 2023 & it's this much of a mess just to have a desktop we can look at? Is there a GLX Desktop alternative, or is this the best that can be done at this stage? I've chatted & tested in real time with Hosts & explained it to Vast.ai support, but it's all just a guessing game.

Please give some steps, explained in a bit more detail, on what to try next.

This is a pic of my default GLX Desktop template; can you please tell me if there's anything missing or wrong in it?

(screenshot of the default GLX Desktop template settings)

My mission, tedious as it is, was to make a list of successful & failed machines so I'd know which driver version to stick to, but that theory goes down the drain when I also fail on the same driver version that succeeded on another Host. So I can't even rely on driver versions as a reliable bet.

Appreciate your time.

ehfd commented 11 months ago

I understand your issue, and thank you for raising this. Perhaps docker-nvidia-egl-desktop would fit better in more universal environments.

NVIDIA broke the mechanism the GLX desktop relies on in the 535 drivers and fixed it later on...

The below is common to both the EGL and GLX desktops.

The TURN server is required because there are situations where the WebRTC peer-to-peer connection cannot be established, since both the server and the client are behind some type of NAT (firewall). The TURN server helps the server connect with the client. Zoom, Parsec, and other services provide this TURN server themselves, but here you need your own.

Specifying the TURN server solves this issue, and therefore you need your own TURN server.

https://github.com/selkies-project/selkies-gstreamer#using-a-turn-server

The above is a pretty comprehensive description of how you can set up your own coTURN server. You can possibly run this on Vast and expose both the frontend port (default is 3478) and the relay ports (a few or more five-digit ports, roughly one per user). You can also host your own coTURN server in any cloud VM you want.
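As a rough sketch of what that could look like with Docker (the ports, realm, and credentials below are placeholders to adapt to your own setup):

```bash
# Minimal self-hosted coTURN sketch using long-term credentials.
# All port numbers, the user/password, and the realm are placeholders.
docker run -d --name coturn \
  -p 3478:3478 -p 3478:3478/udp \
  -p 49160-49200:49160-49200/udp \
  coturn/coturn \
  --listening-port=3478 \
  --min-port=49160 --max-port=49200 \
  --lt-cred-mech --user=myuser:mypassword \
  --realm=example.com \
  --external-ip="$(curl -s https://checkip.amazonaws.com)"
```

The listening port and the relay range then have to be reachable from both your client and the desktop container.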

https://www.metered.ca/tools/openrelay/

These people provide free TURN servers that you can try out to see whether they solve most of the connection issues.

But the TURN server must be close to either the client or the server, or be in the path between both of them for optimal latency.

You may also set NOVNC_ENABLE to true for an interface without this requirement but with suboptimal performance. Set it to "Local Scaling" mode to fit the screen in this case.
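To tie it together, the relevant environment variables in the template would look roughly like this; the names are the ones I recall from the README, so double-check them against the README of the image tag you deploy:

```bash
# Placeholder values -- point these at the TURN server you set up above.
TURN_HOST=turn.example.com
TURN_PORT=3478
TURN_USERNAME=myuser
TURN_PASSWORD=mypassword
TURN_PROTOCOL=udp
# Optional fallback interface that works without WebRTC connectivity:
NOVNC_ENABLE=true
```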

Please ask me more questions after digesting the above.

It would be a good idea to find a way to pin the NVIDIA driver version, or to use the EGL desktop. Vast might have a method for this.

jsbcannell commented 11 months ago

Thanks ehfd, this is very useful.

Our current recommended template filters for drivers < 535.XX, so I need to look into updating that.

As for the TURN server, almost all machines on Vast have open ports, and this template's filters also require that. This template exposes internal port 8080 and runs the web streaming server on that open port, so the TURN server should not be required? That seems to be the case, as it works on many machines without it.

The other potential issue is an attached display, but we are filtering for that as well.

ehfd commented 11 months ago

As for the TURN server, almost all machines on Vast have open ports, and this template's filters also require that. This template exposes internal port 8080 and runs the web streaming server on that open port, so the TURN server should not be required? That seems to be the case, as it works on many machines without it.

@jsbcannell

Port 8080 is for the web interface and the WebSockets that relay WebRTC signalling information. The WebRTC media transport is separate from that and requires more ports, typically in the five-digit range. Containers without Host Networking are technically behind a firewall. That's why the TURN server requirement tends to be more common.
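To illustrate the difference (a sketch, not a drop-in command; other required settings are omitted):

```bash
# 8080/tcp carries only the web interface and WebSocket signalling.
# The WebRTC media uses separate, dynamically negotiated UDP ports.

# Option A: host networking, so those media ports are reachable directly
# (usually not something a Vast template can request):
docker run --network host --gpus all \
  ghcr.io/selkies-project/nvidia-glx-desktop:latest

# Option B: keep only 8080 published and relay the media through a TURN
# server you control (see the coTURN sketch above).
```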

I might be able to prepare a template for a working TURN server resource on VAST in some time if they provide some sponsorship resources (credits to work within the VAST cluster) to me.

I am also planning to upgrade noVNC to KasmVNC.

theShiZa commented 11 months ago

Hi ehfd.

Thanks for taking the time to explain this to me. I'm good with PCs but definitely not an expert, so I'll try following the TURN server steps, though I'm not sure if they're beyond my skill level.

I have heard of the EGL Desktop & that it is more compatible, but I've read that it doesn't perform as well as the GLX Desktop, so I'm not keen to try EGL. When renting, for example, 8x+ 4090s, the last thing I want is a downgrade in performance. So I'll stick with the GLX Desktop for now & see if I can get it to a reliable state by following your suggested steps.

I might be able to prepare a template for a working TURN server resource on VAST in some time if they provide some sponsorship resources to me.

Whatever that involves, it sounds excellent, but I'm sure it's too far in the future to benefit us anytime soon.

When attempting your TURN server steps, should I still only be trying Hosts with driver versions <= 530.xx or >= 535.129.03 for now?

Thanks for your patience & help.

theShiZa commented 11 months ago

Thanks ehfd, this is very useful.

Our current recommended template filters for drivers < 535.XX, so I need to look into updating that.

As for the TURN server, almost all machines on Vast have open ports, and this template's filters also require that. This template exposes internal port 8080 and runs the web streaming server on that open port, so the TURN server should not be required? That seems to be the case, as it works on many machines without it.

The other potential issue is an attached display, but we are filtering for that as well.

I'm assuming that's you VastAI Jake?! Thanks for your efforts on this.

ehfd commented 11 months ago

Please remind me about this in a couple of days. A bit busy now.

ehfd commented 11 months ago

Whatever that involves, it sounds excellent, but I'm sure it's too far in the future to benefit us anytime soon.

It's not that far away. If someone gives me some Vast credits, I can do it.

ehfd commented 11 months ago

When attempting your TURN server steps, should I still only be trying Hosts with driver versions <= 530.xx or >= 535.129.03 for now?

Yes, these are two separate things. Both are requirements.

ehfd commented 11 months ago

Basically, you should be prepared to expose a few TCP ports and a few UDP ports. I need that to configure a TURN server on Vast, RunPod, or CoreWeave.

theShiZa commented 11 months ago

Hi ehfd.

I'm prepared to do anything at this point. Roughly how much would "some" Vast credits be? Or is that too hard to estimate? Anyway, I know you're busy, so I'll leave the questions there for now till you get back from the U.S.

Thanks for the info.

ehfd commented 11 months ago

@theShiZa Hey. My work in the US is relevant to this so don't worry about that.

One suggestion is to use an Oracle Cloud free-tier VM instance, because I found that there are only GPU nodes in VAST. That would work pretty well as a free TURN server.
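One caveat with those instances: ingress is typically blocked by default, so the TURN ports have to be opened both in the OCI security list/NSG and in the OS firewall before coturn becomes reachable. The OS-firewall half, reusing the placeholder ports from the coTURN sketch above, might look like:

```bash
# Allow the placeholder TURN ports through the VM's OS firewall
# (matching ingress rules are also needed in the OCI security list / NSG):
sudo iptables -I INPUT -p tcp --dport 3478 -j ACCEPT
sudo iptables -I INPUT -p udp --dport 3478 -j ACCEPT
sudo iptables -I INPUT -p udp --dport 49160:49200 -j ACCEPT
```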

theShiZa commented 11 months ago

Hey ehfd. Hope everything is going well with your work in the U.S. Let me know if you need any help 😁

Well, I was going to leave you alone for a while, but you asked for it :)

What if I'll be uploading & downloading GBs or TBs of project files between the Vast Instance and Google Drive? Won't that cost money if I use a VPS service?

I was going to try using a spare laptop I have & set it up as a TURN server, but I read that the only way to get around my NAT is to set my laptop as a DMZ host on my router, and that would be very unsafe & not secure.

Can you tell me what you mean when you say "I found that there are only GPU nodes in VAST"? Just a quick explanation.

Thank You.

ehfd commented 11 months ago

What if I'll be uploading & downloading GBs or TBs of project files between the Vast Instance and Google Drive? Won't that cost money if I use a VPS service?

You wouldn't need to do that. The TURN server only relays WebRTC, and it's definitely not a VPN. This is just for the relay ports in situations where you can't open them, and Oracle's free tier seems to be generous with egress.

I was going to try using a spare laptop I have & set it up as a TURN server, but I read that the only way to get around my NAT is to set my laptop as a DMZ host on my router, and that would be very unsafe & not secure.

This is another very good way, but you would need to port forward certain outside-facing ports, like you said, using DMZ or other methods.

I found that there are only GPU nodes in VAST

If VAST happened to have CPU-only instances, or a way to provision a pod with only a small number of CPUs and a bit of RAM without GPUs, you wouldn't need a separate VM elsewhere, since the TURN server could be hosted that way. It would not be reasonable to allocate a GPU for a TURN server pod.

That said, perhaps there could be a non-TURN-server approach where a port can simply be exposed from the main container, like Neko does (such as https://neko.m1k1o.net/#/getting-started/configuration?id=webrtc), instead of requiring a TURN server.
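For reference, Neko's approach on that page is (as far as I recall) to pin the WebRTC media to a fixed UDP port range through environment variables and publish that range directly, so no TURN server is needed when those ports can be opened; roughly:

```bash
# Sketch of the direct-exposure approach, with variable names as I recall
# them from the linked Neko docs -- verify there. Ports and the IP are placeholders.
docker run -d \
  -p 8080:8080 \
  -p 52000-52100:52000-52100/udp \
  -e NEKO_EPR=52000-52100 \
  -e NEKO_NAT1TO1=203.0.113.10 \
  m1k1o/neko:firefox
```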

theShiZa commented 10 months ago

Hi, thanks for getting back to me.

You wouldn't need to do that. The TURN server only relays WebRTC, and it's definitely not a VPN. This is just for the relay ports in situations where you can't open them, and Oracle's free tier seems to be generous with egress.

Ok that's good to know, I understand now.

This is another very good way, but you would need to port forward certain outside-facing ports, like you said, using DMZ or other methods.

I'm not familiar with methods other than DMZ, so I'll have to look into it.

If VAST happened to have CPU-only instances, or a way to provision a pod with only a small number of CPUs and a bit of RAM without GPUs, you wouldn't need a separate VM elsewhere, since the TURN server could be hosted that way. It would not be reasonable to allocate a GPU for a TURN server pod.

That said, perhaps there could be a non-TURN-server approach where a port can simply be exposed from the main container, like Neko does (such as https://neko.m1k1o.net/#/getting-started/configuration?id=webrtc), instead of requiring a TURN server.

Is Neko like a GLX Desktop alternative?

Thanks for your time & info.

ehfd commented 9 months ago

Hey. Neko has some containers, but at most it's an EGL desktop alternative for now. I am willing to integrate more of Neko's capabilities into Selkies, or even merge the two together in the future.

ehfd commented 8 months ago

https://github.com/orgs/ai-dock/repositories

This container is based on my container and includes a TURN server. Perhaps it may work better on RunPod and VAST.ai, since neither is my main platform.

ehfd commented 8 months ago

@robballantyne Speaking of which, are you interested in complementing each other? I don't use VAST or RunPod, nor am I interested in maintaining a huge lineup of container appliances, so we could specialize in different things.

robballantyne commented 8 months ago

@ehfd Absolutely. My desktop container only exists because of your work.

I add quite a lot of other things because the target platforms allow only one container per instance, so it's unusual but necessary.

I think your GStreamer interface is incredible, and I link back at every opportunity; I'll continue to do that as I add more appliances. I really appreciate what you've done.

ehfd commented 8 months ago

@robballantyne There's a Discord link on the top of README.md. If you aren't already there, we can talk there.