kasmtech / workspaces-issues


seeking to understand how to debug remote agent further - "No Agent slots available" / "No resources are available" #653

Open repudi8or opened 1 day ago

repudi8or commented 1 day ago

Existing Resources

Describe the bug Possibly a misleading message: when I try to start a workspace (Terminal), the UI returns "no resources are available".

To Reproduce Steps to reproduce the behavior:

  1. After a clean restart of both the manager and the agent (no existing sessions)
  2. In the Dashboard, go to 'Workspaces'
  3. Click on 'Terminal'
  4. See error "no resources are available"
  5. Go to the admin UI and Infrastructure => Docker Agents
  6. Click edit on the only enabled agent
  7. See plenty of free resources (see attached screenshot)

Expected behavior A terminal session starts in browser

Screenshots Screenshot 2024-11-15 at 12 53 07 PM.

Workspaces Version 1.16

Workspaces Installation Method The manager was installed as a single server (initial PoC); its local agent, which works fine, is currently disabled. The remote agent was installed following the Multi-Server agent install process, though in a slightly non-standard way (in DinD).

Client Browser (please complete the following information):

Workspace Server Information (please provide the output of the following commands):

Server:
 Containers: 10
  Running: 10
  Paused: 0
  Stopped: 0
 Images: 12
 Server Version: 27.3.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan kasmweb/sidecar:1.0 macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-1018-aws
 Operating System: Ubuntu 24.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 30.98GiB
 Name: ip-10-9-174-54
 ID: 6a945648-eec6-4895-a99c-2f5e6cc2e931
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled


 - `sudo docker ps | grep kasm`

5f0283d67025 kasmweb/proxy:1.16.0 "/docker-entrypoint.…" 6 days ago Up About an hour 80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp kasm_proxy
5254523b5551 kasmweb/rdp-https-gateway:1.16.0 "/opt/rdpgw/rdpgw" 6 days ago Up About an hour (healthy) kasm_rdp_https_gateway
3a6fb1d47210 kasmweb/share:1.16.0 "/bin/sh -c '/usr/bi…" 6 days ago Up About an hour (healthy) 8182/tcp kasm_share
058e021a51fe kasmweb/rdp-gateway:1.16.0 "/start.sh" 6 days ago Up About an hour (healthy) 0.0.0.0:3389->3389/tcp, :::3389->3389/tcp kasm_rdp_gateway
038f9c7feae9 kasmweb/agent:1.16.0 "/bin/sh -c '/usr/bi…" 6 days ago Up About an hour (healthy) 4444/tcp kasm_agent
c8c97df61c7a kasmweb/api:1.16.0 "/bin/sh -c '/usr/bi…" 6 days ago Up About an hour (healthy) 8080/tcp kasm_api
cee22c5e4624 kasmweb/kasm-guac:1.16.0 "/dockerentrypoint.sh" 6 days ago Up About an hour (healthy) kasm_guac
b6494307109e kasmweb/manager:1.16.0 "/usr/bin/startup.sh…" 6 days ago Up About an hour (healthy) 8181/tcp kasm_manager
2a39f69bcfa1 redis:5-alpine "docker-entrypoint.s…" 6 days ago Up About an hour 6379/tcp kasm_redis
9cb2f82702bf postgres:14-alpine "docker-entrypoint.s…" 6 days ago Up About an hour (healthy) 5432/tcp


**Additional context**
Several things of note. 
1. When I try starting a terminal in the UI, I see no traffic reaching the backend agent container via `docker logs -f`.
2. When I curl from the management host to the agent host on 443 with a dummy path, I DO see logging in the backend agent's docker logs,
    like `2024-11-14 23:35:00,818 [WARNING] tornado.access: 404 GET /test_path_to_agent (172.19.0.3) 0.31ms`
3. When I watch the docker logs of the kasmweb/api container on the management host and try to start a terminal session, I see:

2024-11-14 23:53:56,015 [DEBUG] client_api_server: Successfully authenticated request (wrapper_function) for user (admin@kasm.local) at (10.229.154.198)
2024-11-14 23:53:56,022 [DEBUG] client_api_server: License Check: Current Kasms (0) , License Limit (5) , Remaining (5)
2024-11-14 23:53:56,041 [DEBUG] client_api_server: Using group-level keepalive_expiration of (3600)
2024-11-14 23:53:56,041 [INFO] client_api_server: No existing containers with image (Terminal) exist to assign user (admin@kasm.local)
2024-11-14 23:53:56,042 [DEBUG] client_api_server: Function (provider_manager.assign_container) executed in (0.0067403316497802734) seconds
2024-11-14 23:53:56,043 [DEBUG] client_api_server: Getting existing available containers
2024-11-14 23:53:56,043 [DEBUG] client_api_server: Getting available slots for image: (kasmweb/terminal:1.16.0)
2024-11-14 23:53:56,050 [DEBUG] client_api_server: Processing Server: (f1682ddf-4a10-4e06-bc62-ab655f31d341)
2024-11-14 23:53:56,051 [DEBUG] client_api_server: Server (f1682ddf-4a10-4e06-bc62-ab655f31d341) can support this container. Adding slot
2024-11-14 23:53:56,051 [DEBUG] client_api_server: Server (f1682ddf-4a10-4e06-bc62-ab655f31d341) can support this container. Adding slot
2024-11-14 23:53:56,052 [DEBUG] client_api_server: Server (f1682ddf-4a10-4e06-bc62-ab655f31d341) can support this container. Adding slot
2024-11-14 23:53:56,052 [DEBUG] client_api_server: Server (f1682ddf-4a10-4e06-bc62-ab655f31d341) can support this container. Adding slot
2024-11-14 23:53:56,053 [DEBUG] client_api_server: Server (f1682ddf-4a10-4e06-bc62-ab655f31d341) can support this container. Adding slot
2024-11-14 23:53:56,053 [DEBUG] client_api_server: Server (f1682ddf-4a10-4e06-bc62-ab655f31d341) can support this container. Adding slot
2024-11-14 23:53:56,054 [DEBUG] client_api_server: Server (f1682ddf-4a10-4e06-bc62-ab655f31d341) can support this container. Adding slot
2024-11-14 23:53:56,055 [DEBUG] client_api_server: Server (f1682ddf-4a10-4e06-bc62-ab655f31d341) can support this container. Adding slot
2024-11-14 23:53:56,055 [DEBUG] client_api_server: Server f1682ddf-4a10-4e06-bc62-ab655f31d341 does not have required cores remaining (0.0 remaining)
2024-11-14 23:53:56,056 [DEBUG] client_api_server: No more slots available for server : (f1682ddf-4a10-4e06-bc62-ab655f31d341)
2024-11-14 23:53:56,056 [DEBUG] client_api_server: Function (provider_manager.get_available_slots) executed in (0.01248025894165039) seconds
2024-11-14 23:53:56,057 [DEBUG] client_api_server: Prioritized slots ([None, None, None, None, None, None, None, None])
2024-11-14 23:53:56,058 [DEBUG] client_api_server: Server limited prioritized slots ([None])
2024-11-14 23:53:56,058 [DEBUG] client_api_server: Function (provider_manager.prioritize_slots) executed in (0.0016200542449951172) seconds
2024-11-14 23:53:56,067 [INFO] client_api_server: User groups: [<data.model.Group object at 0x7478a80a2370>, <data.model.Group object at 0x7478a1bd1a30>]
2024-11-14 23:53:56,067 [DEBUG] client_api_server: User-based storage mappings defined but not allowed via group settings
2024-11-14 23:53:56,110 [DEBUG] client_api_server: Function (provider_manager.is_host_alive) executed in (5.7220458984375e-06) seconds
2024-11-14 23:53:56,110 [DEBUG] client_api_server: Function (provider_manager.get_container) executed in (0.06757664680480957) seconds
2024-11-14 23:53:56,111 [ERROR] client_api_server: No resources are available to create the requested Kasm. Please try again later or contact an Administrator : No Agent slots available. No Agent can be contacted with enough available resources to provision the image
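
To make the slot count in a log like the one above easier to read, here is a rough sketch that tallies the "Adding slot" entries per server and records which resource the log names as exhausted. It only pattern-matches the message text seen in this issue; the function names (`split_entries`, `summarize_slots`) and the shortened `SAMPLE` text are made up for illustration, and nothing here is part of Kasm.

```python
import re
from collections import Counter

# Each entry starts with a timestamp such as "2024-11-14 23:53:56,051 [".
ENTRY_START = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} \[)")
# Message shapes copied from the API log in this issue.
SLOT_ADDED = re.compile(
    r"Server \(([0-9a-f-]+)\) can support this container\. Adding slot"
)
LIMIT_HIT = re.compile(
    r"Server ([0-9a-f-]+) does not have required (\w+) remaining"
)


def split_entries(raw):
    """Split pasted log text into entries, even if line breaks were lost."""
    entries = []
    for chunk in ENTRY_START.split(raw):
        entry = " ".join(chunk.split())  # collapse stray newlines and spaces
        if entry:
            entries.append(entry)
    return entries


def summarize_slots(raw):
    """Return ({server_id: slots_added}, {server_id: limiting_resource})."""
    slots = Counter()
    limits = {}
    for entry in split_entries(raw):
        added = SLOT_ADDED.search(entry)
        if added:
            slots[added.group(1)] += 1
        limit = LIMIT_HIT.search(entry)
        if limit:
            limits[limit.group(1)] = limit.group(2)
    return dict(slots), limits


# Shortened, hypothetical excerpt in the same shape as the log above.
SAMPLE = (
    "2024-11-14 23:53:56,051 [DEBUG] client_api_server: Server (abc-123) "
    "can support this container. \nAdding slot "
    "2024-11-14 23:53:56,052 [DEBUG] client_api_server: Server (abc-123) "
    "can support this container. Adding slot "
    "2024-11-14 23:53:56,055 [DEBUG] client_api_server: Server abc-123 "
    "does not have required cores remaining (0.0 remaining)"
)

if __name__ == "__main__":
    print(summarize_slots(SAMPLE))
```

Run against the full log above, the same tally should report eight slots for the one server, with `cores` as the resource named in the "does not have required" line.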



Some Questions relating to the above logs
1. It seems like 8 slots are being created before the message `No more slots available for server : (f1682ddf-4a10-4e06-bc62-ab655f31d341)`. Is this correct?
2. If there are 8 `slots`, are these only abstractions on the management host side of how many sessions can be created on the agent side before resources are exhausted?
3. How can I drill more deeply into the error about `no resources available` to understand what, specifically, Kasm thinks is required on the agent side but not available, when the screenshot (supported by validation on the agent side) indicates plenty of resources?

mmcclaskey commented 1 day ago

The following line is misleading:

Server f1682ddf-4a10-4e06-bc62-ab655f31d341 does not have required cores remaining (0.0 remaining)

It just means that the provisioning loop stopped making 'slots' on that server because it has no more cores to make empty slots from. It is meant to help you identify what was the limiting factor on the server for creating empty slots. It could have also stopped due to RAM or GPUs.
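
As a purely illustrative sketch of that kind of loop (not Kasm's actual code), the snippet below keeps carving slots out of an agent's remaining resources until one of them runs out, then reports which one was the limiting factor. The `Resources` class, the `build_slots` function, and all the numbers are assumptions chosen so that a 16-core agent yields 8 slots, as in the log.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Hypothetical resource bundle: what an agent has or an image needs."""
    cores: float
    memory_mb: int
    gpus: int = 0


def build_slots(agent, image):
    """Count empty slots that fit on the agent and name the limiting resource.

    Returns (slot_count, limiting_factor), where limiting_factor is one of
    "cores", "memory", or "gpus".
    """
    if image.cores <= 0:
        raise ValueError("image must require a positive number of cores")

    # Work on a copy so the caller's agent description is left untouched.
    remaining = Resources(agent.cores, agent.memory_mb, agent.gpus)
    slots = 0
    while True:
        if remaining.cores < image.cores:
            return slots, "cores"
        if remaining.memory_mb < image.memory_mb:
            return slots, "memory"
        if remaining.gpus < image.gpus:
            return slots, "gpus"
        # The image fits: reserve its resources and record one more slot.
        remaining.cores -= image.cores
        remaining.memory_mb -= image.memory_mb
        remaining.gpus -= image.gpus
        slots += 1


if __name__ == "__main__":
    # Assumed figures: a 16-core agent and an image needing 2 cores each.
    agent = Resources(cores=16.0, memory_mb=31723)
    image = Resources(cores=2.0, memory_mb=2768)
    print(build_slots(agent, image))
```

In this sketch, reaching "0.0 cores remaining" is the normal way for the loop to end once every slot has been created, which is why that log line on its own does not indicate a fault.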

Something is wrong on the following two lines; there should be values other than None.

2024-11-14 23:53:56,057 [DEBUG] client_api_server: Prioritized slots ([None, None, None, None, None, None, None, None])
2024-11-14 23:53:56,058 [DEBUG] client_api_server: Server limited prioritized slots ([None])

I suspect your agent does not belong to a Zone. Can you validate that your agent belongs to a zone and if it does, what is the Zone name?
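
If it helps when re-testing, the check on the two "prioritized slots" lines above can be scripted. This is a small sketch that only inspects the log text in the shape shown in this issue; the helper name `slots_all_none` is made up, and how a healthy slot is printed is an assumption (anything other than the literal `None` is treated as a real slot).

```python
import re

# Matches both "Prioritized slots ([...])" and
# "Server limited prioritized slots ([...])".
SLOTS_LINE = re.compile(r"[Pp]rioritized slots \(\[(.*)\]\)")


def slots_all_none(line):
    """Return True if the line lists slots and every one is None.

    Returns False if at least one listed slot is not None (or the list is
    empty), and None if the line is not a prioritized-slots line at all.
    """
    match = SLOTS_LINE.search(line)
    if match is None:
        return None
    items = [item.strip() for item in match.group(1).split(",")]
    items = [item for item in items if item]
    return bool(items) and all(item == "None" for item in items)


if __name__ == "__main__":
    lines = [
        "2024-11-14 23:53:56,057 [DEBUG] client_api_server: Prioritized slots "
        "([None, None, None, None, None, None, None, None])",
        "2024-11-14 23:53:56,058 [DEBUG] client_api_server: Server limited "
        "prioritized slots ([None])",
    ]
    for line in lines:
        print(slots_all_none(line))
```

A result of `True` for both lines matches the broken state described here; after fixing the Zone assignment, the same lines should no longer be all `None`.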