ultravioletrs / cocos

Cocos AI - Confidential Computing System for AI
https://ultraviolet.rs/cocos.html
Apache License 2.0

Feature: Connect Agent directly to NATS (or MQTT) broker #267

Open drasko opened 5 hours ago

drasko commented 5 hours ago

Is your feature request related to a problem? Please describe.

No

Describe the feature you are requesting, as well as the possible use case(s) for it.

Currently, the Agent sends logs over vsock, and the Manager multiplexes the messages from multiple Agents and forwards them to the Computations service.

However, this solution is not robust, especially when one of the VMs goes down and is restarted (not to mention possible saturation of the multiplexed logs).

It would be better if the Manager and Agent were independent (stateless) and not connected via vsock. Each Agent could open a NATS connection to a broker running in the cloud (or on a private machine) and send its logs there. An additional advantage is that the Computations service and the UI would get real-time updates from Agents that are easy to consume (via a NATS topic). The Agent can send a heartbeat so that Computations knows that the VM is running and connected.
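
A minimal sketch of how the Computations side could turn heartbeats into a "disconnected" list. This is only an illustration: the subject layout `computations.<id>.heartbeat`, the `HeartbeatTracker` name, and the timeout are assumptions, and the NATS subscription itself is left out (the subscriber callback would simply call `record`).

```python
import time
from typing import Callable, Dict, List


def heartbeat_subject(computation_id: str) -> str:
    # Hypothetical subject layout; the proposal does not fix one.
    return f"computations.{computation_id}.heartbeat"


class HeartbeatTracker:
    """Remembers when each computation's Agent was last heard from."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_seen: Dict[str, float] = {}

    def record(self, computation_id: str) -> None:
        # Call this from the heartbeat subscriber for every message received.
        self.last_seen[computation_id] = self.clock()

    def disconnected(self) -> List[str]:
        # Computations whose last heartbeat is older than the timeout.
        now = self.clock()
        return sorted(cid for cid, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)
```

The UI could poll `disconnected()` and flag those computations, leaving the restart decision to the user.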

For this, all the Manager needs to pass to the Agent is the JWT that will be used to connect to NATS. We can pass it via the guest Linux kernel command line in Qemu, as this is included in the measurement. Eventually we will also need the ComputationID, as we can use it for the topic to which the Agent subscribes to get the other params for this computation: manifest, certificates, etc.
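
A rough sketch of how the Agent could pick these values up inside the guest. The `cocos.` prefix and the key names `jwt` and `computation_id` are made up for illustration; only `/proc/cmdline` is a real Linux interface.

```python
from typing import Dict

PREFIX = "cocos."  # hypothetical namespace for Agent parameters


def parse_cmdline(cmdline: str, prefix: str = PREFIX) -> Dict[str, str]:
    """Extract `<prefix>key=value` tokens from a kernel command line."""
    params: Dict[str, str] = {}
    for token in cmdline.split():
        if not token.startswith(prefix) or "=" not in token:
            continue  # ordinary kernel parameter, not ours
        key, value = token[len(prefix):].split("=", 1)
        params[key] = value
    return params


def read_agent_params(path: str = "/proc/cmdline") -> Dict[str, str]:
    # Linux exposes the command line the kernel was booted with here.
    with open(path) as f:
        return parse_cmdline(f.read())
```

A JWT contains no whitespace, so it survives the whitespace-separated command line as a single token.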

That way the Manager and Agent will be completely independent, and the Manager can continue to monitor VMs using only the external Qemu interfaces (it is important that the Manager gives each VM a name that corresponds to its ComputationID). We would remove the complexity of vsock and of the internal Manager DB used for reconnections. Thanks to the heartbeat, the UI will have all the information it needs when one of the Agents is disconnected.

If one of the computations fails (VM down, Agent disconnected), we will not reboot automatically - we will let the user see this and then restart the computation by hand if needed.

Additionally - as a simplification - we can remove gRPC completely and have the Manager connect to NATS as well, which would further simplify the implementation by unifying the technology.

As a side note (not closely related to this topic, but good to have) - the Manager can also send regular requests to the Computations service to find out which VMs are running (active), and kill all other VMs (just in case some VM is hanging).
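
The cleanup in this side note boils down to a set difference, assuming VM names equal ComputationIDs. A sketch (the function name is hypothetical, and how the two lists are obtained is left out):

```python
from typing import Iterable, Set


def stale_vms(running_vm_names: Iterable[str],
              active_computation_ids: Iterable[str]) -> Set[str]:
    """VMs that are running locally but are not known as active to Computations."""
    return set(running_vm_names) - set(active_computation_ids)
```

The Manager would then stop each VM in the returned set through its usual Qemu control path.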

@dborovcanin @danko-miladinovic @SammyOina please tell me what you think about this improvement - this is a subject for discussion, so that we can see how to simplify and improve our implementation. I think that NATS connections would be easier to proxy and easier to load-balance (NATS has LB built in) than gRPC.

The question will be how to do Authentication (so that only the Agent can connect) and Authorization (so that the Agent with a given ComputationID can connect only to that topic) - for that we'll have to research the NATS methods: https://docs.nats.io/running-a-nats-service/configuration/securing_nats/auth_intro#authentication-methods - or maybe we can use an MQTT broker instead of NATS, as we already have auth implemented for the MQTT protocol (we just need to make sure that load-balancing is enabled in VerneMQ with MQTT v5.0).
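
To make the authorization question concrete, here is a small sketch of per-computation subject permissions. The `publish`/`subscribe` allow-lists mirror the shape NATS uses for user permissions, and the matcher follows NATS wildcard rules (`*` matches exactly one token, `>` matches one or more trailing tokens). The subject layout `computations.<id>.>` is an assumption, and a real broker would enforce this itself; the code only illustrates the intended rule.

```python
from typing import Dict, List

Permissions = Dict[str, Dict[str, List[str]]]


def agent_permissions(computation_id: str) -> Permissions:
    # Hypothetical layout: everything for one computation lives under one prefix.
    scope = f"computations.{computation_id}.>"
    return {"publish": {"allow": [scope]}, "subscribe": {"allow": [scope]}}


def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style match: '*' is one token, '>' is one or more trailing tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one token left.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


def is_allowed(perms: Permissions, action: str, subject: str) -> bool:
    # action is "publish" or "subscribe"
    return any(subject_matches(p, subject) for p in perms[action]["allow"])
```

With this layout, the Agent for computation `42` could publish to `computations.42.logs` but not to `computations.43.logs`.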

Indicate the importance of this feature to you.

Must-have

Anything else?

No response

drasko commented 5 hours ago

One more remark - we will also need the NATS IP in the kernel cmdline.
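
On the Manager side this could look roughly as follows. The `cocos.*` parameter names and the helper names are placeholders; `-append` is the Qemu option that sets the kernel command line when booting with `-kernel`.

```python
from typing import Dict, List


def build_kernel_cmdline(base: str, params: Dict[str, str],
                         prefix: str = "cocos.") -> str:
    """Append `<prefix>key=value` tokens to a base kernel command line."""
    tokens = [base] if base else []
    for key, value in params.items():
        if not value or any(ch.isspace() for ch in value):
            # Whitespace would split the value into several kernel parameters.
            raise ValueError(f"invalid value for {key!r}")
        tokens.append(f"{prefix}{key}={value}")
    return " ".join(tokens)


def qemu_append_args(cmdline: str) -> List[str]:
    # Passed alongside the -kernel argument the Manager already uses.
    return ["-append", cmdline]
```

For example, `build_kernel_cmdline("console=ttyS0", {"jwt": token, "computation_id": cid, "nats_url": url})` yields a single string that ends up in the measurement together with the rest of the command line.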

And while we are at it - should we also put the manifest in the kernel command line, in order to have it in the measurement? I think this is not needed. The manifest is something that the Agent should receive over NATS (from Computations), and the Agent can then embed the received manifest in every Attestation Report - although it is not a measurement, it is additional metadata that the read-only and immutable Agent in the CVM adds to the attestation. That way the client can also see which manifest this computation is running with, and users can be confident because the Agent is open-source, they have inspected the code, and the hash of this Agent was in the measurement.
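
One possible way to do the embedding, sketched under assumptions: if the attestation report has a fixed-size user-data field (64 bytes in the case of the AMD SEV-SNP `REPORT_DATA` field), the Agent could place a digest of the manifest there rather than the manifest itself. The canonical-JSON encoding and the optional nonce are illustrative choices, not something decided in this issue.

```python
import hashlib
import json

REPORT_DATA_LEN = 64  # assumed size of the report's user-data field


def manifest_report_data(manifest: dict, nonce: bytes = b"") -> bytes:
    """Return 64 bytes binding the manifest (and an optional nonce) to a report."""
    # Canonical encoding so that the same manifest always hashes the same way.
    canonical = json.dumps(manifest, sort_keys=True,
                           separators=(",", ":")).encode("utf-8")
    h = hashlib.sha512()
    h.update(len(nonce).to_bytes(4, "big"))  # length prefix keeps the fields unambiguous
    h.update(nonce)
    h.update(canonical)
    digest = h.digest()
    assert len(digest) == REPORT_DATA_LEN
    return digest
```

A client holding the manifest can recompute the same digest and compare it with the value in the report.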