ultravioletrs / cocos

Cocos AI - Confidential Computing System for AI
https://ultraviolet.rs/cocos.html
Apache License 2.0

Bug: Qemu fails - "Number of hotpluggable cpus" #217

Closed FilipCivljak closed 2 months ago

FilipCivljak commented 2 months ago

What were you trying to achieve?

To run a computation for Fraud Detection inference.

What are the expected results?

Training passes normally.

What are the received results?

Agent logs report the error (in Prism UI):

{"time":"2024-08-26T15:35:57.245777686Z","level":"ERROR","msg":"qemu-system-x86_64: warning: Number of hotpluggable cpus requested (64) exceeds the recommended cpus supported by KVM (24)\n"}

Manager start command and logs:

sudo MANAGER_GRPC_URL=164.90.178.85:7011 MANAGER_LOG_LEVEL=debug MANAGER_QEMU_USE_SUDO=true  MANAGER_QEMU_ENABLE_SEV=false MANAGER_QEMU_SEV_CBITPOS=51 MANAGER_QEMU_OVMF_CODE_FILE=/usr/share/OVMF/OVMF_CODE.fd MANAGER_QEMU_OVMF_VARS_FILE=/usr/share/OVMF/OVMF_VARS.fd MANAGER_QEMU_ENABLE_SEV_SNP=false MANAGER_GRPC_CLIENT_CERT=/home/ciki/cocos/certificates/cert.pem MANAGER_GRPC_CLIENT_KEY=/home/ciki/cocos/certificates/key.pem MANAGER_GRPC_SERVER_CA_CERTS=/home/ciki/cocos/certificates/ca.pem MANAGER_QEMU_MEMORY_SIZE=25G MANAGER_QEMU_HOST_FWD_RANGE=6100-6200 go run main.go
{"time":"2024-08-26T15:35:54.95482303Z","level":"INFO","msg":"-enable-kvm -machine q35 -cpu EPYC -smp 4,maxcpus=64 -m 25G,slots=5,maxmem=30G -drive if=pflash,format=raw,unit=0,file=/usr/share/OVMF/OVMF_CODE.fd,readonly=on -drive if=pflash,format=raw,unit=1,file=/usr/share/OVMF/OVMF_VARS.fd -netdev user,id=vmnic,hostfwd=tcp::7020-:7002 -device virtio-net-pci,disable-legacy=on,iommu_platform=true,netdev=vmnic,addr=0x2,romfile= -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=3 -vnc :0 -kernel img/bzImage -append \"earlyprintk=serial console=ttyS0\" -initrd img/rootfs.cpio.gz -nographic -monitor pty"}
{"time":"2024-08-26T15:50:47.248050276Z","level":"WARN","msg":"Method Run for computation took 14m50.06516489s to complete with error: dial vsock vm(3):9999: connect: connection reset by peer."}
{"time":"2024-08-26T15:50:47.248174543Z","level":"ERROR","msg":"manager service terminated: dial vsock vm(3):9999: connect: connection reset by peer"}
{"time":"2024-08-26T15:50:47.248338826Z","level":"ERROR","msg":"Error shutting down tracer provider: context canceled"}

Steps To Reproduce

No response

In what environment did you encounter the issue?

Dell AMD SEV-SNP machine

Additional information you deem important

No response

drasko commented 2 months ago

As suggested by @SammyOina:

https://github.com/ultravioletrs/cocos/tree/main/manager#configuration

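Set MANAGER_QEMU_SMP_MAXCPUS to lower the maxcpus value that the manager passes to QEMU. The manager log shows `-smp 4,maxcpus=64`, while the QEMU warning reports that KVM recommends at most 24 CPUs on this host.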
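
For anyone hitting the same warning, here is a minimal sketch of how a suitable value for `MANAGER_QEMU_SMP_MAXCPUS` could be chosen. It is not code from the cocos repository: `clampMaxCPUs` is a hypothetical helper, and using the host's logical CPU count as a stand-in for the KVM-recommended limit is an assumption (on the machine in this issue the limit reported by QEMU was 24). The values 64 and 4 are taken from the `-smp 4,maxcpus=64` arguments in the manager log.

```go
package main

import (
	"fmt"
	"runtime"
)

// clampMaxCPUs is a hypothetical helper (not part of cocos). It lowers the
// requested maxcpus to limit, but never below smp, because maxcpus must be
// at least the number of CPUs the guest boots with. If limit is smaller
// than smp, the result is smp and the warning would still appear.
func clampMaxCPUs(requested, smp, limit int) int {
	if requested > limit {
		requested = limit
	}
	if requested < smp {
		requested = smp
	}
	return requested
}

func main() {
	const requested = 64 // maxcpus seen in the manager log
	const smp = 4        // boot CPU count seen in the manager log

	// Assumption: the host's logical CPU count approximates the limit
	// that KVM recommends; check the QEMU warning for the real number.
	limit := runtime.NumCPU()

	fmt.Printf("MANAGER_QEMU_SMP_MAXCPUS=%d\n", clampMaxCPUs(requested, smp, limit))
}
```

The printed assignment can be added to the environment variables of the manager start command shown in the issue description. If the value QEMU reports differs from the host CPU count, prefer the number from the warning.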