stakwork / sphinx-swarm

lightning container orchestration for massive deployments
ops issues #290

Open Evanfeenstra opened 1 month ago

Evanfeenstra commented 1 month ago

crashing EC2

IP addresses changing

better logs in swarm UI

Evanfeenstra commented 1 month ago

superadmin

tomsmith8 commented 1 month ago

@Evanfeenstra could you prioritise setting Docker and container limits?

@kevkevinpal could you prioritise migrating the BTC graph, updating the GitHub Actions pipeline, and deprecating the non-swarm EC2 instances?

Next up would then be setting up CloudWatch?

Evanfeenstra commented 1 month ago

Just merged a per-container memory limit: set it once and it applies to every container.

https://github.com/stakwork/sphinx-swarm/commit/84ab2259b96e11dfc8e899639b518e53d012489c

It's global_mem_limit in the YAML config file; it's a number in bytes.
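
Since the value is a raw byte count, it is easy to get the number of zeros wrong. A minimal sketch of a converter from a human-readable size to bytes follows. The helper names `parse_mem_limit` and `yaml_line` are hypothetical and not part of sphinx-swarm, and the exact placement of the `global_mem_limit` key inside the config file is an assumption.

```rust
/// Binary units: 1 MiB = 1024 * 1024 bytes, 1 GiB = 1024 MiB.
const MIB: u64 = 1024 * 1024;
const GIB: u64 = 1024 * MIB;

/// Parse a size such as "512m", "4g", or a plain byte count into bytes.
/// Hypothetical helper for illustration; not part of sphinx-swarm.
fn parse_mem_limit(s: &str) -> Option<u64> {
    let s = s.trim().to_ascii_lowercase();
    // Split off a trailing unit suffix, if any.
    let (digits, unit) = match s.chars().last()? {
        'g' => (&s[..s.len() - 1], GIB),
        'm' => (&s[..s.len() - 1], MIB),
        c if c.is_ascii_digit() => (s.as_str(), 1),
        _ => return None,
    };
    // Reject non-numeric input and guard against overflow.
    digits.parse::<u64>().ok()?.checked_mul(unit)
}

/// Render the setting as a YAML key/value line.
/// Assumes a plain `key: value` form; check the real config layout.
fn yaml_line(bytes: u64) -> String {
    format!("global_mem_limit: {}", bytes)
}
```

For example, `parse_mem_limit("4g").map(yaml_line)` yields `global_mem_limit: 4294967296`, i.e. 4 GiB expressed in bytes.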

Evanfeenstra commented 1 month ago

@tobi-bams here's a new SetGlobalMemLimit cmd, maybe u can add a frontend for it? https://github.com/stakwork/sphinx-swarm/blob/master/src/cmd.rs#L152

tobi-bams commented 1 month ago

Yea, sure I can.

Evanfeenstra commented 1 month ago

log rotation: https://github.com/stakwork/sphinx-swarm/releases/tag/v0.4.98

tomsmith8 commented 1 month ago

Update all swarms to m5.large or higher.

Do not use T-family instances: their CPU credit model means spikes cause machines to become unavailable.

tomsmith8 commented 1 month ago

@Evanfeenstra any updates on keeping logs?