stakwork / sphinx-swarm

lightning container orchestration for massive deployments

ops issues #290

Open Evanfeenstra opened 2 months ago

Evanfeenstra commented 2 months ago

crashing EC2

IP addresses changing

better logs in swarm UI

Evanfeenstra commented 2 months ago

superadmin

tomsmith8 commented 2 months ago

@Evanfeenstra could you prioritise setting Docker and container limits?

@kevkevinpal could you prioritise migrating the BTC graph, updating the GitHub Actions pipeline, and deprecating the non-swarm EC2 instances?

Next up would then be setting up CloudWatch?

Evanfeenstra commented 2 months ago

Just merged a per-container memory limit: set it once and it applies to every container.

https://github.com/stakwork/sphinx-swarm/commit/84ab2259b96e11dfc8e899639b518e53d012489c

It's global_mem_limit in the YAML config file, and the value is a number in bytes.
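Since the limit is a raw byte count, it is easy to be off by a factor of 1024 when writing it by hand. Here is a minimal sketch for computing the value, assuming binary units (1 GiB = 1024^3 bytes). The helper name `to_bytes` is hypothetical and not part of sphinx-swarm, and where exactly the key sits in the config file is not shown here.

```python
# Hypothetical helper (not part of sphinx-swarm): converts a human-readable
# size into the raw byte count that global_mem_limit expects.
# Assumption: binary units, i.e. 1 GiB = 1024**3 bytes.
UNITS = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def to_bytes(amount: float, unit: str) -> int:
    """Return `amount` of `unit` as a whole number of bytes."""
    if unit not in UNITS:
        raise ValueError(f"unknown unit {unit!r}, expected one of {sorted(UNITS)}")
    if amount < 0:
        raise ValueError("memory limit cannot be negative")
    return int(amount * UNITS[unit])


if __name__ == "__main__":
    # Example: a 2 GiB limit per container.
    limit = to_bytes(2, "GiB")
    # Prints the line as it might appear in the YAML config.
    print(f"global_mem_limit: {limit}")
```

For a 2 GiB limit this gives 2147483648, which is the number that would go in the config.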

Evanfeenstra commented 2 months ago

@tobi-bams here's a new SetGlobalMemLimit command, maybe you can add a frontend for it? https://github.com/stakwork/sphinx-swarm/blob/master/src/cmd.rs#L152

tobi-bams commented 2 months ago

> @tobi-bams here's a new SetGlobalMemLimit cmd, maybe u can add a frontend for it? https://github.com/stakwork/sphinx-swarm/blob/master/src/cmd.rs#L152

Yea, sure I can.

Evanfeenstra commented 2 months ago

log rotation: https://github.com/stakwork/sphinx-swarm/releases/tag/v0.4.98

tomsmith8 commented 2 months ago

Update all swarms to m5.large or higher.

Do not use T-family (burstable) instances: CPU credit exhaustion during spikes causes machines to become unavailable.
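A quick way to audit a list of swarm hosts against this rule is to flag any burstable instance type. This is a rough sketch, assuming instance types are available as plain strings like "t3.medium"; the function name `is_burstable` is made up for illustration. It only checks for the T family and does not try to decide what counts as "m5.large or higher".

```python
import re

# AWS burstable families (t2, t3, t3a, t4g) all start with "t" followed by a
# digit. Requiring the digit avoids false positives such as "trn1".
_BURSTABLE_FAMILY = re.compile(r"^t\d")


def is_burstable(instance_type: str) -> bool:
    """Return True if the instance type belongs to a burstable T family."""
    family = instance_type.split(".", 1)[0].lower()
    return bool(_BURSTABLE_FAMILY.match(family))


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    for itype in ["t3.medium", "m5.large", "t4g.small", "m5.xlarge"]:
        status = "replace" if is_burstable(itype) else "ok"
        print(f"{itype}: {status}")
```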

tomsmith8 commented 2 months ago

@Evanfeenstra any updates on keeping logs?