hammerlab / secotrec

Setup Coclobas/Ketrew Clusters
Apache License 2.0

And the crazy idea about even better scaling and becoming fault tolerant #73

Open armish opened 7 years ago

armish commented 7 years ago

For the project I am currently working on, we have ~90 patients, and the processing had to happen as fast as possible. A majority of the patients had to go through the complete pipeline (all :disco: features on), resulting in massive JSONs being passed around and computationally challenging equivalence calculations on the kserver side, which eventually slowed it down. And once it starts slowing down, it all goes downhill from there: the less it can handle, the more jobs queue up (this time risking HDD/MEM-related issues).

My solution was to spin up 5 (5!) separate secos and manually share the load across all of them. It worked nicely, but then you end up with 5 different clusters to check, tens of NFSes to maintain and transfer files from/to, etc. Which makes me think that, instead of having seco/ketrew/psql on the head node, maybe we should run them as services within the node pool and replicate them to match the scale of jobs they have to deal with.

And since the ketrews will now be inside the cluster, we can just get a simple NGINX-like thing up and running and make it responsible for sharing the load (our kclients will submit to this balancer thinking that they are talking to a ketrew server), and inside, each ketrew/seco setup will receive fewer and therefore more manageable tasks.
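To make the balancer idea a bit more concrete, here is a rough Python sketch of the dispatch logic only. It is not how Ketrew or NGINX actually work: `Backend`, `Balancer`, the `ketrew-N` names, and the "fewest pending jobs" policy are all made up for illustration, and a real setup would proxy HTTP requests rather than count jobs in memory.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    """One hypothetical ketrew/seco instance behind the balancer."""
    name: str
    pending: int = 0  # jobs submitted to this instance and not yet finished


@dataclass
class Balancer:
    """Hypothetical front end that kclients would submit to."""
    backends: List[Backend]

    def submit(self, job_id: str) -> str:
        # Send the job to the instance with the fewest pending jobs.
        # On ties, min() returns the first such instance in the list.
        target = min(self.backends, key=lambda b: b.pending)
        target.pending += 1
        return target.name

    def finished(self, backend_name: str) -> None:
        # Called when an instance reports that one of its jobs is done.
        for backend in self.backends:
            if backend.name == backend_name:
                backend.pending = max(0, backend.pending - 1)
                return
        raise KeyError(backend_name)


# Assumed numbers from the scenario above: 5 instances, ~90 patients.
balancer = Balancer([Backend(f"ketrew-{i}") for i in range(5)])
assignments = [balancer.submit(f"patient-{n}") for n in range(90)]
```

With equal-cost jobs this degenerates to round-robin; the pending count only starts to matter once some instances finish their jobs faster than others.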

If we ever go down such a path, we can even come up with a mock-ketrew server that pools results from the other instances inside the network and restarts the failed tasks a few times until it starts believing that there is an important issue with that particular job. And ideally, it will then pull the information up to its own level for us to investigate.
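The retry-then-escalate behavior could look roughly like the sketch below. Everything here is an assumption for illustration: `supervise`, `run_job`, and the limit of 3 attempts are hypothetical names and values, and `fake_run` only simulates jobs instead of talking to any real ketrew instance.

```python
from typing import Callable, Dict, Iterable, List, Tuple

MAX_ATTEMPTS = 3  # assumed value for "a few times"


def supervise(
    job_ids: Iterable[str],
    run_job: Callable[[str], bool],
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[List[str], List[str]]:
    """Run each job, retrying on failure; return (succeeded, escalated)."""
    succeeded: List[str] = []
    escalated: List[str] = []
    for job_id in job_ids:
        for _attempt in range(max_attempts):
            if run_job(job_id):
                succeeded.append(job_id)
                break
        else:
            # Every attempt failed: stop retrying and surface the job
            # for a human to investigate.
            escalated.append(job_id)
    return succeeded, escalated


# Simulated jobs: "patient-b" fails twice before succeeding,
# "patient-c" never succeeds.
attempts: Dict[str, int] = {}


def fake_run(job_id: str) -> bool:
    attempts[job_id] = attempts.get(job_id, 0) + 1
    if job_id == "patient-c":
        return False
    if job_id == "patient-b":
        return attempts[job_id] >= 3
    return True


succeeded, escalated = supervise(["patient-a", "patient-b", "patient-c"], fake_run)
```

Here `patient-b` recovers on its third attempt, while `patient-c` exhausts its retries and ends up in `escalated`, which is the list the mock server would expose at its own level.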

From what I understand from people's writings online and your status report from a few months ago, I don't think adopting a Swarm-like new and shiny technology would help; but who knows, maybe we can use this opportunity to write custom mirages and deal with these issues on our own :P

Hope that I am not reinventing the wheel here, but if so, let me know. I would love to read more on this topic.