ibmcb / cbtool

Cloud Rapid Experimentation and Analysis Toolkit
Apache License 2.0

Refresh the Spark workload (and other repairs) #436

Closed mraygalaxy closed 1 month ago

mraygalaxy commented 1 month ago
  1. The Spark workload was quite old, and spark-bench is no longer maintained. Rather than removing it, I focused on GATK: we updated all of the software, including GATK itself, and fixed all of the required datasets.

  2. We introduced a new "medium" sized GATK workload (18GB), between the small (70MB) and the large (150GB). This new medium size is a nice sweet spot.

  3. We made Spark scale better vertically by making the worker node settings grow with the size of the VMs.

  4. Python 2 was still being used in various places, so we fixed that.

  5. The base container had still not been updated to Ubuntu 22... fixed that.

  6. The Linode driver had a bug where VM hostnames were not being set. Fixed that.

  7. The python-daemon package had to be updated because the Orchestrator Dockerfile was not working... fixed that too.

  8. We also updated the Spark workload to use LOAD_LEVEL, which allows you to run multiple jobs simultaneously while also calculating the resources required to do so (horizontal scalability).
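To illustrate the scaling ideas in items 3 and 8: below is a minimal sketch (not cbtool's actual code; the function name and reserve amounts are hypothetical) of how worker settings can grow with VM size, and how a load level multiplies the aggregate resources needed to run that many jobs at once.

```python
# Illustrative sketch only -- not cbtool's implementation.
# Vertical scaling: Spark worker cores/memory are derived from the
# VM size rather than hard-coded, reserving ~1 core and ~1 GB for
# the OS and daemons (reserve amounts are an assumption here).
# Horizontal scaling: a load level of N concurrent jobs multiplies
# the aggregate resource requirement across the cluster.

def spark_sizing(vm_vcpus: int, vm_mem_gb: int, load_level: int = 1) -> dict:
    worker_cores = max(1, vm_vcpus - 1)     # grow with the VM
    worker_mem_gb = max(1, vm_mem_gb - 1)   # grow with the VM
    return {
        "worker_cores": worker_cores,
        "worker_mem_gb": worker_mem_gb,
        # resources needed to run load_level jobs simultaneously
        "cluster_cores_needed": worker_cores * load_level,
        "cluster_mem_gb_needed": worker_mem_gb * load_level,
    }

print(spark_sizing(8, 32, load_level=3))
```

With an 8-vCPU/32GB VM and a load level of 3, the sketch reserves one core and 1GB per worker and reports the cluster-wide totals for three simultaneous jobs.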

mraygalaxy commented 1 month ago

@maugustosilva Can you take a look?

maugustosilva commented 1 month ago

A very much needed update and fixup.

mraygalaxy commented 1 month ago

Thank you!