iterative / terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes
https://registry.terraform.io/providers/iterative/iterative/latest/docs
Apache License 2.0

parallel: id & examples #585

Open casperdcl opened 2 years ago

casperdcl commented 2 years ago
  1. Expose task index via environment variables, similar to:
  2. Add a minimal working example to the docs using parallelism = 8, script = "... some_conditional_fork_and_join_code($TPI_PARALLEL_INDEX, $TPI_PARALLEL_TOTAL) ..."

0x2b3bfa0 commented 2 years ago

NODE_INDEX & NODE_TOTAL — front-end versus back-end

[Image: front-end-vs-back-end-1]

0x2b3bfa0 commented 2 years ago

Note that running different code on each instance is not easy: determining the node index requires a few orchestrator building blocks.

casperdcl commented 2 years ago

idk what you mean by different code. I'm talking about same code, different logic-branch owing to different env vars.

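# script
import math
import os

# environment variables are strings, so convert them before doing arithmetic
index = int(os.environ.get('TPI_PARALLEL_INDEX', 0))
total = int(os.environ.get('TPI_PARALLEL_TOTAL', 1))

tasks = 1337
batch_size = math.ceil(tasks / total)
# each instance handles one contiguous batch, capped at the total task count
for step in range(index * batch_size, min((index + 1) * batch_size, tasks)):
    do_work(step)  # placeholder for the actual per-step workload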
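
For illustration, here is a sketch of the same contiguous-batch split wrapped in a helper, with the upper bound capped at the task count so the last instance does not run past the end. The name `batch_range` is hypothetical, and `TPI_PARALLEL_INDEX` / `TPI_PARALLEL_TOTAL` are the variables proposed in this issue, not something the provider is known to export.

```python
import math


def batch_range(index, total, tasks):
    """Return the steps assigned to instance `index` out of `total` instances."""
    batch_size = math.ceil(tasks / total)
    start = index * batch_size
    # Cap the end at `tasks`; trailing instances may get an empty range.
    stop = min(start + batch_size, tasks)
    return range(start, stop)


# Sanity check: with 8 instances, every one of the 1337 steps is covered once.
covered = [step for i in range(8) for step in batch_range(i, 8, 1337)]
```

With this split, the work each instance does depends only on its index and the total, which is the "same code, different logic-branch" idea described above.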
0x2b3bfa0 commented 2 years ago

I'm talking about same code, different logic-branch

Also known as “different code” or, in other words, function parallelism.

0x2b3bfa0 commented 2 years ago

PARALLEL_TOTAL is the same as parallelism and is straightforward to implement.

PARALLEL_INDEX is not straightforward to implement: it requires synchronization to avoid having several machines with the same index.
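
To make the synchronization requirement concrete, here is a minimal in-process sketch: threads stand in for instances, and a lock-protected counter stands in for whatever shared store would hand out indices. `IndexAllocator` and `simulate_boot` are hypothetical names; nothing here reflects how the provider works.

```python
import threading


class IndexAllocator:
    """Hypothetical stand-in for a shared store with an atomic increment."""

    def __init__(self, total):
        self.total = total
        self._next = 0
        self._lock = threading.Lock()

    def claim(self):
        # The lock makes read-then-increment atomic, so no two callers
        # can be handed the same index.
        with self._lock:
            if self._next >= self.total:
                raise RuntimeError("all indices already claimed")
            index = self._next
            self._next += 1
            return index


def simulate_boot(total):
    """Start `total` threads, each acting as one instance claiming an index."""
    allocator = IndexAllocator(total)
    claimed = []
    claimed_lock = threading.Lock()

    def boot():
        index = allocator.claim()
        with claimed_lock:
            claimed.append(index)

    threads = [threading.Thread(target=boot) for _ in range(total)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(claimed)
```

Across real machines the lock would have to be replaced by some externally coordinated atomic operation, which is the part that is not straightforward.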

If you add this to “in progress”, expect me to spend a couple weeks doing what we're supposed to do two quarters from now; i.e. determine whether to reinvent the orchestrator[^1] or not and, if advisable, reinvent it.

[^1]: It always begins with Raft & Serf, and then you feel the need to add a command-line tool, some extra supporting services... and you have an orchestrator, identical to the existing ones, but admittedly less elegant.

casperdcl commented 2 years ago

PARALLEL_INDEX is not straightforward to implement

Really? Argh. Backlogging.

0x2b3bfa0 commented 2 years ago

Note to future self: it's also possible to hack something with a cloud-managed atomic queue, popping items when instances boot and pushing them when they're about to terminate. 🤷🏼‍♂️

Another dodecagonal wheel.
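
The queue idea can be sketched as follows, assuming an in-process `queue.Queue` as a stand-in for the cloud-managed queue; `IndexPool` is a hypothetical name. The queue is pre-filled with every index, an instance pops one when it boots and pushes it back when it is about to terminate, so a replacement instance picks up the freed index.

```python
import queue


class IndexPool:
    """Hypothetical pool of free indices backed by a thread-safe queue."""

    def __init__(self, total):
        self._free = queue.Queue()
        for index in range(total):
            self._free.put(index)

    def acquire(self):
        # Called when an instance boots; raises queue.Empty if none is free.
        return self._free.get_nowait()

    def release(self, index):
        # Called when an instance is about to terminate.
        self._free.put(index)


pool = IndexPool(total=3)
a, b, c = pool.acquire(), pool.acquire(), pool.acquire()
pool.release(b)               # the instance holding index 1 terminates
replacement = pool.acquire()  # its replacement reuses index 1
```

This sketch does not cover an instance that disappears without releasing its index; handling that case would need some form of lease or timeout.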

0x2b3bfa0 commented 2 years ago

Another hacky possibility: two instance groups, one for the leader instance and another for the workers.

redabuspatrol commented 2 years ago

Re-commenting here for better context.

I came across this PR while looking for this feature with AWS EC2. I think the ability to operate parallel instances with regular cloud providers and have some sort of indexing, or any mechanism, to dispatch work to the different instances can greatly help small teams and individual developers who don't have resources to manage k8s.

Originally posted by @redabuspatrol in https://github.com/iterative/terraform-provider-iterative/issues/597#issuecomment-1183537070