Added indexed option for k8s jobs

sjawhar commented 2 years ago

Purpose

Parallel K8s jobs have two modes: Indexed and NonIndexed (default). Indexed jobs have some nice abilities, for example each job gets a JOB_COMPLETION_INDEX environment variable so it could assign itself work without needing a separate task queue. There's currently no way to specify the k8s completion mode.

Approach

Added an optional boolean indexed attribute

Example when set to false or ommitted:

Example when set to true

TODO

[ ] Add tests
[ ] Update docs

dacbd commented 2 years ago

Can you provide a main.tf you tested with/can be tested with/example?

Also thanks for the contribution!

sjawhar commented 2 years ago

Can you provide a main.tf you tested with/can be tested with/example?

First, create a persistent volume with ReadWriteMany. For example:

kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-storage
  labels:
    type: local
spec:
  capacity:
    storage: 4Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /tpi-data

Then use this main.tf

terraform {
  required_providers {
    iterative = {
      source = "github.com/iterative/iterative"
    }
  }
}
provider "iterative" {}

resource "iterative_task" "example" {
  cloud     = "k8s"
  machine   = "s"
  disk_size = 1

  parallelism = 3
  indexed     = false

  storage {
    workdir = "."
    output  = "results"
  }
  script = <<-END
    #!/bin/bash

    mkdir -p results
    printenv > results/out-$JOB_COMPLETION_INDEX
  END
}

casperdcl commented 2 years ago

related #585 /CC @0x2b3bfa0

sjawhar commented 2 years ago

Awesome feature, @sjawhar! heart_eyes It would be nice to enable IndexedCompletion automatically when parallelism is greater than 1 instead of exposing a separate option that has no equivalent on other providers.

Thanks for the quick review! I also didn't like having an option that only applies to k8s, but I don't quite agree with your suggestion. Let me try to explain, though bear with me because I'm coming from a very cloud-heavy background and am relatively new to k8s.

K8s has two ways of running parallel jobs: indexed and nonindexed. Nonindexed means the jobs are identical, start up N, they do their thing. Indexed means that k8s expects a success for at least one job of each index 1-N. In other words, if job 3 fails, it'll keep retrying 3 specifically. They're useful for different things: you might use nonindexed to spawn N workers to accomplish M jobs by reading from a task queue, whereas you might use indexed to accomplish the same thing without the queue (e.g. by giving the full task list to each job, and then it assigns itself work based on its index).

tl;dr: both indexed and nonindexed are useful ways of running parallel jobs in k8s, and I don't think the user should be forced to use one or the other.

Now here's an extra twist: to accomplish the indexed example I described above (N workers for M jobs, where M > N), we need to be able to set .spec.completions and .spec.parallelism separately. So we actually need to add another k8s-specific option :sweat_smile:.

Thoughts on all this?

UPDATE: What if we removed indexed and added completions, and set CompletionMode to indexed if completions is set. That way there's only one k8s-specific option :smiley_cat:

dacbd commented 2 years ago

if we could make index less k8s specific like somehow populate an env like TASK_COMPLETION_INDEX = [1-3] basically reproducing the k8s JOB_COMPLETION_INDEX behavior for the cloud instances 🤔

sjawhar commented 2 years ago

I added the completions functionality as described in my last comment. This just reflects what works for my use case, I'm happy to tear apart and refactor to make more generic, as you have both described. Also, please forgive my Go, this might be precisely the second time I've ever used it :sweat_smile:

0x2b3bfa0 commented 2 years ago

For the use cases we[^1] have in mind, the following should be enough:

CompletionMode equals to NonIndexed if parallelism is 1 else Indexed
Completions equals to parallelism
Parallelism equals to parallelism

The following scenarios are still not supported:

Run $m$ different tasks on $n$ nodes, where $m \neq n$
Run $n$ identical workers taking tasks from an external queue

Out of curiosity, can you share a bit more of information on your use case? I guess that you may want to restrict the parallelism so running a massively parallel task doesn't exhaust your cluster's resources. 🤔

[^1]: 🔔 @iterative/cml for sanity

sjawhar commented 2 years ago

CompletionMode equals to NonIndexed if parallelism is 1 else Indexed

I think you still might not be grokking the purpose of non/indexed. They're both valid modes of parallel operation, it's not about parallelism = 1 or not.

Out of curiosity, can you share a bit more of information on your use case? I guess that you may want to restrict the parallelism so running a massively parallel task doesn't exhaust your cluster's resources. 🤔

You got it. If I need to run 100 things, I don't want to spawn 100 pods. This seems to me like a necessary consideration for supporting on-prem/k8s.

As you can see from the diff, this is a very minor change that opens up a lot of functionality in k8s. I'd be grateful if you'd reconsider the pros/cons here. 🙏

0x2b3bfa0 commented 2 years ago

If I need to run 100 things, I don't want to spawn 100 pods

What you propose is similar to jobs.<job_id>.strategy.max-parallel on GitHub Actions and could be useful, although supporting it outside k8s is significantly more complex.

As per your suggestion, the current parallelism argument should be replaced by two different arguments:

Number of times to run a task (i.e. completions in the Kubernetes parlance)
Maximum number of concurrent workers (i.e. parallelism in the Kubernetes parlance)

By default, 2 should be equal to 1 if not specified by users.

0x2b3bfa0 commented 2 years ago

I think you still might not be grokking the purpose of non/indexed. They're both valid modes of parallel operation, it's not about parallelism = 1 or not.

Sorry, I explained myself badly. 🙈

Traditionally, Kubernetes Jobs didn't have an Indexed mode, and were meant to run exactly the same code one or more times. Using an external task queue was the officially recommended way of running different tasks.

This project tries to simplify the user's workflow as much as possible: the Indexed mode eliminates the need for an external task queue in those cases when the total number of tasks is known in advance, but still allows users to differentiate every index.

My insistence of using NonIndexed mode when parallelism is 1 is just to support Kubernetes versions prior 1.24 if users don't need this feature. Requiring a single completion renders this feature useless, so we can safely disable it; when requiring more completions, it will be always enabled.

omesser commented 2 years ago

@iterative/cml So, assuming: CompletionsMode can be inferred implicitly, and going with the latest suggestion.

How do we want to handle something like (user provides k8s specific field when not using k8s):
```
resource "iterative_task" "example" {
cloud     = "aws"
...
parallelism = 3
completions = 5
}
```
Error: "completions not supported for non-k8s run-time"? (I am allergic to referring to k8s as a cloud provider 😓 ) We definitely don't have to support a task-queue like logic at this point IMO. We're stepping dangerously into "implementing scheduler" territory.
Considering the above, I have to point out the very likely fact that more and more k8s-specific switches will probably be requested and added in the future, and maybe some cloud-provider specific ones as well - some that it won't be natural to abstract away or support across providers neatly. My personal opinion is that we need to keep those as true as possible to the original behavior (=same name, some provider/run-time specific block? maybe nested under "advanced"?) so users can easily infer their behavior from the original docs without inventing ad-hoc abstractions for them that aren't natural. Thoughts on this?

casperdcl commented 2 years ago

introducing a k8s-specific config block makes sense, esp. since it's not-a-cloud:

resource "iterative_task" "example" {
  cloud       = "k8s"
  parallelism = ... # warn iff defined on `cloud = "k8s"`?

  # k8s-specific block
  k8s {
    parallelism = 3 # in the k8s sense, not in the TPI sense
    completions = 5
  }
}

0x2b3bfa0 commented 2 years ago

I'd avoid introducing backend-specific details, especially when the outer parallelism is functionally equivalent to the inner completions and the inner parallelism has no equivalent in other backends.

0x2b3bfa0 commented 2 years ago

@sjawhar, would it be possible to keep the original scope of this pull request (enable indexed mode when Kubernetes completions > 1) and open a separate pull request for the addition of a separate parallelism limit?

sjawhar commented 2 years ago

@sjawhar, would it be possible to keep the original scope of this pull request (enable indexed mode when Kubernetes completions > 1) and open a separate pull request for the addition of a separate parallelism limit?

I really don't think that's a good idea. Users should not be forced to use indexed mode. I went that route initially just to make it work for my use case as quickly as possible, but this would be both a change from the current behavior and unnecessarily limiting. Are you sure you want to do that?

0x2b3bfa0 commented 2 years ago

this would be both a change from the current behavior and unnecessarily limiting

I thought that enabling the indexed mode doesn't have any significant effect beyond exposing a couple of additional environment variables. Which kind of unnecessary limitations do you foresee?

sjawhar commented 2 years ago

I thought that enabling the indexed mode doesn't have any significant effect beyond exposing a couple of additional environment variables. Which kind of unnecessary limitations do you foresee?

It's not just about environment variables. From the docs:

NonIndexed (default): the Job is considered complete when there have been .spec.completions successfully completed Pods. In other words, each Pod completion is homologous to each other.

Indexed: the Pods of a Job get an associated completion index from 0 to .spec.completions-1. The Job is considered complete when there is one successfully completed Pod for each index.

These are two very different behaviors. Your suggestion work perfectly well for my use case (Indexed is what I need), but it seems unnecessarily limiting. The spec already has provider-specific flags (e.g. region does not apply to k8s), so we're not breaking new ground by adding another one.

Nonetheless, if you're sure then I'll make the change.

0x2b3bfa0 commented 2 years ago

Your suggestion work perfectly well for my use case (Indexed is what I need), but it seems unnecessarily limiting.

Indexed mode is precisely the kind of behavior this tool is meant to have; see #585. Unless there is any practical use case that requires the NonIndexed mode, it looks like there is no pressing need to support both.

The spec already has provider-specific flags (e.g. region does not apply to k8s), so we're not breaking new ground by adding another one.

Ouch! 😅 Indeed, although there were plans to “rename region to location so it's valid for zones and node selectors” as per https://github.com/iterative/terraform-provider-iterative/issues/412#issue-1152159459

omesser commented 2 years ago

@0x2b3bfa0

Indexed mode is precisely the kind of behavior this tool is meant to have; see https://github.com/iterative/terraform-provider-iterative/issues/585. Unless there is any practical use case that requires the NonIndexed mode, it looks like there is no pressing need to support both.

I'm with @sjawhar on this one - it makes no sense taking away this choice for the perceived "simplicity" of merely saving a field here (1:1 with k8s field, explainability & docs built in). If we will get here the power of NonIndexed for the case of k8s only - that's still valuable IMO. No need to abuse terminology or over generalize - this should be in a k8s-specific block imo (yes, exposing backend specific details), no need to try and provide same functionality/choice across all other clouds/engines - that would be a broken abstraction. The key to keep our code powerful and still simple, meaningful and maintainable, IMO, is this - once something is not easily/directly generalizable - keep it provider specific, help the user transition to k8s terminology

"rename region to location so it's valid for zones and node selectors"

I'd advise against 😨 (commented on #412 )

0x2b3bfa0 commented 2 years ago

the power of NonIndexed

Citation needed. 😅 If we provide “powerful” options it should be for a reason, methinks.

0x2b3bfa0 commented 2 years ago

this should be in a k8s-specific block imo (yes, exposing backend specific details), no need to try and provide same functionality/choice across all other clouds/engines - that would be a broken abstraction.

We may end up exposing backend-specific details when shoehorning them under a common specification is proven to be difficult or impractical. 👍🏼

Although I wonder if this is the case.

0x2b3bfa0 commented 2 years ago

The key to keep our code powerful and still simple, meaningful and maintainable, IMO, is this - once something is not easily/directly generalizable - keep it provider specific, help the user transition to k8s terminology

When terminology is [susceptible of being] generalizable and different backends overload the same names with different meanings, I believe that it makes sense to run the extra mile and “generalize” on our side a bit more.

I'd argue that this particular functionality should be generalized, although in some other cases (e.g. placement) we end up resorting to specific options.

0x2b3bfa0 commented 2 years ago

taking away this choice for the perceived "simplicity" of merely saving a field here

Taking away some choices isn't implicitly negative; quite the opposite. Unless we find any supported use case that benefits particularly from NonIndexed mode, I'd consider not exposing that choice to users.

omesser commented 2 years ago

We discussed this offline a bit, and specifically - I find it hard to defend the choice for NoneIndexed, except for the religious reason of "give me ma k8s options!" So I would say we can live without it for now

sjawhar commented 2 years ago

Ok, I'll make the changes sometime in the next few days

redabuspatrol commented 2 years ago

I came across this PR while looking for this feature with AWS EC2. I think the ability to operate parallel instances with regular cloud providers and have some sort of indexing, or any mechanism, to dispatch work to the different instances can greatly help small teams and individual developers who don't have resources to manage k8s.

0x2b3bfa0 commented 2 years ago

@redabuspatrol, can you please comment on #585? In the meanwhile, you may also want to try having a separate task for every chunk of work.

sjawhar commented 2 years ago

Sorry for the long delay. I've reverted most of the changes, and implemented @0x2b3bfa0 's preferred solution.

0x2b3bfa0 commented 2 years ago

Thank you very much, @sjawhar! 🙏🏼

casperdcl commented 2 years ago

Thanks so much @sjawhar! Sorry for the delay and (perhaps excessive) hesitancy from our end too.

The reverted changes still look worth pursuing to us, just in separate PRs. We'd probably also be much quicker at understanding & merging said PRs if you could share an example of their use too... I assume related to things like https://github.com/iterative/terraform-provider-iterative/compare/master...sjawhar:terraform-provider-iterative:feature/nfs-volume and https://github.com/iterative/dvc/compare/main...sjawhar:dvc:feature/parallel-repro?

sjawhar commented 2 years ago

I can open separate issues for each of the following features

Ability to set completions separately from parallelism for k8s jobs (the feature that was rolled back in this PR)
Ability to mount NFS volumes to k8s jobs
Using the region field to set node selectors for k8s jobs
A fix for how controller-uid is queried in k8s jobs. The current method breaks when you add custom tags/labels to the jobs.

From the links above, you'll see I've already implemented all of these in the feature/nfs-volume branch, which I'm using for a project at work. Would be great to get all of them upstream, and I think the only sticking point is how to represent cloud-specific options in Terraform file.

sjawhar commented 2 years ago

Broke out the issues/fixes

iterative / terraform-provider-iterative