neuro-inc / neuro-flow

Execution engine for scripts and pipelines
https://neu-ro.gitbook.io/neuro-flow/

A job in live mode is treated as already running if it is running in a different cluster #496

Closed YevheniiSemendiak closed 2 years ago

YevheniiSemendiak commented 3 years ago

STR:

  1. Use an NF project description with live mode tasks.
  2. While in cluster A, launch a non-multi task.
  3. Switch to a different cluster and try to launch the same task.

Result: NF fails with the error "the task is already running".

Expected: the same task can be launched on different clusters.

romasku commented 3 years ago

Hello, this is not a neuro-flow error but basic neuro API behavior. Let me explain what is happening with an example:

neuro run --name somename ...
neuro config switch-cluster ...
neuro run --name somename ...

So you are trying to create multiple jobs with the same name. Because a job name is interchangeable with a job ID, so that you can use it in non-cluster-specific commands such as neuro status, it is forbidden to have multiple running jobs with the same name, even if they are in different clusters.
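The constraint described above can be sketched as a toy registry. This is only an illustration, not the platform's implementation: `GlobalNameRegistry`, `JobNameConflict`, the `(owner, name)` key, and the user and cluster names are all hypothetical.

```python
from dataclasses import dataclass, field


class JobNameConflict(Exception):
    """Raised when a running job with the same name already exists."""


@dataclass
class GlobalNameRegistry:
    # Hypothetical model: maps (owner, name) -> cluster.
    # The cluster is deliberately NOT part of the key.
    running: dict = field(default_factory=dict)

    def run(self, owner: str, name: str, cluster: str) -> None:
        key = (owner, name)
        if key in self.running:
            raise JobNameConflict(
                f"job {name!r} is already running in cluster {self.running[key]!r}"
            )
        self.running[key] = cluster


registry = GlobalNameRegistry()
registry.run("alice", "somename", "cluster-a")  # first `run --name somename`

try:
    # Same name after switching clusters: rejected, because the key ignores the cluster.
    registry.run("alice", "somename", "cluster-b")
    conflict = False
except JobNameConflict:
    conflict = True
```

Because the name has to resolve to a single job without a cluster hint, the second launch is rejected even though it targets another cluster.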

I think this error happened because we recently added auto-generated names (--name) for live jobs. To resolve your issue we can:

So I do not see any 100% good solution here. I would probably prefer (2) as this is how it works for disks. Any thoughts?

YevheniiSemendiak commented 3 years ago

Agree with your thoughts: the unique identifier of a named job is the set of name, owner, (tenant in future), cluster and control plane (the control plane might just be skipped and become implicit). Variant 2 sounds reasonable. Regarding compatibility, we might discuss it within the team and decide whether it is worth doing at the moment. As for me, it is not a blocker, just a "weird thing".
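A minimal sketch of the identification described here, assuming the key is just name, owner and cluster (tenant and control plane are left out). `JobKey` and `can_start` are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Set


class JobKey(NamedTuple):
    # Hypothetical composite identity of a named job.
    name: str
    owner: str
    cluster: str


def can_start(running: Set[JobKey], key: JobKey) -> bool:
    """A named job may start only if no running job has the same full key."""
    return key not in running


running = {JobKey("somename", "alice", "cluster-a")}

# Same name and owner, different cluster: allowed under a cluster-scoped key.
other_cluster_ok = can_start(running, JobKey("somename", "alice", "cluster-b"))
# Same name, owner and cluster: still rejected.
same_cluster_ok = can_start(running, JobKey("somename", "alice", "cluster-a"))
```

With the cluster in the key, a bare name no longer identifies a job on its own, so commands that accept a name would also need to know which cluster to look in.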

YevheniiSemendiak commented 3 years ago

Another thought: if we are trying to enrich neuro-flow's collaboration capabilities, my suggestion might be a step in the opposite direction: into the global scope instead of the project scope. For NF this logic should probably be different. Summoning the MLOps team, @anayden and @mariyadavydova: WDYT in the context of this issue?

anayden commented 3 years ago

First off, I imagine this issue will never be seen by 99% of our users, as they stick to one cluster.

Option (1) seems bad because it will create job hostnames with a duplicated cluster string, jobname--username--clustername.jobs.clustername.org.neu.ro, and make the useful part of the job name even shorter than it is now.
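A rough sketch of the length concern, using the hostname pattern quoted above. It assumes the limit in play is the 63-character DNS label limit and that the alternative pattern simply omits the `--clustername` segment. Both are assumptions, and `hostname` and `max_job_name_len` are hypothetical helpers.

```python
MAX_DNS_LABEL = 63  # a single DNS label is limited to 63 characters


def hostname(job: str, user: str, cluster: str, cluster_in_label: bool) -> str:
    # Build the first label as job--user[--cluster], then append the domain.
    parts = [job, user] + ([cluster] if cluster_in_label else [])
    return "--".join(parts) + f".jobs.{cluster}.org.neu.ro"


def max_job_name_len(user: str, cluster: str, cluster_in_label: bool) -> int:
    # Characters left for the job name once the fixed part of the label is counted.
    overhead = len(hostname("", user, cluster, cluster_in_label).split(".")[0])
    return MAX_DNS_LABEL - overhead


with_cluster = hostname("jobname", "username", "clustername", True)
budget_with = max_job_name_len("username", "clustername", True)
budget_without = max_job_name_len("username", "clustername", False)
```

Under these assumptions the cluster string appears twice in the hostname, and the job name loses exactly the length of the extra `--clustername` segment.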

As far as "moving to global scope" idea is concerned, I don't fully understand what it means in this context.

YevheniiSemendiak commented 3 years ago

> As far as "moving to global scope" idea is concerned, I don't fully understand what it means in this context.

Exactly this: require uniqueness of the tuple name, owner, (tenant in future), cluster.

YevheniiSemendiak commented 3 years ago

@romasku I have another observation of improper behavior in this context. STR:

  1. Go to the examples/demo-jobs folder in this repo.
  2. In live.yml, remove the port forwarding config from the develop job definition.
  3. Run this job as user A.
  4. In a separate console, switch to user B, who has access to the jobs of user A.
  5. Run the same job.

Result: Job develop is running, connecting...

Expected: a new job instance is started.


YevheniiSemendiak commented 2 years ago

Duplicate of https://github.com/neuro-inc/neuro-cli/issues/2442