The job in live mode is assumed as running, if it is running in a different cluster

YevheniiSemendiak commented 3 years ago

STR:

1. Use the NF project description with live mode tasks.
2. While in cluster A, launch a non-multi task.
3. Switch to a different cluster and try to launch the same task.

Result: NF fails with the error "the task is already running".

Expected: you can launch the same task on different clusters.

romasku commented 3 years ago

Hello, this is not a neuro-flow error; it is how the underlying neuro API behaves. Let me explain what is happening with an example:

neuro run --name somename ...
neuro config switch-cluster ...
neuro run --name somename ...

So you are trying to create multiple jobs with the same name. Because a job name is interchangeable with a job ID, and can therefore be used in non-cluster-specific commands such as neuro status, it is forbidden to have multiple running jobs with the same name, even if they are in different clusters.
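The rule described above can be illustrated with a minimal sketch. This is not the actual server code: `JobRegistry`, `JobNameConflict`, and the user and cluster names are hypothetical, and the only assumption taken from the thread is that running jobs are unique per (owner, name) pair regardless of cluster.

```python
from dataclasses import dataclass, field


class JobNameConflict(Exception):
    """Raised when a running job already holds the requested name."""


@dataclass
class JobRegistry:
    # Hypothetical in-memory stand-in for the server-side check.
    # Running jobs are keyed by (owner, name); the cluster is NOT part of the key.
    running: dict = field(default_factory=dict)

    def run(self, owner: str, name: str, cluster: str) -> str:
        key = (owner, name)
        if key in self.running:
            raise JobNameConflict(
                f"job {name!r} of {owner!r} is already running "
                f"in cluster {self.running[key]!r}"
            )
        self.running[key] = cluster
        return f"{name}@{cluster}"


registry = JobRegistry()
registry.run("alice", "somename", cluster="cluster-a")

try:
    # Same name after switching clusters: rejected, as in the report.
    registry.run("alice", "somename", cluster="cluster-b")
    conflict = False
except JobNameConflict:
    conflict = True
```

With this model, the second `run` call fails even though the target cluster differs, which mirrors the sequence of neuro commands shown above.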

I think this error happened because we recently added auto-generated names (--name) for live jobs. To resolve your issue we can:

(1) add the cluster name to the auto-generated job name. It will work, but it looks like a slightly ugly solution to me.
(2) require uniqueness only of the (owner, name, cluster) triple instead of the (owner, name) pair. This will require changes on the server side that can potentially break old clients. I'm not sure it is worth it.
(3) require the user to manually specify the name attribute in such cases. Not the best UX, in my opinion.

So I do not see any 100% good solution here. I would probably prefer (2) as this is how it works for disks. Any thoughts?
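To make the trade-off in option (2) concrete, here is a rough sketch under the assumption that uniqueness is checked per (owner, name, cluster) triple. The functions `run` and `resolve` and the sample names are hypothetical and do not correspond to any real neuro API. The sketch shows that the same name can then run in two clusters, but a lookup by name alone becomes ambiguous, which is one way name-based commands in older clients could be affected.

```python
from typing import Optional, Set, Tuple

# Hypothetical model of option (2): running jobs are unique per
# (owner, name, cluster) triple instead of per (owner, name) pair.
running: Set[Tuple[str, str, str]] = set()


def run(owner: str, name: str, cluster: str) -> None:
    key = (owner, name, cluster)
    if key in running:
        raise ValueError(f"{name!r} is already running in {cluster!r}")
    running.add(key)


def resolve(owner: str, name: str, cluster: Optional[str] = None) -> str:
    """Return the cluster of the running job with this name, if unambiguous."""
    matches = sorted(
        c
        for (o, n, c) in running
        if o == owner and n == name and (cluster is None or c == cluster)
    )
    if not matches:
        raise LookupError(f"no running job named {name!r}")
    if len(matches) > 1:
        raise LookupError(f"{name!r} is ambiguous across clusters: {matches}")
    return matches[0]


run("alice", "somename", "cluster-a")
run("alice", "somename", "cluster-b")  # allowed under the triple key

resolved = resolve("alice", "somename", cluster="cluster-b")

try:
    resolve("alice", "somename")  # no cluster given
    ambiguous = False
except LookupError:
    ambiguous = True
```

A lookup that passes the cluster explicitly still works; a lookup by name only needs some extra rule (for example, defaulting to the current cluster) to stay well defined.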

YevheniiSemendiak commented 3 years ago

I agree with your thoughts: a named job is uniquely identified by the set of name, owner, (tenant in the future), cluster, and control plane (the control plane can simply be skipped and become implicit). Variant 2 sounds reasonable. Regarding compatibility, we can discuss it within the team and decide whether it is worth doing at the moment. As for me, it is not a blocker, just some sort of "weird thing".

YevheniiSemendiak commented 3 years ago

Another thought: if we are trying to enrich neuro-flow's collaboration capabilities, my suggestion might be a step in the opposite direction: into the global scope instead of the project scope. For NF this logic should probably be different. Summoning the MLOps team, @anayden and @mariyadavydova: WDYT in the context of this issue?

anayden commented 3 years ago

First off, I imagine this issue will never be seen by 99% of our users, as they stick to one cluster.

Option (1) seems bad because it will create job hostnames with a duplicated cluster string, jobname--username--clustername.jobs.clustername.org.neu.ro, and make the useful part of the job name even shorter than it is now.
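The duplication can be seen in a small sketch. The hostname pattern is inferred from the example hostname quoted above; `job_hostname`, the `name_includes_cluster` flag, and the sample names are hypothetical. The 63-character figure is the general DNS limit for a single label; whether the platform's job name limit is derived from it is not stated in this thread.

```python
DNS_LABEL_MAX = 63  # general DNS limit for one label, not a platform-specific value


def job_hostname(job_name: str, owner: str, cluster: str,
                 name_includes_cluster: bool = False) -> str:
    # Assumed pattern: <job>--<owner>[--<cluster>].jobs.<cluster>.org.neu.ro
    parts = [job_name, owner]
    if name_includes_cluster:
        # Option (1): the cluster name becomes part of the job name.
        parts.append(cluster)
    first_label = "--".join(parts)
    return f"{first_label}.jobs.{cluster}.org.neu.ro"


current = job_hostname("develop", "alice", "cluster-a")
option_1 = job_hostname("develop", "alice", "cluster-a",
                        name_includes_cluster=True)

# Characters of the first label consumed by the cluster suffix in option (1).
extra = len(option_1.split(".")[0]) - len(current.split(".")[0])
```

Here `option_1` contains the cluster name twice, and `extra` is the number of characters taken away from the user-visible part of the name within the first label.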

As far as "moving to global scope" idea is concerned, I don't fully understand what it means in this context.

YevheniiSemendiak commented 3 years ago

> As far as "moving to global scope" idea is concerned, I don't fully understand what it means in this context.

Exactly this: require uniqueness of the tuple (name, owner, (tenant in the future), cluster).

YevheniiSemendiak commented 3 years ago

@romasku I have another observation of improper behavior in this context. STR:

1. Go to the examples/demo-jobs folder in this repo.
2. In live.yml, remove the port forwarding config from the develop job definition.
3. Run this job as user A.
4. In a separate console, switch to user B, who has access to the jobs of user A.
5. Run the same job.

Result: Job develop is running, connecting...

Expected: New job instance is running


YevheniiSemendiak commented 2 years ago

Duplicate of https://github.com/neuro-inc/neuro-cli/issues/2442

neuro-inc / neuro-flow

The job in live mode is assumed as running, if it is running in a different cluster #496