nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Allow the concurrent run of multiple pipeline revisions #2870

Open pditommaso opened 2 years ago

pditommaso commented 2 years ago

Summary

Nextflow relies on built-in integration with Git to pull and run a workflow.

When the user specifies the Git repository URL on the run command line, Nextflow carries out a Git clone command, stores the pipeline code in the $HOME/.nextflow/assets directory, and launches the execution from there.

When the user specifies the -r (revision) CLI option, the repository is checked out at the specified revision, i.e. a branch, tag, or even commit id.

This, however, poses a problem when two or more users run different versions at the same time, because the last one to perform the operation overrides the previously checked-out repository code, which could be a disruptive operation.

This is not such an unlikely event considering a pipeline execution can last for hours or even days.

To mitigate this problem, Nextflow refuses to perform a run if the project is currently checked out at a non-default version and the run does not explicitly specify the revision to be executed. However, this causes other unexpected side effects. See here.

Goal

The goal of this enhancement is to allow the concurrent use of multiple pipeline revisions on the same computer and to deprecate the need for the sticky revision check.

This could be achieved by downloading the Git repository with a bare clone instead of a normal clone, and checking out the work tree into a separate subdirectory named after the commit id associated with the specified revision.

For example, if the user runs

nextflow run https://github.com/nextflow-io/hello

Nextflow should clone the repo above with the bare option and store it in the path $HOME/.nextflow/assets/nextflow-io/hello.git

The default branch is then implicitly checked out, so the associated commit id should be retrieved, e.g. 4eab81bd42eed592f4371cd91b755ec78df25fe9, and the following path should be created to hold the work tree used for the execution:

$HOME/.nextflow/assets/nextflow-io/hello.git/.nextflow/revs/4eab81bd42eed592f4371cd91b755ec78df25fe9

When the user specifies a different revision, e.g.

nextflow run https://github.com/nextflow-io/hello -r dev

A new subdirectory with the corresponding commit id should be created.

The commit id should be resolved against the local git clone, unless the -latest option is specified.
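
For illustration, the layout proposed above can be approximated with plain git commands. This is only a minimal sketch, not how Nextflow implements it: it assumes that `git worktree` is used to create the per-commit work trees, it uses a throwaway local repository in place of the GitHub remote, and the helper name `checkout_revision` is hypothetical.

```shell
set -eu
tmp=$(mktemp -d)

# Throwaway local repository standing in for https://github.com/nextflow-io/hello
git init -q "$tmp/origin"
commit() { git -C "$tmp/origin" -c user.name=demo -c user.email=demo@example.com commit -q "$@"; }
echo 'println "hello"' > "$tmp/origin/main.nf"
git -C "$tmp/origin" add main.nf
commit -m "initial"
git -C "$tmp/origin" branch dev            # dev stays on the first commit
echo '// newer' >> "$tmp/origin/main.nf"
commit -a -m "second"                      # default branch moves ahead

# Bare clone stored under the assets directory
assets="$tmp/assets/nextflow-io/hello.git"
git clone -q --bare "$tmp/origin" "$assets"

# Hypothetical helper: resolve a revision against the local clone and
# check it out into a work tree named after the commit id
checkout_revision() {
  commit_id=$(git -C "$assets" rev-parse --verify "$1^{commit}")
  tree="$assets/.nextflow/revs/$commit_id"
  if [ ! -d "$tree" ]; then
    mkdir -p "$(dirname "$tree")"
    git -C "$assets" worktree add -q --detach "$tree" "$commit_id" >/dev/null
  fi
  echo "$tree"
}

default_tree=$(checkout_revision HEAD)     # default branch
dev_tree=$(checkout_revision dev)          # equivalent of -r dev
echo "$default_tree"
echo "$dev_tree"
```

Because each work tree lives in a directory named after its commit id, two runs on different revisions never share a checkout, while two runs on the same commit reuse the existing directory.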

jorgeaguileraseqera commented 2 years ago

@pditommaso one question:

do we need to check whether the repo is present in the assets directory in the "old" format (non-bare) and, in that case, not use the bare feature? i.e. some kind of backward compatibility, or will we force users to remove and recreate local repos?

pditommaso commented 2 years ago

Good point. If it already exists, I think we should report a warning message, maybe?

jorgeaguileraseqera commented 2 years ago

so, report with a warning message (maybe with some instructions to remove the current repo) and stop the command, right?

pditommaso commented 2 years ago

No I mean, show a warning message i.e. log.warn and do not stop. Usually nextflow only stops on error.

jorgeaguileraseqera commented 2 years ago

ah ok, so if I understand correctly we try to identify which kind of repo we are working on at startup

pditommaso commented 2 years ago

I see your point. In principle, the bare repo would have been created when the feature was enabled with a config flag or env variable, right?

If so, I think a warning should be reported when this option does not match the repo format.
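
As a rough illustration of that check, the format of an existing local repository can be detected with `git rev-parse --is-bare-repository`. The sketch below is an assumption about how such a check could look, not Nextflow code: the helper names `repo_format` and `warn_on_mismatch` are hypothetical, and the expected format is passed in by hand where a config flag or env variable would supply it.

```shell
set -eu

# Hypothetical helper: prints "bare", "non-bare", "unknown" or "missing"
repo_format() {
  if [ ! -d "$1" ]; then
    echo missing
    return 0
  fi
  kind=$(git -C "$1" rev-parse --is-bare-repository 2>/dev/null || echo unknown)
  case "$kind" in
    true)  echo bare ;;
    false) echo non-bare ;;
    *)     echo unknown ;;
  esac
}

# Hypothetical helper: warn, but do not stop, when the local repo does not
# match the expected format ($2 is "bare" or "non-bare")
warn_on_mismatch() {
  actual=$(repo_format "$1")
  if [ "$actual" != missing ] && [ "$actual" != "$2" ]; then
    echo "WARN: $1 is a $actual repository but a $2 one was expected" >&2
  fi
}

# Demo on two throwaway repositories
tmp=$(mktemp -d)
git init -q "$tmp/old-style"
git init -q --bare "$tmp/new-style.git"

warn_on_mismatch "$tmp/old-style" bare        # prints a warning on stderr
warn_on_mismatch "$tmp/new-style.git" bare    # silent
```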

pditommaso commented 2 years ago

@jorgeaguileraseqera any ETA for this?

jorgeaguileraseqera commented 2 years ago

I hope to have it in the next few days

(it's a little tedious due to the API rate limit breaks sometimes to run all tests )

pditommaso commented 2 years ago

Can you please open at least a draft PR asap?

pditommaso commented 2 years ago

due to the API rate limit breaks sometimes to run all tests

Do you mean Github rate limits? Are you using your GITHUB_TOKEN for tests?

jorgeaguileraseqera commented 2 years ago

yes, I've created one and configured the env to run the tests

pditommaso commented 2 years ago

Weird, but such tests should not depend on GitHub. They can create a small test repo and then use it for testing.

There's something similar for testing Git submodules

notestaff commented 2 years ago

Implementing the functionality in this issue would also solve issue #2655 .

Maybe also, clarify in the issue title that "concurrent run" is only for runs from different working directories (with different work/ and .nextflow subdirs).

ewels commented 2 years ago

Maybe also, clarify in the issue title that "concurrent run" is only for runs from different working directories (with different work/ and .nextflow subdirs).

Note that we're talking about the NXF_HOME folder (~/.nextflow), not the hidden .nextflow folder in the launch directory here.

pditommaso commented 1 year ago

We lost the momentum with this feature :/

lukbut commented 1 year ago

Hi! This was recently brought to my attention. Just flagging that this would likely impact our engineers who might be developing on different feature branches but on the same workflow repo, on our development environments (which currently only run on our on-prem infrastructure).

pditommaso commented 1 year ago

Impacting in a good or bad way?

lukbut commented 1 year ago

Hi @pditommaso, it would impact us in a bad way, I'm afraid! Our current idea for developing workflows within our organisation is for engineers to have their own branch in a workflow repository. They would implement changes in their own branch, and potentially run said workflows on our on-prem infrastructure to test their implementations. I believe that due to this bug, the engineers would end up overwriting each other's workflow implementations if multiple implementations of the same workflow are tested at the same time?

pditommaso commented 1 year ago

Understand, but it's not a bug. Nextflow has always worked in this way. The goal of this issue is exactly to overcome this limitation

leonorpalmeira commented 1 year ago

Don't know how the solution to this issue will be implemented, but don't forget (see https://github.com/nextflow-io/nextflow/issues/2655#issuecomment-1232941807) the use case where a developer has their own repository (outside of Nextflow's built-in integration of pull and run commands) and switches between branches during the execution of a pipeline. The solution to this issue should be that the execution shouldn't be affected by modifications of the original repository. Thanks :-)

lukbut commented 1 year ago

Hi! This issue has just come up again at Genomics England as it is likely that our engineers would want to run different branches of the workflows simultaneously. Is there any chance that this is getting implemented soon?

bentsherman commented 1 year ago

Hey Luke, we are planning to implement this but no set timeline yet.

pditommaso commented 1 year ago

Indeed, it is something to prioritize. Tagging @marcodelapierre for visibility

marcodelapierre commented 1 year ago

Paolo I have found a git functionality for this.

Let's bash code:

# for ease of description
ROOT_DIR="/path/to/.nextflow/assets"
repo="nextflow-io/hello"
revision="rocket"
def_remote="origin"

# user
nextflow run "$repo" -r "$revision"

# behind the scenes

# only if revision is not there already
if [ ! -d "$ROOT_DIR/$repo/$revision" ] ; then

  # first revision requested: regular clone, kept in a directory
  # named after the remote's default branch
  if [ ! -d "$ROOT_DIR/$repo" ] ; then
    mkdir -p "$ROOT_DIR/$repo/first"
    git clone -b "$revision" "https://github.com/$repo" "$ROOT_DIR/$repo/first"
    cd "$ROOT_DIR/$repo/first"
    def_branch=$( git remote show "$def_remote" | sed -n '/HEAD branch/s/.*: //p' )
    cd ..
    mv first "$def_branch"
    ln -s "$def_branch" first_branch

  # additional revision: new work tree sharing the .git of the first clone
  else
    cd "$ROOT_DIR/$repo/first_branch"
    git worktree add --track -b "$revision" "../$revision" "$def_remote/$revision"
  fi

fi

The key functionality is this one:

git worktree add --track -b dsl2 ../dsl2 origin/dsl2 

Docs: https://git-scm.com/docs/git-worktree

Found here: https://stackoverflow.com/questions/2048470/git-working-on-two-branches-simultaneously
And also here: https://stackoverflow.com/questions/6270193/how-can-i-have-multiple-working-directories-with-git/30185564#30185564

What do you think?

If you like it, I can give it a shot myself, soon after I have worked on another couple of pending work items.

marcodelapierre commented 1 year ago

Forgot to mention the key advantage: only the repo file tree is duplicated, whereas all the Git related files such as in .git/ exist only once

ewels commented 1 year ago

Never too old to learn a new git subcommand 😆

marcodelapierre commented 1 year ago

Never too old to learn a new git subcommand 😆

indeed! this is a clear mark of our young ages...!! 😂

marcodelapierre commented 1 year ago

@pditommaso keen on your take on my proposed solution before I work on the implementation

pditommaso commented 1 year ago

This is indeed an excellent idea. This could simplify the solution compared to the use of the bare repository approach.

Using the worktree solution, the main/master checkout should remain in the current location. Instead, when -r <revision> is requested, a new work tree should be created under the path $NXF_ASSETS/revisions/<unique-id>, where unique-id is computed as the sipHash24 of the project URI + revision.

The --detach flag is likely to be useful too.
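
A minimal sketch of this variant follows, under stated assumptions: `git hash-object --stdin` (SHA-1) is used purely as a portable stand-in for sipHash24, a throwaway local repository plays the role of the project URI, and the helper names `revision_dir` and `add_revision` are hypothetical.

```shell
set -eu
tmp=$(mktemp -d)

# Throwaway local repository standing in for the real project
git init -q "$tmp/origin"
echo 'println "hello"' > "$tmp/origin/main.nf"
git -C "$tmp/origin" add main.nf
git -C "$tmp/origin" -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"
git -C "$tmp/origin" branch dev

NXF_ASSETS="$tmp/assets"
project_uri="$tmp/origin"                 # would be the project URL in practice
main_checkout="$NXF_ASSETS/nextflow-io/hello"

# The default checkout stays where it is today (a regular clone)
git clone -q "$project_uri" "$main_checkout"

# Hypothetical helper: unique id from project URI + revision.
# NOTE: SHA-1 via git hash-object is a stand-in, not sipHash24.
revision_dir() {
  id=$(printf '%s:%s' "$1" "$2" | git hash-object --stdin)
  echo "$NXF_ASSETS/revisions/$id"
}

# Hypothetical helper: create a detached work tree for a remote branch
add_revision() {
  dir=$(revision_dir "$project_uri" "$1")
  if [ ! -d "$dir" ]; then
    mkdir -p "$(dirname "$dir")"
    git -C "$main_checkout" worktree add -q --detach "$dir" "origin/$1" >/dev/null
  fi
  echo "$dir"
}

dev_dir=$(add_revision dev)
echo "$dev_dir"
```

In the resulting work tree, `.git` is a small file pointing back at the main checkout rather than a full directory, so only the pipeline files are duplicated. With `--detach` no extra local branch is created per revision.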

marcodelapierre commented 1 year ago

maybe this path for non-master revisions: $NXF_ASSETS/revisions/$repo/$revision

marcodelapierre commented 11 months ago

Apologies @pditommaso, I had to prioritise other activities with larger customer impact.

I am keen to get this one done, on top of my list for when I am back in January.

notestaff commented 11 months ago

Ideally, the worktree should be checked out with all submodules recursively cloned, or there should be an option to do so. But if this complicates things, it can be left for a later release.

Thanks a lot for working on this!

marcodelapierre commented 10 months ago

Working on it. Turns out that the eclipse.jgit project we currently rely on does not support git worktree; there is a PR (https://bugs.eclipse.org/bugs/show_bug.cgi?id=477475) that has been open for years and only adds support for managing existing worktrees, not even for creating new ones.

Proposed steps for way forward:

  1. start by implementing just a change in the clone directory structure and repo management, so that multiple revisions of a repo are supported;
  2. consider whether to polish cloned pipelines from .git to save disk space;
  3. if relevant, explore alternatives to jgit (if any) that have wider git support (the main advantage of worktree is indeed avoiding the .git duplicates, so I don't think this step needs exploring).

At this stage, I believe 1. can already be good enough. In its basic implementation it would duplicate the .git files; however, is a local collection of revisions of a pipeline very much different from one of multiple pipelines?

So, going to proceed with 1. to begin with.
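
To make the trade-off in step 1 concrete, here is a small sketch of one full clone per revision. It is an assumption about what such a layout could look like, not the implemented one: it borrows the $NXF_ASSETS/revisions/$repo/$revision path suggested earlier in this thread, uses a throwaway local repository as the remote, and the helper name `clone_revision` is hypothetical.

```shell
set -eu
tmp=$(mktemp -d)

# Throwaway local repository standing in for the real project
git init -q "$tmp/origin"
echo 'println "hello"' > "$tmp/origin/main.nf"
git -C "$tmp/origin" add main.nf
git -C "$tmp/origin" -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"
git -C "$tmp/origin" branch dev

NXF_ASSETS="$tmp/assets"
repo="nextflow-io/hello"

# Hypothetical helper: one independent clone per revision.
# --branch accepts branch and tag names, not raw commit ids.
clone_revision() {
  target="$NXF_ASSETS/revisions/$repo/$1"
  if [ ! -d "$target" ]; then
    git clone -q --branch "$1" "$tmp/origin" "$target"
  fi
  echo "$target"
}

default_branch=$(git -C "$tmp/origin" symbolic-ref --short HEAD)
default_clone=$(clone_revision "$default_branch")
dev_clone=$(clone_revision dev)
echo "$default_clone"
echo "$dev_clone"
```

Each directory carries its own complete `.git`, which is the duplication mentioned above; in exchange, nothing beyond a plain clone is needed.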

farshadf commented 3 months ago

Just double checking if this feature has been implemented. I cannot find a link to any doc clearly indicating this feature is now working. Thank you.