nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Add support for Amazon CodeCommit #61

Closed pditommaso closed 2 years ago

pditommaso commented 9 years ago

CodeCommit is a code repository hosted on Amazon AWS.

The main benefit is that it allows large files (up to 2 GB per file) to be stored in the repository.

Configuration doc is available at this link.

API documentation is available at this link.

pditommaso commented 8 years ago

At this time this feature is not relevant. Closing for now.

wleepang commented 4 years ago

Curious if there is interest in re-opening this issue and targeting access to AWS CodeCommit using git-remote-codecommit over HTTPS, as documented here:

https://docs.aws.amazon.com/codecommit/latest/userguide/setting-up-git-remote-codecommit.html

This would allow users to run workflows hosted in a private AWS CodeCommit repo via:

nextflow run codecommit::<region>://<repository-name> ...

This would be particularly useful for running Nextflow on a "batch-squared" architecture, where the master Nextflow job uses temporary credentials from the host instance or job role.
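
For illustration, the corresponding entry in Nextflow's `~/.nextflow/scm` config file might look like the following. This is a hypothetical sketch: the `codecommit` platform name and the `region` attribute don't exist yet at this point.

    providers {
        my_codecommit {
            platform = 'codecommit'  // hypothetical platform id
            region = 'us-east-1'     // hypothetical attribute for the AWS region
        }
    }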

pditommaso commented 4 years ago

I don't remember exactly which technical issue I ran into when adding support for CodeCommit. Does it support HTTP/S transport?

wleepang commented 4 years ago

Yes, CodeCommit supports both SSH and HTTPS transport. For HTTPS, there are a couple of options: static Git credentials generated through IAM, or the git-remote-codecommit helper, which signs requests using standard AWS credentials.
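
For example, the same repository could be cloned either way (region and repository name are placeholders):

    # option 1: static Git credentials generated in IAM
    git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/MyRepo

    # option 2: git-remote-codecommit, which signs requests with standard AWS credentials
    git clone codecommit::us-east-1://MyRepo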

pditommaso commented 4 years ago

A pull request is welcome for this feature. The GitHub implementation can be used as a reference for it:

https://github.com/nextflow-io/nextflow/blob/b49e6d74dae782def6be2b087d06fdd8fa6b9291/modules/nextflow/src/main/groovy/nextflow/scm/GithubRepositoryProvider.groovy#L27-L78

jcurado-flomics commented 4 years ago

+1! I was also just looking for this. It would be great to have it

wleepang commented 4 years ago

@pditommaso - looking through the base provider, I wanted to clarify: does Nextflow primarily use HTTP/S API calls to get repo files? Could there be an option to rely on an installed version of git?

pditommaso commented 4 years ago

NF uses the provider REST API to fetch the clone URL and to read the pipeline config file(s) remotely, i.e. without cloning the repository.

Once it has fetched the clone URL, it pulls the project using the embedded Git client:

https://github.com/nextflow-io/nextflow/blob/55be0762950184114877519fdb810afc199478f4/modules/nextflow/src/main/groovy/nextflow/scm/AssetManager.groovy#L571

Therefore I don't think an installed version of git would help much. I'm not sure CodeCommit has a REST API to fetch repository metadata; if I remember well, this was the problem I met when trying to implement support for it.

wleepang commented 4 years ago

While you can use a REST API with CodeCommit, you would need to handle the SigV4 request-signing process yourself. The AWS SDK encapsulates this, so actions such as getting file contents and reading branches would be SDK calls.

pditommaso commented 4 years ago

Therefore it needs to be authenticated via the AWS SDK for Java; that looks feasible.
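
A minimal sketch of what that could look like, assuming the AWS SDK for Java v1 (`com.amazonaws:aws-java-sdk-codecommit`). The method bodies are illustrative only, and the remaining `RepositoryProvider` overrides (e.g. `getEndpointUrl`, `getContentUrl`, `getRepositoryUrl`) are omitted:

    import com.amazonaws.services.codecommit.AWSCodeCommit
    import com.amazonaws.services.codecommit.AWSCodeCommitClientBuilder
    import com.amazonaws.services.codecommit.model.GetFileRequest
    import com.amazonaws.services.codecommit.model.GetRepositoryRequest

    class AwsCodeCommitRepositoryProvider extends RepositoryProvider {

        // credentials are resolved through the default provider chain
        // (environment, profile, instance or Batch job role, etc.)
        private AWSCodeCommit client = AWSCodeCommitClientBuilder.defaultClient()

        AwsCodeCommitRepositoryProvider(String project, ProviderConfig config) {
            this.project = project
            this.config = config
        }

        // the clone URL comes from the repository metadata; Nextflow then
        // hands it to the embedded Git client to pull the project
        @Override
        String getCloneUrl() {
            def request = new GetRepositoryRequest().withRepositoryName(project)
            return client.getRepository(request).getRepositoryMetadata().getCloneUrlHttp()
        }

        // reads a single file remotely, i.e. without cloning the repository,
        // e.g. to inspect nextflow.config before the pull
        @Override
        byte[] readBytes(String path) {
            def request = new GetFileRequest()
                    .withRepositoryName(project)
                    .withFilePath(path)
            def buffer = client.getFile(request).getFileContent()
            def bytes = new byte[buffer.remaining()]
            buffer.get(bytes)
            return bytes
        }
    }

Since `defaultClient()` resolves credentials through the default provider chain, the temporary-credentials scenario on Batch described above would work without extra configuration.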

wleepang commented 4 years ago

Is there somewhere else besides scm/ProviderConfig.groovy where the provider needs to be registered? I've got the AWS CodeCommit provider coded up, but the build doesn't recognize it. Instead, I get the following:

Unknown repository provider: `codecommit`. Did you mean?
  github
  gitlab
  gitea
  bitbucket

wleepang commented 4 years ago

In scm/ProviderConfig.groovy I have the following:

    static private void addDefaults(List<ProviderConfig> result) {
        if( !result.find{ it.name == 'github' })
            result << new ProviderConfig('github')

        if( !result.find{ it.name == 'gitlab' })
            result << new ProviderConfig('gitlab')

        if( !result.find{ it.name == 'gitea' })
            result << new ProviderConfig('gitea')

        if( !result.find{ it.name == 'bitbucket' })
            result << new ProviderConfig('bitbucket')

        if( !result.find{ it.name == 'codecommit' })
            result << new ProviderConfig('codecommit')
    }

And in scm/RepositoryProvider.groovy I have the following:

    static RepositoryProvider create( ProviderConfig config, String project ) {
        switch(config.platform) {
            case 'github':
                return new GithubRepositoryProvider(project, config)

            case 'bitbucket':
                return new BitbucketRepositoryProvider(project, config)

            case 'bitbucketserver':
                return new BitbucketServerRepositoryProvider(project, config)

            case 'gitlab':
                return new GitlabRepositoryProvider(project, config)

            case 'gitea':
                return new GiteaRepositoryProvider(project, config)

            case 'codecommit':
                return new AwsCodeCommitRepositoryProvider(project, config)

            case 'file':
                // remove the 'local' prefix for the file provider
                def localName = project.tokenize('/').last()
                return new LocalRepositoryProvider(localName, config)
        }

        throw new AbortOperationException("Unknown project repository platform: ${config.platform}")
    }

wleepang commented 4 years ago

I figured it out - I didn't see the instructions to run launch.sh after make compile.
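
For anyone following along, the local development loop is roughly the following (launch.sh runs the freshly compiled Nextflow with the same arguments as the regular nextflow command):

    make compile
    ./launch.sh run codecommit::<region>://<repository-name>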

pditommaso commented 2 years ago

It's time to resurrect this integration. There's already a proof of concept provided by @wleepang.

The remaining part is to integrate it with the Nextflow plugin mechanism: since it is a Git provider for AWS, it should be part of the nf-amazon plugin.

One side of the challenge is that the support for Git providers currently does not use the Nextflow plugin system. The RepositoryProvider should be refactored to use the plugin extension mechanism, e.g. along the lines of the sketch below.
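
For instance, since Nextflow plugins are based on pf4j, the refactoring could expose an extension point roughly like this. This is a hypothetical sketch: the interface name and methods are assumptions, not existing Nextflow API.

    import org.pf4j.ExtensionPoint

    // hypothetical extension point a plugin such as nf-amazon could
    // implement to contribute a Git repository provider
    interface RepositoryProviderFactory extends ExtensionPoint {

        // whether this factory handles the given platform id, e.g. 'codecommit'
        boolean supports(String platform)

        // build the provider for the given project and configuration
        RepositoryProvider create(ProviderConfig config, String project)
    }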

A second problem is that the plugins required by the pipeline execution are configured and loaded after the pipeline is fetched from the Git repo. This means the AWS plugin would have to be loaded ahead of the canonical plugin setup here.

We had a similar problem with remote file system loading, which may require a cloud plugin even if it has not been explicitly declared in the pipeline config.

The solution was to implement a plugin autostart mechanism that kicks in when a remote file scheme is detected. See here.

Likely the same approach can be implemented for the Git providers, e.g. along the lines of the sketch below. /cc @jorgeaguileraseqera
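
By analogy, the Git-provider hook could look roughly like this (a sketch: `Plugins.startIfMissing` mirrors the autostart call used for remote file schemes, while the helper itself is hypothetical):

    import nextflow.plugin.Plugins

    // hypothetical hook, run before the canonical plugin setup: start the
    // nf-amazon plugin whenever a CodeCommit repository scheme is detected
    static void autostartScmPlugin(String repository) {
        if( repository?.startsWith('codecommit') )
            Plugins.startIfMissing('nf-amazon')
    }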

pditommaso commented 2 years ago

Added a tentative implementation here. (I only now saw the WIP - hope there isn't too much overlap.)

pditommaso commented 2 years ago

Solved by 80fba6e9 and 296d7add