inab / WfExS-backend

Workflow Execution Service Backend
Apache License 2.0
16 stars 6 forks source link

Fix fetching from git #50

Closed dcl10 closed 1 year ago

dcl10 commented 1 year ago

Description

Previously, WfExS could only pull from specific remote sources; namely github, bitbucket and gitlab. However, self-hosted git repos may required in some instances. This PR removes the specificity and treats git as git.

This PR removes the capability to use raw.githubusercontent.com URLs to guess the repo. This is github specific and inherently fragile, because branches with slashes (/) in their names break this functionality. Use workflows as described by PyPI in workflow_id.

It is still possible to pull remote files individually using these urls, though.

jmfernandez commented 1 year ago

Dear @dcl10 , previously WfExS was able to pull from most of the custom scenarios (for instance https://github.com/inab/WfExS-backend/blob/0878f9eef933f848ebfc78ce73118b40d39c7346/workflow_examples/wetlab2variations_execution_nxf_giturl.wfex.stage#L1 ). It is also true that it supports the pseudo "github" scheme (as you can see at https://github.com/inab/WfExS-backend/blob/0878f9eef933f848ebfc78ce73118b40d39c7346/workflow_examples/hello/hello_cwl.wfex.stage#L1) , from the suggestions of @stain in past BioHackEU2002.

I like some of the changes you have proposed through the pull request, like the test battery. But I cannot accept them because they can break other scenarios, and there are cases where the newly supported URLs are ambiguous.

For instance, ssh://git@github.com:inab/WfExS-backend is not a standard ssh URI, because the colon is used to separate the host name from the SSH port declaration, as you can see at https://www.ietf.org/archive/id/draft-salowey-secsh-uri-00.html#anchor8 (and I would expect it is a "raw" repository, which no version control). So, the code should accept at most git+ssh://git@github.com/inab/WfExS-backend, following the rules from https://pip.pypa.io/en/stable/topics/vcs-support/

The same happens to file:///inab/WfExS-backend/.git#subdirectory=workflow_examples/ipc/cosifer_test1_cwl.wfex.stage, this should not be considered a git repository as the local directory should only be considered just that, a directory. The code should accept as a git repository git+file:///inab/WfExS-backend@0.1.2, as git command line accepts as cloning source other already pulled git repositories in the form of directories.

With your permission, I'm going to take the bits and ideas of implemented code from your pull request, and I'm going to integrate them in a separate branch, so you can revise them.

jmfernandez commented 1 year ago

The only URLs I'm still allowing as repository links are https ones from repos, due historical reasons: bio.tools and WorkflowHub workflow contributors are used to tell their workflows through web URLs instead of more reliable permanent identifiers, as they are not used to them (and they are not accepted by command line tools like git)

beforan commented 1 year ago

@jmfernandez I take your point about some of the ambiguous URLs, such as ssh URLs with a colon where that should be used to denote port, or pointing at the .git subdirectory.

Surely wfexs should support anything that the git client would, with something to denote that the intent is the URL points at a git repo. Then the pip convention of git+<URL> should suffice, and complying with the same behaviours as pip is desirable.

From our perspective, we just need a way to run a workflow from an airgapped environment. We do not have the luxury of pointing at github.com which is obviously by far the most common usage, because we cannot access the public internet.

We had hoped that git+ URLs would allow us to checkout any git repository, whether locally on disk, or in a locally run service, e.g. GitLab running at http://localhost:8080.

However we found that currently wfexs reports an error when attempting to checkout from those URLs, as an "unsupported http(s) repository source" or similar.

The intent (though perhaps not the result) of this PR is to resolve that issue, allowing working support for git+file://, git+http://, git+https:// and git+ssh:// URLs regardless of hostname (which currently appears to only allow github.com and gitlab.com, with comments suggesting the addition of bitbucket.org in future...)

Can we work together to make that functionality work? Or otherwise make execution of a workflow from a local path e.g. file:///absolute/path/to/my/workflow.cwl work?

We are blocked until we have working workflow execution from a workflow source that is not a TRS endpoint, or a supported public git service (github, gitlab...), but the changes included in this PR unblocks us.

Obviously appreciate your input as you know the codebase much better than us, and the ramifications of any changes beyond what we see as fixes!

beforan commented 1 year ago

example workflow URLs we tried which failed with errors prior to the changes in this branch looked as follows:

jmfernandez commented 1 year ago

Hi, I'm going to merge your contributions in a new branch, so I can remediate some possible internal side effects