Add submissions module - Githubissues

thomasyu888 commented 4 years ago

Get all the submission objects submitted in the past. We could add an option to show only the submissions with a given status (e.g. Running, Completed), or to show only the last N submissions (e.g. --head 5).
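A minimal sketch of the filtering described above, assuming a hypothetical `Submission` record with `status` and `created_on` fields (the names are illustrative, not the client's actual schema). "Last N" is read here as "the N most recent":

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Submission:
    """Hypothetical submission record; field names are assumptions."""
    id: str
    status: str
    created_on: datetime


def list_submissions(
    submissions: Iterable[Submission],
    status: Optional[str] = None,
    head: Optional[int] = None,
) -> List[Submission]:
    """Return submissions newest first, optionally filtered by status
    (case-insensitive) and truncated to the `head` most recent ones."""
    selected = [
        s for s in submissions
        if status is None or s.status.lower() == status.lower()
    ]
    selected.sort(key=lambda s: s.created_on, reverse=True)
    return selected if head is None else selected[:head]
```

A `--status Completed --head 5` invocation would then map to `list_submissions(subs, status="Completed", head=5)`.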

tschaffter commented 4 years ago

Thanks for creating this ticket.

tschaffter commented 3 years ago

@thomasyu888 Before answering https://github.com/nlpsandbox/nlpsandbox-controller/issues/39 (submission ID, run ID), let's discuss how users can submit an NLP tool for training/evaluation. We currently don't support training, but that is a separate discussion that we should have soon in https://github.com/nlpsandbox/nlpsandbox-controller/issues/24.

Let's imagine that we want to enable the user to submit a tool for evaluation (with optional training) using one command of the client. What should be the information that the user must provide to the client? Also, it should be possible to use this command or a similar one to submit as part of a CI/CD workflow.

Submission during CI/CD workflow would probably look like this:

the user creates a release of their tool
this triggers a GH workflow that lints, tests and publishes a Docker image to a Docker registry
if the Docker image has been successfully published, the next job in the workflow submits the Docker image to the NLP Sandbox for evaluation. This could be achieved with a single command of the client.

Submission request body:

Docker image repository + digest
If the docker image is private, an API key to access the image (Synapse API Key, DockerHub API Key)
The ID of the submission queue
The Synapse credentials (API Key) to associate the submission to the user's Synapse account

Response:

ID of the submission?
URL to get access to the status of the submission?
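The request and response above could be modeled roughly as follows. This is only a sketch: all class and field names are hypothetical, and restricting digests to the `sha256:<64 hex characters>` form is an assumption made for the example.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Optional

# Assumption: only sha256 digests are accepted.
_DIGEST = re.compile(r"sha256:[0-9a-f]{64}")


@dataclass(frozen=True)
class SubmissionRequest:
    """Hypothetical request body mirroring the fields listed above."""
    docker_repository: str
    docker_digest: str
    queue_id: str
    synapse_api_key: str
    registry_api_key: Optional[str] = None  # only needed for private images

    def __post_init__(self) -> None:
        if not _DIGEST.fullmatch(self.docker_digest):
            raise ValueError(f"not a sha256 digest: {self.docker_digest!r}")

    def to_json(self) -> str:
        # Drop unset optional fields so public images send no registry key.
        body = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(body, sort_keys=True)


@dataclass(frozen=True)
class SubmissionResponse:
    """Hypothetical response: the two candidate fields from the list above."""
    submission_id: str
    status_url: str
```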

thomasyu888 commented 3 years ago

@tschaffter, there are multiple things you touched on here which I will address.

We could include docker push in the submit command or leave it out, but I think it's important for participants to know how to docker build on their own. One of the biggest issues is that you don't get a digest until you push the docker image.
All of the below assumes Synapse as the backend for submissions. The reason is that we would otherwise need to come up with at least the following (if not more) to have an nlpsandbox-submission API:
- a robust team API so that people can submit as a team (and link their submission)
- Submission quotas
- Submission permissions / caching: currently, when people submit docker repos via Synapse, the admins of the evaluation queues have full access to the submitted docker_repo+docker_digest and the submission is cached
- Probably others that I am missing.
The submission request body you specified looks fine, but if they are able to build and push an image, they won't need the API key to access the image (it'll have already been pushed). I also don't particularly like having the credentials in the request body. The most basic requirements for a Docker submission are probably:
- docker image
- docker digest
- submitter name or id
- queue id

Using Synapse

Now that I have prefaced this, I will talk about my solution using Synapse. There is already a command line function to submit the image. That being said, it's really not as straightforward to use as it should be, so I imagine the commands could look something like this:

# This would be the entire workflow in the command line

## 1. Create synapse project, synapseclient will be installed due to nlpsandbox-cli dependency
synapse create Project --name "My Challenge Project"
# A synapse id will be returned from this, let's just say it's syn12345

## 2. Log into docker.synapse.org, then build and push the image

docker login docker.synapse.org
docker build -t docker.synapse.org/syn12345/my-model:latest .
docker push docker.synapse.org/syn12345/my-model:latest  

# A synapse id and a shadigest will be created for this image.  But I want the submit command to be:

## 3. Submit to evaluation queue
nlp-cli submit --docker_repo docker.synapse.org/syn111111/my-model:latest --projectId syn12345 --annotator_type {date, person, address} --teamId myteam_name

Step 3 will actually contain steps of its own:

Query through syn12345 for specified docker repo and tag (must specify the tag)
Obtain synapse id + shadigest that correlates to the specific docker repo+tag specified
Submit to evaluation queue
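The three sub-steps of step 3 can be sketched as below. This is an illustration under stated assumptions, not the actual client: `ProjectIndex` stands in for the query against the Synapse project, and `submit` is a hypothetical callable wrapping whatever call submits to the evaluation queue.

```python
from typing import Callable, Dict, Tuple

# (repository, tag) -> (Synapse ID, digest). Hypothetical stand-in for the
# result of querying the Synapse project for its docker repositories.
ProjectIndex = Dict[Tuple[str, str], Tuple[str, str]]


def split_repo_tag(reference: str) -> Tuple[str, str]:
    """Split 'repo:tag', rejecting references that carry no explicit tag."""
    repo, sep, tag = reference.rpartition(":")
    # A '/' after the last ':' means the colon belonged to a registry port.
    if not sep or not repo or "/" in tag:
        raise ValueError(f"a tag must be specified: {reference!r}")
    return repo, tag


def submit_docker_repo(
    reference: str,
    index: ProjectIndex,
    submit: Callable[[str, str], str],
) -> str:
    """Resolve repo+tag to Synapse ID + digest, then hand both to `submit`."""
    repo, tag = split_repo_tag(reference)
    try:
        synapse_id, digest = index[(repo, tag)]
    except KeyError:
        raise LookupError(f"{reference!r} not found in the project") from None
    return submit(synapse_id, digest)
```

Injecting `submit` keeps the lookup logic testable without a Synapse account, which also suits a CI/CD job.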

All of this can be achieved in a CI/CD workflow

Synapse steps simplified

Now that I have gone through the whole workflow, let's try to simplify it a bit and just have participants build their docker image:

docker build -t my-model:my-tag .
nlp-sandbox submit --docker-image  my-model:my-tag --teamid team... --workspace_name

This submit function will essentially do:

Create a Synapse project for the team if it doesn't already exist (workspace name)
Push the docker repository into Synapse Project (obtain the digest)
Submit the docker repository (obtain synapse id + digest)
success
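For the push step, the submit function would need to map the participant's local image to a name inside the Synapse project, following the `docker.synapse.org/<project id>/<name>:<tag>` pattern used earlier. A small sketch, where the function names are hypothetical and dropping any local registry prefix is an assumption of this example:

```python
from typing import List

SYNAPSE_REGISTRY = "docker.synapse.org"


def synapse_image_name(local_image: str, project_id: str) -> str:
    """Map a local image such as my-model:my-tag to its name in the project."""
    name, sep, tag = local_image.rpartition(":")
    if not sep or "/" in tag:
        # No tag given (any colon belonged to a registry port);
        # Docker's default tag is "latest".
        name, tag = local_image, "latest"
    # Keep only the last path component so a local registry prefix is dropped.
    name = name.rsplit("/", 1)[-1]
    return f"{SYNAPSE_REGISTRY}/{project_id}/{name}:{tag}"


def push_commands(local_image: str, project_id: str) -> List[List[str]]:
    """Return the docker commands that re-tag and push the local image."""
    remote = synapse_image_name(local_image, project_id)
    return [
        ["docker", "tag", local_image, remote],
        ["docker", "push", remote],
    ]
```

Each returned command could be executed with `subprocess.run(cmd, check=True)`; building them as lists keeps the mapping testable without a Docker daemon.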

Not using Synapse.

There's a lot I can say here, but I'll just touch on one key point: I think one of the main issues currently is that things are so tied to projects, and projects have to be uniquely named. At the end of the day, it would be helpful for us to have a "copy" of the docker submission. So when people submit, I think it's most efficient if they are "pushing" their docker image into a private registry that we already have access to, without having to create a workspace themselves. (They can specify a token)

tschaffter commented 3 years ago

> We could include or exclude docker push out of the submit command, but I think it's important for participants to know how to docker build on their own. One of the biggest issues is that you don't get a digest until you push the docker image.

Pushing the docker image should be done by the user using Docker tools so that we keep the client as simple as possible, development and maintenance wise.

This approach also offers more flexibility to the user. Example: our example annotators already have a job in the GH CI/CD workflow to lint, test and publish the docker image to DockerHub. There are already GH Actions that help with pushing Docker images to registries. This workflow is standard. In order to submit the annotator for evaluation to the NLP Sandbox, the user should add one job to the CI/CD workflow that leverages the previous job (publication of the docker image).

> The submission request body you specified looks fine, but if they are able to build and push an image, they won't need the API key to access the image (it'll have already been pushed)

The API key would be for us (the infrastructure) to pull a private image from DockerHub

Notes

I would be in favor of requiring submitters to always submit as part of a team, i.e. we would ask individual users to create a team in order to submit. This is likely how the challenge platform will work, since a team can be extended, whereas shifting from submitting as a single user to submitting as a team is cumbersome and unintuitive as done in Synapse. If we agree on this, a teamId could be one of the required properties of the submission object.

thomasyu888 commented 3 years ago

Thanks @tschaffter, that is what I originally had in mind as well. I think participants should do the docker push, but I wanted to keep the options open. One thing that we will really need to tackle is the ability to pull down private images and cache the submission. What we are actually missing are scoped access tokens per private docker repository. I don't think we should collect API keys from participants, as these API keys can do more than just "get one specific docker repository". It would be a security concern.

Furthermore, I still haven't fully added support for DockerHub repositories, but what we can do is have participants submit a YAML or JSON file:

docker_image: my-model
docker_digest: sha25....
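Reading such a file could look like the sketch below. It covers only the JSON variant, since that needs nothing beyond the standard library. The two keys come from the example above; the function name and the decision to ignore extra keys are assumptions.

```python
import json
from typing import Dict

REQUIRED_KEYS = ("docker_image", "docker_digest")


def parse_submission_file(text: str) -> Dict[str, str]:
    """Parse a JSON submission file and return only the required keys."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("submission file must contain a JSON object")
    # Treat absent and empty values alike: both make the submission unusable.
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError(f"missing required keys: {', '.join(missing)}")
    return {key: str(data[key]) for key in REQUIRED_KEYS}
```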

This complicates things a bit, because participants must also share the image with a service user (dockerhub/synapse docker registry) so that this service user can access it, whereas right now when people submit a docker repository on Synapse, the admins of the queue automatically have access to it, AND the docker repo+digest is cached "forever". What we can do to circumvent these issues is to:

invalidate docker images and digest that we don't have access to
tag the docker image and store it in our own docker registry so we have a "copy" of the submission
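These two measures could be sketched as follows. Everything named here is hypothetical: `can_pull` stands in for whatever access check the infrastructure performs, and `registry.example.org/submissions` is a placeholder for the organizers' own registry.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical registry owned by the organizers (placeholder host).
ARCHIVE_REGISTRY = "registry.example.org/submissions"


def triage(
    images: Iterable[str],
    can_pull: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split image references into (to_archive, to_invalidate)."""
    to_archive: List[str] = []
    to_invalidate: List[str] = []
    for image in images:
        (to_archive if can_pull(image) else to_invalidate).append(image)
    return to_archive, to_invalidate


def archive_commands(image: str, submission_id: str) -> List[List[str]]:
    """Docker commands that copy a submitted image into the archive registry."""
    copy = f"{ARCHIVE_REGISTRY}/{submission_id}:archived"
    return [
        ["docker", "pull", image],
        ["docker", "tag", image, copy],
        ["docker", "push", copy],
    ]
```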

I am also in favor of having participants submit as a team, this would be the steps to getting started with teams:

register for the challenge
create a team
register the team to the challenge

I can invalidate all submissions that don't have a team id. I could even make a convenience function that creates or uses an existing team for users and auto registers that team to the challenge.
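A rough sketch of both ideas, assuming a submission is a mapping with an optional `teamId` key. The status strings, function names and the `create` callback are illustrative assumptions, not an existing API.

```python
from typing import Callable, Dict, Mapping, Optional


def review_team(submission: Mapping[str, Optional[str]]) -> str:
    """Return 'INVALID' for submissions without a team id, else 'ACCEPTED'."""
    team_id = (submission.get("teamId") or "").strip()
    return "ACCEPTED" if team_id else "INVALID"


def get_or_create_team(
    teams: Dict[str, str],
    name: str,
    create: Callable[[str], str],
) -> str:
    """Reuse the team id registered under `name`, creating it on first use.

    `create` is a hypothetical callback that creates the team, registers it
    to the challenge, and returns the new team id.
    """
    if name not in teams:
        teams[name] = create(name)
    return teams[name]
```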

nlpsandbox / nlpsandbox-client

Add submissions module #25
