This document describes how to deploy the infrastructure needed to evaluate the performance of an NLP Tool submitted to the NLP Sandbox.
One of the features of the NLP Sandbox is the ability for NLP developers to submit their NLP Tool once and then have it evaluated on multiple Data Hosting Sites.
The figure below represents how the infrastructure deployed on a Data Hosting Site evaluates the performance of a tool submitted by a developer to the NLP Sandbox.
The submission workflow is composed of these steps: once the Orchestrator detects a `RECEIVED` submission, it will start running a workflow with the submission as its input. The steps of the workflow are outlined in `workflow.cwl`.
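Conceptually, the Orchestrator's trigger condition can be sketched as follows. This is a simplified illustration only; the `Submission` class and `pick_runnable` helper are hypothetical stand-ins, not actual SynapseWorkflowOrchestrator code.

```python
# Simplified sketch: the Orchestrator launches workflow.cwl for every
# submission whose status is RECEIVED. Names here are illustrative.
from dataclasses import dataclass

@dataclass
class Submission:
    id: str
    status: str  # e.g. RECEIVED, EVALUATION_IN_PROGRESS, ACCEPTED

def pick_runnable(submissions):
    """Return the submissions the Orchestrator should start a workflow for."""
    return [s for s in submissions if s.status == "RECEIVED"]
```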
To be an NLP Sandbox data hosting site, the site must be able to host 4 main technology stacks via Docker. Here are the requirements:
`docker-compose` version 1.25.5 or higher (`docker compose` is a built-in subcommand only in environments that have Docker Desktop: macOS and Windows with Docker 20.10.6+). Ideally, for performance, the Data Node, Synapse Workflow Orchestrator, and ELK stack are hosted on different servers (e.g. EC2 instances), but they can technically all be deployed on one server/machine.
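To verify the installed `docker-compose` meets the 1.25.5 minimum, a small sketch (assumes `docker-compose version --short` prints something like `1.25.5`; this helper is not part of any official tooling):

```python
# Check that the local docker-compose meets the 1.25.5 minimum version.
import subprocess

MINIMUM = (1, 25, 5)

def parse_version(text):
    """Turn '1.25.5' (possibly with a suffix) into a comparable tuple."""
    parts = text.strip().split(".")[:3]
    return tuple(int("".join(ch for ch in p if ch.isdigit()) or 0) for p in parts)

def compose_is_new_enough():
    # `docker-compose version --short` prints just the version string
    out = subprocess.run(
        ["docker-compose", "version", "--short"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_version(out) >= MINIMUM
```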
We recommend running this service on a machine with at least 2 CPUs, 8GB of RAM, and 300GB of disk space.
git clone https://github.com/nlpsandbox/data-node.git
cd data-node
cp .env.example .env
docker-compose up -d
The `scripts` directory in this repository is for Sage Bionetworks use only. Please use the script below to push an example dataset.
# set up conda or pipenv environment
pip install nlpsandbox-client
git clone https://github.com/nlpsandbox/nlpsandbox-client.git
# cd to the ~/nlpsandbox-client directory
# change the line `host = "http://localhost:8080/api/v1"` to point to your host http://yourhost.com:8080/api/v1
vi examples/push_dataset.py
# Downloads and pushes challenge data
python examples/push_dataset.py
The `dataset_id` should be made up of `{dataset_name}-{dataset_version}`. We recommend using the creation date as the `dataset_version`; an example would be `sagedataset-20201125`. The `fhir_store_id` must be `evaluation` and the `annotation_store_id` must be `goldstandard`.
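The naming convention above can be captured in a small helper. This is illustrative only; `nlpsandbox-client` does not ship these functions.

```python
# Build and check dataset_ids following the {dataset_name}-{dataset_version}
# convention, where the version is the creation date, e.g. sagedataset-20201125.
# Illustrative helpers only; not part of nlpsandbox-client.
import datetime
import re

def make_dataset_id(dataset_name, created=None):
    created = created or datetime.date.today()
    return f"{dataset_name}-{created:%Y%m%d}"

def is_valid_dataset_id(dataset_id):
    """Loosely check the {name}-{YYYYMMDD} shape recommended above."""
    return re.fullmatch(r".+-\d{8}", dataset_id) is not None
```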
docker run nlpsandbox/cli:4.1.1 datanode list-datasets --data_node_host <your.datanode.ip>/api/v1
We recommend running this service on a machine with at least 4 CPUs, 16GB of RAM, and 300GB of disk space. See the Submission workflow section for what this tool does.
docker network create --internal submission
git clone https://github.com/Sage-Bionetworks/SynapseWorkflowOrchestrator.git
cd SynapseWorkflowOrchestrator
cp .envTemplate .env
and configure it. Sage Bionetworks uses the service account `nlp-sandbox-bot` and the `EVALUATION_TEMPLATES` below, but these values will be different per data hosting site.
SYNAPSE_USERNAME=nlp-sandbox-bot # The data hosting site will have to create its own Synapse service account.
SYNAPSE_PASSWORD=
EVALUATION_TEMPLATES={"queueid": "syn25585023"} # The queueid will be provided to the site by Sage Bionetworks. syn25585023 is the internal workflow synapse id.
WORKFLOW_OUTPUT_ROOT_ENTITY_ID=synid # This value will be provided to the site by Sage Bionetworks.
# WES_ENDPOINT=http://localhost:8082/ga4gh/wes/v1 # This line must stay commented out
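Before starting the Orchestrator, it can help to sanity-check the `.env` values. A minimal sketch (the required-key list mirrors the example above; this is not an official tool):

```python
# Sanity-check the Orchestrator .env file: required keys are present and
# EVALUATION_TEMPLATES parses as JSON. Illustrative sketch only.
import json

REQUIRED = ["SYNAPSE_USERNAME", "SYNAPSE_PASSWORD",
            "EVALUATION_TEMPLATES", "WORKFLOW_OUTPUT_ROOT_ENTITY_ID"]

def check_env(text):
    """Return the list of required keys that are missing or empty."""
    env = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    if env.get("EVALUATION_TEMPLATES"):
        json.loads(env["EVALUATION_TEMPLATES"])  # raises if malformed
    return [k for k in REQUIRED if not env.get(k)]
```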
docker-compose up -d
docker volume create portainer_data
docker run -d -p 8000:8000 -p 9000:9000 --name=portainer --restart=always -v /var/run/docker.sock:/var/run/docker.sock -v portainer_data:/data portainer/portainer-ce
Add the `logspout` service below to the SynapseWorkflowOrchestrator `docker-compose.yaml`. The `ROUTE_URIS` value will be different from the Sage Bionetworks site. We recommend running the ELK service on a machine with at least 4 CPUs, 16GB of RAM, and 300GB of disk space.
logspout:
  image: bekt/logspout-logstash
  restart: on-failure
  environment:
    - ROUTE_URIS=logstash://10.23.60.253:5000 # Only for Sage Bionetworks
    - LOGSTASH_TAGS=docker-elk
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
where `10.23.60.253` is the IP address of your external ELK server.
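To catch typos in the `ROUTE_URIS` value before restarting the stack, a small sketch (logspout itself does the authoritative parsing; this check is illustrative):

```python
# Validate that a ROUTE_URIS value looks like logstash://<host>:<port>.
# Illustrative sketch only; logspout does the real parsing.
from urllib.parse import urlparse

def valid_route_uri(uri):
    parsed = urlparse(uri)
    return (parsed.scheme == "logstash"
            and bool(parsed.hostname)
            and parsed.port is not None)
```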
A solution to track Docker container logs is a requirement for being a data hosting site. This is because the tool services submitted by participants are hosted as Docker containers, and if there are issues with a service, its logs will have to be returned to the participant. We suggest using the ELK stack (instructions below), but there are plenty of other methods you can use to capture Docker logs.
git clone https://github.com/nlpsandbox/docker-elk.git
cd docker-elk
Change the elastic passwords in each of these locations:
docker-compose.yml
kibana/config/kibana.yml
logstash/config/logstash.yml
logstash/pipeline/logstash.conf
docker-compose -f docker-compose.yml -f extensions/logspout/logspout-compose.yml up -d --build
You will have to add `logspout` to the SynapseWorkflowOrchestrator `docker-compose.yaml` if running the services on different machines.
If running all services on one machine, change the `kibana` port in the `docker-compose.yml`, or else there is a chance that you will run into a `port is already allocated` error.
ports:
  - "80:5601" # Change 80 to an open port
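To pick an open host port before editing the mapping, a quick check can be sketched like this (illustrative; any port scanner works just as well):

```python
# Check whether a TCP port is free to bind on this host before mapping it
# in docker-compose.yml. Illustrative sketch.
import socket

def port_is_free(port, host="0.0.0.0"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```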
Use `logstash-*` as your index pattern and click "I don't want to use the time filter".

git clone https://github.com/nlpsandbox/date-annotator-example.git
cd date-annotator-example
If running all services on one machine, make sure the port is changed to avoid a `port is already allocated` error.
ports:
  - "80:80" # Change the first 80 to an open port
Start the service
docker-compose up -d
Get example notes:
nlp-cli datanode list-notes --data_node_host http://0.0.0.0/api/v1 --dataset_id 2014-i2b2-20201203-subset --fhir_store_id evaluation --output example_notes.json
Annotate notes:
nlp-cli tool annotate-note --annotator_host http://0.0.0.0:8080/api/v1 --note_json example_notes.json --tool_type nlpsandbox:date-annotator
The scoring is done as part of the workflow, but here are the steps to score submissions manually.
Install `nlpsandbox-client`:
pip install nlpsandbox-client
nlp-cli datanode list-annotations --data_node_host http://0.0.0.0/api/v1 --dataset_id 2014-i2b2-20201203 --annotation_store_id goldstandard --output goldstandard.json
nlp-cli datanode list-annotations --data_node_host http://0.0.0.0/api/v1 --dataset_id 2014-i2b2-20201203 --annotation_store_id submission-111111 --output sub.json
nlp-cli evaluate-prediction --pred_filepath sub.json --gold_filepath goldstandard.json --tool_type nlpsandbox:date-annotator
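Under the hood, `evaluate-prediction` compares predicted annotations to the gold standard. As a rough illustration of the idea only (the real nlpsandbox-client metric code is more involved), exact-match precision/recall/F1 over `(note_id, start, length)` spans:

```python
# Rough illustration of span-level scoring: exact-match precision, recall,
# and F1 over (note_id, start, length) tuples. Not the actual
# nlpsandbox-client implementation; it conveys the idea only.
def score(pred_spans, gold_spans):
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)  # spans predicted exactly as annotated
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```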
The infrastructure is created through CloudFormation templates. Important notes:
On top of the quota-checking system built into `annotate_note.py`, there has to be some safeguard to make sure that submissions over the time quota are stopped. This is because the submission runtime check happens within a for loop; if a `docker run` command happens to get stuck forever, the submission will never be deemed over the quota. There is a `stop-submission-over-quota` function in `challengeutils`; unfortunately, this function requires a submission view as input, and there is a high likelihood that each queue could have a different runtime. Therefore, we will not be using this function.
python scripts/reject_submissions.py
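The core safeguard amounts to bounding the runtime of the `docker run` step itself, so a stuck container cannot sit in the loop forever. A sketch of that idea (the quota value is made up, and this is not the actual `reject_submissions.py` logic):

```python
# Sketch of a hard runtime bound around a long-running command, so a stuck
# `docker run` cannot block the quota check forever. The quota value is
# illustrative; this is not the actual reject_submissions.py implementation.
import subprocess

RUNTIME_QUOTA_SEC = 2  # illustrative quota

def run_with_quota(cmd, quota=RUNTIME_QUOTA_SEC):
    """Run cmd; return True if it finished within quota, False if killed."""
    try:
        subprocess.run(cmd, timeout=quota)
        return True
    except subprocess.TimeoutExpired:
        # Caller would then mark the submission as over the time quota.
        return False
```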
This repository hosts the CWL workflow and tools required to set up the model-to-data challenge infrastructure for the NLP Sandbox. For more information about the tools, please head to ChallengeWorkflowTemplates.
pip3 install cwltool
- id: dataset_name
  type: string
  default: "2014-i2b2" # change this
- id: dataset_version
  type: string
  default: "20201203" # change this
- id: api_version
  type: string
  default: "1.0.1" # change this
cwltool workflow.cwl --submissionId 12345 \
--adminUploadSynId syn12345 \
--submitterUploadSynId syn12345 \
--workflowSynapseId syn12345 \
--synapseConfig ~/.synapseConfig
where:
- `submissionId` - ID of the Synapse submission to process
- `adminUploadSynId` - ID of a Synapse folder accessible only to the submission queue administrator
- `submitterUploadSynId` - ID of a Synapse folder accessible to the submitter
- `workflowSynapseId` - ID of the Synapse entity containing a reference to the workflow file(s)
- `synapseConfig` - filepath to your Synapse credentials

Run `scripts/push_data.py`:
# All data
python scripts/push_data.py syn25891742
# Subsetted data - example
python scripts/push_data.py syn25891740
For scheduled and unscheduled maintenance, the main queues should be closed so that participants won't be able to submit to them. To do so, run this script:
# This will revoke submit permissions for the NLP sandbox users
python scripts/toggle_queue.py close
# This will give submit permissions for the NLP sandbox users
python scripts/toggle_queue.py open
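The open/close toggle boils down to which access types the NLP Sandbox users hold on the evaluation queue. A hedged sketch of that mapping (the real `scripts/toggle_queue.py` applies this via Synapse; the function name here is made up, and the access-type strings follow Synapse conventions but are illustrative):

```python
# Sketch of the open/close toggle: map the action to the access types the
# NLP Sandbox users should hold on the evaluation queue. The real
# scripts/toggle_queue.py applies this to Synapse; this mapping is
# illustrative only.
def queue_access_types(action):
    if action == "open":
        return ["READ", "SUBMIT"]  # participants may view and submit
    if action == "close":
        return ["READ"]            # submit permission revoked
    raise ValueError(f"unknown action: {action}")
```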