googlegenomics / dockerflow

Dockerflow is a workflow runner that uses Dataflow to run a series of tasks in Docker with the Pipelines API
Apache License 2.0

Authentication questions #4

Closed: seandavi closed this issue 8 years ago

seandavi commented 8 years ago

Thanks for the new project! This looks quite interesting. I wanted to give it a quick test and ran into the following problem. I have enabled the Cloud Dataflow API (and the others), and I thought that would be enough to run Dockerflow workflows. What did I miss?

java -jar target/dockerflow-0.0.1-SNAPSHOT-jar-with-dependencies.jar --project='sean-davis' --workflow-file=src/test/resources/linear-graph.yaml --workspace=gs://gbseqdata/dataflow
Sep 17, 2016 4:22:15 PM com.google.cloud.genomics.dockerflow.Dockerflow main
INFO: Local working directory: /Users/sdavis2/Documents/git/dockerflow/.
Sep 17, 2016 4:22:15 PM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory pipelineOptions
INFO: Set up Dataflow options
Sep 17, 2016 4:22:16 PM com.google.cloud.genomics.dockerflow.Dockerflow main
INFO: Creating workflow from file src/test/resources/linear-graph.yaml
Sep 17, 2016 4:22:16 PM com.google.cloud.genomics.dockerflow.workflow.WorkflowFactory load
INFO: Load workflow: src/test/resources/linear-graph.yaml
Sep 17, 2016 4:22:16 PM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: src/test/resources/linear-graph.yaml for class class com.google.cloud.genomics.dockerflow.workflow.Workflow
Sep 17, 2016 4:22:16 PM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /Users/sdavis2/Documents/git/dockerflow/src/test/resources/task-one.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 17, 2016 4:22:16 PM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /Users/sdavis2/Documents/git/dockerflow/src/test/resources/task-two.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 17, 2016 4:22:16 PM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Initializing dataflow pipeline
Sep 17, 2016 4:22:17 PM com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner fromOptions
INFO: PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 1 files. Enable logging at DEBUG level to see which files will be staged.
Sep 17, 2016 4:22:17 PM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Creating input collection of workflow args
Sep 17, 2016 4:22:17 PM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Merging default workflow args with instance-specific args
Sep 17, 2016 4:22:17 PM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Creating dataflow pipeline for workflow LinearGraph
Sep 17, 2016 4:22:17 PM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow LinearGraph
Sep 17, 2016 4:22:17 PM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding steps
Sep 17, 2016 4:22:17 PM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: stepOne
Sep 17, 2016 4:22:17 PM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: stepTwo
Sep 17, 2016 4:22:17 PM com.google.cloud.genomics.dockerflow.Dockerflow main
INFO: Running Dataflow job LinearGraph
To cancel the individual Docker steps, run:
> gcloud alpha genomics operations cancel OPERATION_ID
Sep 17, 2016 4:22:17 PM com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner run
INFO: Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services.
Sep 17, 2016 4:22:17 PM com.google.cloud.dataflow.sdk.util.PackageUtil stageClasspathElements
INFO: Uploading 1 files from PipelineOptions.filesToStage to staging location to prepare for execution.
Sep 17, 2016 4:22:17 PM com.google.cloud.dataflow.sdk.util.PackageUtil stageClasspathElements
INFO: Uploading PipelineOptions.filesToStage complete: 0 files newly uploaded, 1 files cached
Dataflow SDK version: 1.7.0
Sep 17, 2016 4:22:18 PM com.google.cloud.dataflow.sdk.util.RetryHttpRequestInitializer$LoggingHttpBackoffUnsuccessfulResponseHandler handleResponse
WARNING: Request failed with code 403, will NOT retry: https://dataflow.googleapis.com/v1b3/projects/sean-davis/jobs
Exception in thread "main" java.lang.RuntimeException: Failed to create a workflow job: (54f6f1e3fc57c297): Could not create workflow; user does not have write access to project: sean-davis Causes: (54f6f1e3fc57cb72): Permission 'dataflow.jobs.create' denied on project: 'sean-davis'
    at com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.run(DataflowPipelineRunner.java:637)
    at com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.run(DataflowPipelineRunner.java:201)
    at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:180)
    at com.google.cloud.genomics.dockerflow.Dockerflow.main(Dockerflow.java:230)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "(54f6f1e3fc57c297): Could not create workflow; user does not have write access to project: sean-davis Causes: (54f6f1e3fc57cb72): Permission 'dataflow.jobs.create' denied on project: 'sean-davis'",
    "reason" : "forbidden"
  } ],
  "message" : "(54f6f1e3fc57c297): Could not create workflow; user does not have write access to project: sean-davis Causes: (54f6f1e3fc57cb72): Permission 'dataflow.jobs.create' denied on project: 'sean-davis'",
  "status" : "PERMISSION_DENIED"
}
    at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1065)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
    at com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.run(DataflowPipelineRunner.java:624)
    ... 3 more
jbingham commented 8 years ago

(Sorry I didn't see this until now.)

Have you tried running gcloud auth login to make sure you have a valid credential?

If yes, your default cloud project might be different from the one where you want to run Dockerflow. To change it, you can run gcloud init.

One of those ought to fix it. If not, maybe Dockerflow can't write to the bucket you specified as your workspace.
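Roughly, the checks look like this (MY-PROJECT is a placeholder for your project ID):

> gcloud auth login
> gcloud config list
> gcloud config set project MY-PROJECT

gcloud config list shows the account and project currently in effect, and gcloud config set project switches projects without redoing the full gcloud init flow.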

vardaofthevalier commented 8 years ago

I'm running into a similar authentication error, but I had assumed I could just use the default service account for authentication, since that's what I've been doing with all of my other Pipelines API requests so far. Here's the full output from my attempt to run a workflow, with some paths to (potentially) sensitive data redacted:

Sep 22, 2016 12:32:35 AM com.google.cloud.genomics.dockerflow.Dockerflow main
INFO: Local working directory: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/.
Sep 22, 2016 12:32:35 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory pipelineOptions
INFO: Set up Dataflow options
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.Dockerflow main
INFO: Creating workflow from file germline-vc.yaml
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.WorkflowFactory load
INFO: Load workflow: germline-vc.yaml
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: germline-vc.yaml for class class com.google.cloud.genomics.dockerflow.workflow.Workflow
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/gdc-dl.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/samtools-idx.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/gatk.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/varscan.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/pindel.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/combine.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/vcf2bq.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Initializing dataflow pipeline
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Creating input collection of workflow args
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Merging default workflow args with instance-specific args
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Creating dataflow pipeline for workflow germline-vc
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow germline-vc
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: gdcDl
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow gdcDl
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: gdcDl
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: samtoolsIdx
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow samtoolsIdx
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: samtoolsIdx
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a BRANCH
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Branch count: 3
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Adding branch
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: gatk
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow gatk
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: gatk
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Adding branch
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: varscan
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow varscan
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: varscan
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Adding branch
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: pindel
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow pindel
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: pindel
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: combine
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow combine
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: combine
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: vcf2bq
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow vcf2bq
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: vcf2bq
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding steps
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: gdcDl
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: samtoolsIdx
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Pipeline splits into branches. Adding branches
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory branches
INFO: Branch count: 3
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory branches
INFO: Adding branch
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: gatk
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory branches
INFO: Adding branch
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: varscan
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory branches
INFO: Adding branch
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: pindel
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory branches
INFO: Merging 3 branches
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: combine
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: vcf2bq
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.Dockerflow main
INFO: Running Dataflow job germline-vc
To cancel the individual Docker steps, run:
> gcloud alpha genomics operations cancel OPERATION_ID
Sep 22, 2016 12:32:36 AM com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner run
INFO: Executing pipeline using the DirectPipelineRunner.
Sep 22, 2016 12:32:37 AM com.google.cloud.genomics.dockerflow.transform.DockerDo$StartTask processElement
INFO: Preparing to start task gdcDl for key args
Sep 22, 2016 12:32:37 AM com.google.cloud.genomics.dockerflow.task.Task resolvePaths
INFO: Resolving paths vs gs://germline-testing/dockerflow/gdcDl/
Sep 22, 2016 12:32:37 AM com.google.cloud.genomics.dockerflow.transform.DockerDo$StartTask processElement
INFO: WorkflowArgs: {
  "basePath": "gs://germline-testing/dockerflow/${workflow.element}/",
  "runIndex": 0,
  "maxTries": 3,
  "abortOnError": true,
  "deleteIntermediateFiles": false,
  "resumeFailedRun": false,
  "projectId": "isb-cgc-data-02",
  "inputs": {
    "UUID": "0001801b-54b0-4551-8d7a-d66fb59429bf",
    "JAVASIZE": "5",
    "DISK_SIZE": "100",
    "SERVICE_ACCT": "REQUIRED",
    "GDC_DATA_DESTINATION": <REDACTED>,
    "RESULTS_DESTINATION": <REDACTED>,
    "PROJECT_ID": "isb-cgc-data-02",
    "GDC_TOKEN_URI": <REDACTED>,
    "BAM_FILENAME": "C345.TCGA-B0-5094-11A-01D-1421-08.5_gdc_realn.bam",
    "BAI_FILENAME": "C345.TCGA-B0-5094-11A-01D-1421-08.5_gdc_realn.bam.bai",
    "REF_URI": "gs://germline-testing/inputs/GRCh37-lite.fa",
    "REF_IDX_URI": "gs://germline-testing/inputs/GRCh37-lite.fa.fai",
    "REF_DICT_URI": "gs://germline-testing/inputs/GRCh37-lite.dict",
    "REGIONSFILE_URI": "gs://germline-testing/inputs/regions.chr22.list",
    "RUN_GERMLINE_URI": "gs://germline-testing/inputs/run_germline.sh",
    "RUN_VCF2BQ_URI": "gs://germline-testing/inputs/vcf2gg2bq.sh",
    "VCF_URI": <REDACTED>,
    "gdcDl.UUID": "0001801b-54b0-4551-8d7a-d66fb59429bf",
    "gdcDl.DISK_SIZE": "100",
    "gdcDl.GDC_TOKEN": <REDACTED>,
    "samtoolsIdx.DISK_SIZE": "100",
    "samtoolsIdx.BAM": <REDACTED>,
    "samtoolsIdx.BAM_FILENAME": "C345.TCGA-B0-5094-11A-01D-1421-08.5_gdc_realn.bam",
    "samtoolsIdx.BAI_FILENAME": "C345.TCGA-B0-5094-11A-01D-1421-08.5_gdc_realn.bam.bai",
    "gatk.DISK_SIZE": "100",
    "gatk.JAVASIZE": "5",
    "gatk.GSTOREDIR": <REDACTED>,
    "gatk.BAM": <REDACTED>,
    "gatk.BAI": <REDACTED>,
    "gatk.REF": "gs://germline-testing/inputs/GRCh37-lite.fa",
    "gatk.REF_IDX": "gs://germline-testing/inputs/GRCh37-lite.fa.fai",
    "gatk.REF_DICT": "gs://germline-testing/inputs/GRCh37-lite.dict",
    "gatk.REGIONSFILE": "gs://germline-testing/inputs/regions.chr22.list",
    "gatk.RUN_GERMLINE": "gs://germline-testing/inputs/run_germline.sh",
    "varscan.DISK_SIZE": "100",
    "varscan.JAVASIZE": "5",
    "varscan.GSTOREDIR": <REDACTED>",
    "varscan.BAM": <REDACTED>,
    "varscan.BAI": <REDACTED>,
    "varscan.REF": "gs://germline-testing/inputs/GRCh37-lite.fa",
    "varscan.REF_IDX": "gs://germline-testing/inputs/GRCh37-lite.fa.fai",
    "varscan.REF_DICT": "gs://germline-testing/inputs/GRCh37-lite.dict",
    "varscan.REGIONSFILE": "gs://germline-testing/inputs/regions.chr22.list",
    "varscan.RUN_GERMLINE": "gs://germline-testing/inputs/run_germline.sh",
    "pindel.DISK_SIZE": "100",
    "pindel.JAVASIZE": "5",
    "pindel.GSTOREDIR": <REDACTED>,
    "pindel.BAM": <REDACTED>,
    "pindel.BAI": <REDACTED>,
    "pindel.REF": "gs://germline-testing/inputs/GRCh37-lite.fa",
    "pindel.REF_IDX": "gs://germline-testing/inputs/GRCh37-lite.fa.fai",
    "pindel.REF_DICT": "gs://germline-testing/inputs/GRCh37-lite.dict",
    "pindel.REGIONSFILE": "gs://germline-testing/inputs/regions.chr22.list",
    "pindel.RUN_GERMLINE": "gs://germline-testing/inputs/run_germline.sh",
    "combine.DISK_SIZE": "100",
    "combine.JAVASIZE": "5",
    "combine.GSTOREDIR": <REDACTED>,
    "combine.GATK_ALL_SNP": <REDACTED>,
    "combine.GATK_ALL_INDEL": <REDACTED>,
    "combine.VARSCAN_ALL_SNP": <REDACTED>,
    "combine.VARSCAN_ALL_INDEL": <REDACTED>,
    "combine.PINDEL_ALL_INDEL": <REDACTED>,
    "combine.PINDEL_ALL_INDEL_FILTERED": <REDACTED>,
    "combine.RUN_GERMLINE": "gs://germline-testing/inputs/run_germline.sh",
    "vcf2bq.DATASET_NAME": "germlineVC-0001801b-54b0-4551-8d7a-d66fb59429bf",
    "vcf2bq.VCF_URL": "${combine.RESULTS}",
    "vcf2bq.BQ_TABLE_NAME": "germlineVC-0001801b-54b0-4551-8d7a-d66fb59429bf",
    "vcf2bq.BQ_DATASET_NAME": "germlineVC-0001801b-54b0-4551-8d7a-d66fb59429bf",
    "vcf2bq.PROJECT_ID": "isb-cgc-data-02",
    "vcf2bq.RUN_VCF2BQ": "gs://germline-testing/inputs/vcf2gg2bq.sh"
  },
  "outputs": {
    "gdcDl.DOWNLOADED": "${GDC_DATA_DESTINATION}",
    "samtoolsIdx.BAI": "${GDC_DATA_DESTINATION}/${BAI_FILENAME}",
    "gatk.ALL_SNP": "${RESULTS_DESTINATION}/${UUID}/gatk/gatk.all.snp.vcf.gz",
    "gatk.ALL_INDEL": "${RESULTS_DESTINATION}/${UUID}/gatk/gatk.all.indel.vcf.gz",
    "varscan.ALL_SNP": "${RESULTS_DESTINATION}/${UUID}/varscan/varscan.all.snp.vcf.gz",
    "varscan.ALL_INDEL": "${RESULTS_DESTINATION}/${UUID}/varscan/varscan.all.indel.vcf.gz",
    "pindel.ALL_INDEL": "${RESULTS_DESTINATION}/${UUID}/pindel/pindel.all.indel.vcf.gz",
    "pindel.ALL_INDEL_FILTERED": "${RESULTS_DESTINATION}/${UUID}/pindel/pindel.all.indel.filtered.vcf.gz",
    "combine.RESULTS": "${RESULTS_DESTINATION}/${UUID}"
  },
  "logging": {
    "gcsPath": "gs://germline-testing/dockerflow/logs/${workflow.element}/task.log"
  },
  "resources": {},
  "serviceAccount": {
    "email": "default",
    "scopes": [
      "https://www.googleapis.com/auth/genomics",
      "https://www.googleapis.com/auth/compute",
      "https://www.googleapis.com/auth/devstorage.full_control"
    ]
  }
}
Sep 22, 2016 12:32:37 AM com.google.cloud.genomics.dockerflow.transform.DockerDo$StartTask processElement
INFO: Starting task
Sep 22, 2016 12:32:37 AM com.google.cloud.genomics.dockerflow.runner.TaskRunner runTask
INFO: Pipelines API request: {
  "ephemeralPipeline": {
    "name": "gdcDl",
    "description": "Download a file from the GDC given a file UUID and copy the resulting file to GCS",
    "projectId": "isb-cgc-data-02",
    "inputParameters": [
      {
        "name": "UUID"
      },
      {
        "name": "DISK_SIZE"
      },
      {
        "name": "GDC_TOKEN",
        "localCopy": {
          "disk": "gdc-data",
          "path": <REDACTED>
        }
      }
    ],
    "outputParameters": [
      {
        "name": "DOWNLOADED",
        "localCopy": {
          "disk": "gdc-data",
          "path": "1430729090-dockerflow"
        }
      }
    ],
    "resources": {
      "minimumCpuCores": "4",
      "minimumRamGb": "16",
      "preemptible": true,
      "zones": [
        "us-central1-a",
        "us-central1-b",
        "us-central1-c",
        "us-central1-f",
        "us-east1-b",
        "us-east1-c",
        "us-east1-d",
        "us-west1-a",
        "us-west1-b"
      ],
      "disks": [
        {
          "name": "gdc-data",
          "type": "PERSISTENT_SSD",
          "mountPoint": "/gdc-data",
          "sizeGb": "100"
        }
      ]
    },
    "docker": {
      "imageName": "b.gcr.io/isb-cgc-public-docker-images/gdc-client",
      "cmd": "cd /gdc-data \u0026\u0026 gdc-client download -t $GDC_TOKEN ${UUID}"
    }
  },
  "pipelineArgs": {
    "projectId": "isb-cgc-data-02",
    "inputs": {
      "DISK_SIZE": "100",
      "GDC_TOKEN": <REDACTED>,
      "UUID": "0001801b-54b0-4551-8d7a-d66fb59429bf"
    },
    "outputs": {
      "DOWNLOADED": <REDACTED>
    },
    "logging": {
      "gcsPath": <REDACTED>
    },
    "resources": {
      "zones": []
    }
  }
}
Sep 22, 2016 12:32:37 AM com.google.cloud.genomics.dockerflow.runner.TaskRunner callAsyncWebService
INFO: Call Pipelines API.
Exception in thread "main" com.google.cloud.dataflow.sdk.Pipeline$PipelineExecutionException: com.google.cloud.genomics.dockerflow.runner.TaskException: Error starting Docker task gdcDl. Cause: HTTP error: 401 for url https://genomics.googleapis.com/v1alpha2/pipelines:run{
  "error": {
    "code": 401,
    "message": "The request does not have valid authentication credentials.",
    "status": "UNAUTHENTICATED"
  }
}
    at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:186)
    at com.google.cloud.genomics.dockerflow.Dockerflow.main(Dockerflow.java:230)
Caused by: com.google.cloud.genomics.dockerflow.runner.TaskException: Error starting Docker task gdcDl. Cause: HTTP error: 401 for url https://genomics.googleapis.com/v1alpha2/pipelines:run{
  "error": {
    "code": 401,
    "message": "The request does not have valid authentication credentials.",
    "status": "UNAUTHENTICATED"
  }
}
    at com.google.cloud.genomics.dockerflow.transform.DockerDo$StartTask.processElement(DockerDo.java:248)
Caused by: java.io.IOException: HTTP error: 401 for url https://genomics.googleapis.com/v1alpha2/pipelines:run{
  "error": {
    "code": 401,
    "message": "The request does not have valid authentication credentials.",
    "status": "UNAUTHENTICATED"
  }
}
    at com.google.cloud.genomics.dockerflow.util.HttpUtils.doPost(HttpUtils.java:90)
    at com.google.cloud.genomics.dockerflow.runner.TaskRunner.callAsyncWebService(TaskRunner.java:169)
    at com.google.cloud.genomics.dockerflow.runner.TaskRunner.runTask(TaskRunner.java:86)
    at com.google.cloud.genomics.dockerflow.transform.DockerDo$StartTask.processElement(DockerDo.java:241)
    at com.google.cloud.dataflow.sdk.util.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:49)
    at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.processElement(DoFnRunnerBase.java:138)
    at com.google.cloud.dataflow.sdk.transforms.ParDo.evaluateHelper(ParDo.java:1229)
    at com.google.cloud.dataflow.sdk.transforms.ParDo.evaluateSingleHelper(ParDo.java:1098)
    at com.google.cloud.dataflow.sdk.transforms.ParDo.access$300(ParDo.java:457)
    at com.google.cloud.dataflow.sdk.transforms.ParDo$1.evaluate(ParDo.java:1084)
    at com.google.cloud.dataflow.sdk.transforms.ParDo$1.evaluate(ParDo.java:1079)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.visitTransform(DirectPipelineRunner.java:858)
    at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:219)
    at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:215)
    at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:215)
    at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:215)
    at com.google.cloud.dataflow.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:102)
    at com.google.cloud.dataflow.sdk.Pipeline.traverseTopologically(Pipeline.java:259)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.run(DirectPipelineRunner.java:814)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:526)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:96)
    at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:180)
    at com.google.cloud.genomics.dockerflow.Dockerflow.main(Dockerflow.java:230)

It looks like the service account info may not be making it into the individual pipeline requests? I see the serviceAccount block in the WorkflowArgs for the workflow, but not in the pipelineArgs for the individual steps...
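For reference, I expected each step's pipelineArgs to carry the same serviceAccount block that appears in the WorkflowArgs above, something like this (my assumption about where it would go):

  "pipelineArgs": {
    "projectId": "isb-cgc-data-02",
    "serviceAccount": {
      "email": "default",
      "scopes": [
        "https://www.googleapis.com/auth/genomics",
        "https://www.googleapis.com/auth/compute",
        "https://www.googleapis.com/auth/devstorage.full_control"
      ]
    }
  }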

jbingham commented 8 years ago

This looks like a different error from Sean's, since you're able to start the pipeline successfully. That's progress!

I think I've seen the error you got (401, UNAUTHENTICATED) when running with the DirectPipelineRunner: if my laptop closed or went to sleep in the middle of a pipeline run, local Dataflow lost the internet access it needed to make the web service calls.

Have you tried running Dataflow itself in the cloud with the default runner (either omit the --runner option or use --runner=DataflowPipelineRunner)?
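For example, adapting Sean's command line from earlier in the thread (project, workflow file, and workspace are placeholders):

> java -jar target/dockerflow-0.0.1-SNAPSHOT-jar-with-dependencies.jar --project=MY-PROJECT --workflow-file=my-workflow.yaml --workspace=gs://MY-BUCKET/dockerflow --runner=DataflowPipelineRunner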

If yes, can you share the command-line call, or email me privately with enough info that I can try reproducing? Thanks!

jbingham commented 8 years ago

And FYI, Dockerflow uses an access token for API calls and GCS access. The code is:

String token = com.google.api.client.googleapis.auth.oauth2.GoogleCredential.getApplicationDefault().getAccessToken();

You can test that it works with a GCS bucket you have access to, like this:

curl "https://storage.googleapis.com/MY-BUCKET/MY-PATH?access_token=MY-TOKEN"

where MY-BUCKET/MY-PATH is the GCS path (without the gs:// prefix) and MY-TOKEN is obtained with the code above.
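If you want to print a token to paste into that curl command, a minimal standalone sketch (my own, not part of Dockerflow; note that getAccessToken() returns null until the credential has been refreshed) would be:

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;

// Minimal sketch: print an application-default access token for testing GCS access.
public class PrintToken {
  public static void main(String[] args) throws Exception {
    GoogleCredential cred = GoogleCredential.getApplicationDefault();
    cred.refreshToken(); // getAccessToken() returns null until a token is fetched
    System.out.println(cred.getAccessToken());
  }
}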

If this ends up being a common enough problem, I can create a super-simple command-line tool that checks only the access token.

vardaofthevalier commented 8 years ago

Cool, thanks for the info about the access token. I was able to get it working after running gcloud auth login on my instance, both with and without the --runner option. I've actually just submitted my first job using Dataflow in the cloud; the execution graph in the Cloud Console is pretty cool!

jbingham commented 8 years ago

Great, glad that it's working!

The cool execution graph is the whole reason I wrote Dockerflow :-)

seandavi commented 8 years ago

Another gcloud auth login did the trick for me. I agree with @vardaofthevalier, this is really cool. @jbingham, do you know how well this system scales? I have workflows with >10k tasks. Would that be expected to work?

jbingham commented 8 years ago

In theory it should work. Things to know if you're running O(10k) concurrent tasks:

- The Pipelines API will queue your work if you don't have enough cores of quota.

- It's recommended to provide more zones, like "us-*", so you can spread the work out more.

- Dockerflow will abort by default if any of the individual 10k tasks fails to complete. You can pass the flag --abort=false to turn this off; see the example after this list. (I'll add it to the --help message; I just realized it's not documented. It's also not tested yet, so let me know if it doesn't work right.)
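For example, adding the flag to a run (the other flags are taken from Sean's command earlier in the thread; project, workflow file, and workspace are placeholders):

> java -jar target/dockerflow-0.0.1-SNAPSHOT-jar-with-dependencies.jar --project=MY-PROJECT --workflow-file=my-workflow.yaml --workspace=gs://MY-BUCKET/dockerflow --abort=false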

Otherwise, I'm looking forward to hearing how it works for you!

seandavi commented 8 years ago

These are great points, @jbingham. The 10k tasks won't necessarily all be concurrent, due to dependencies, but hundreds may be running simultaneously. Is there a place where I can look at the various quotas that might impact a Dockerflow run, particularly with respect to cores, disk, and memory?

jbingham commented 8 years ago

The main quotas are for Compute Engine (cores, disk, IP addresses). You can check and increase them here: https://console.cloud.google.com/compute/quotas
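You can also inspect them from the command line; for example (MY-PROJECT is a placeholder):

> gcloud compute project-info describe --project MY-PROJECT
> gcloud compute regions describe us-central1

Both commands include a quotas section in their output, covering per-project and per-region limits respectively.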

seandavi commented 8 years ago

Perfect.