(Sorry I didn't see this until now.)
Have you tried running `gcloud auth login` to make sure you have a valid credential?
If yes, your default cloud project might be different from the one where you want to run Dockerflow. To change it, you can run `gcloud init`.
One of those ought to fix it. If not, it may be that Dockerflow can't write to the bucket for your workspace.
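For a quick sanity check of the local credential and default project, something like this works (standard gcloud commands; MY-PROJECT is a placeholder for your project ID):

```sh
# Refresh your user credential.
gcloud auth login

# Confirm the active account and default project.
gcloud config list

# If the project is wrong, re-run the setup flow, or set it directly.
gcloud init
gcloud config set project MY-PROJECT
```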
I'm running into a similar authentication error, but I was under the assumption that I could just use the default service account for authentication, since that's what I've been doing with all of my other pipeline requests so far. Here's the full output I got after attempting to run a particular workflow, with some paths to (potentially) sensitive data redacted:
Sep 22, 2016 12:32:35 AM com.google.cloud.genomics.dockerflow.Dockerflow main
INFO: Local working directory: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/.
Sep 22, 2016 12:32:35 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory pipelineOptions
INFO: Set up Dataflow options
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.Dockerflow main
INFO: Creating workflow from file germline-vc.yaml
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.WorkflowFactory load
INFO: Load workflow: germline-vc.yaml
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: germline-vc.yaml for class class com.google.cloud.genomics.dockerflow.workflow.Workflow
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/gdc-dl.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/samtools-idx.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/gatk.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/varscan.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/pindel.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/combine.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.util.FileUtils parseFile
INFO: Parse file from path: /home/ahahn/bioinformatics-pipelines/germline-vc/dockerflow/vcf2bq.yaml for class class com.google.cloud.genomics.dockerflow.task.TaskDefn
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Initializing dataflow pipeline
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Creating input collection of workflow args
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Merging default workflow args with instance-specific args
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Creating dataflow pipeline for workflow germline-vc
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow germline-vc
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: gdcDl
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow gdcDl
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: gdcDl
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: samtoolsIdx
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow samtoolsIdx
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: samtoolsIdx
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a BRANCH
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Branch count: 3
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Adding branch
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: gatk
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow gatk
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: gatk
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Adding branch
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: varscan
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow varscan
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: varscan
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Adding branch
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: pindel
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow pindel
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: pindel
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: combine
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow combine
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: combine
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Get subgraph
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps subgraph
INFO: Subgraph is a single node named: vcf2bq
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Creating graph for workflow vcf2bq
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.workflow.Workflow$Steps graph
INFO: Add workflow to graph: vcf2bq
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding steps
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: gdcDl
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: samtoolsIdx
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Pipeline splits into branches. Adding branches
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory branches
INFO: Branch count: 3
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory branches
INFO: Adding branch
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: gatk
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory branches
INFO: Adding branch
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: varscan
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory branches
INFO: Adding branch
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: pindel
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory branches
INFO: Merging 3 branches
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: combine
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.dataflow.DataflowFactory dataflow
INFO: Adding task: vcf2bq
Sep 22, 2016 12:32:36 AM com.google.cloud.genomics.dockerflow.Dockerflow main
INFO: Running Dataflow job germline-vc
To cancel the individual Docker steps, run:
> gcloud alpha genomics operations cancel OPERATION_ID
Sep 22, 2016 12:32:36 AM com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner run
INFO: Executing pipeline using the DirectPipelineRunner.
Sep 22, 2016 12:32:37 AM com.google.cloud.genomics.dockerflow.transform.DockerDo$StartTask processElement
INFO: Preparing to start task gdcDl for key args
Sep 22, 2016 12:32:37 AM com.google.cloud.genomics.dockerflow.task.Task resolvePaths
INFO: Resolving paths vs gs://germline-testing/dockerflow/gdcDl/
Sep 22, 2016 12:32:37 AM com.google.cloud.genomics.dockerflow.transform.DockerDo$StartTask processElement
INFO: WorkflowArgs: {
"basePath": "gs://germline-testing/dockerflow/${workflow.element}/",
"runIndex": 0,
"maxTries": 3,
"abortOnError": true,
"deleteIntermediateFiles": false,
"resumeFailedRun": false,
"projectId": "isb-cgc-data-02",
"inputs": {
"UUID": "0001801b-54b0-4551-8d7a-d66fb59429bf",
"JAVASIZE": "5",
"DISK_SIZE": "100",
"SERVICE_ACCT": "REQUIRED",
"GDC_DATA_DESTINATION": <REDACTED>,
"RESULTS_DESTINATION": <REDACTED>,
"PROJECT_ID": "isb-cgc-data-02",
"GDC_TOKEN_URI": <REDACTED>,
"BAM_FILENAME": "C345.TCGA-B0-5094-11A-01D-1421-08.5_gdc_realn.bam",
"BAI_FILENAME": "C345.TCGA-B0-5094-11A-01D-1421-08.5_gdc_realn.bam.bai",
"REF_URI": "gs://germline-testing/inputs/GRCh37-lite.fa",
"REF_IDX_URI": "gs://germline-testing/inputs/GRCh37-lite.fa.fai",
"REF_DICT_URI": "gs://germline-testing/inputs/GRCh37-lite.dict",
"REGIONSFILE_URI": "gs://germline-testing/inputs/regions.chr22.list",
"RUN_GERMLINE_URI": "gs://germline-testing/inputs/run_germline.sh",
"RUN_VCF2BQ_URI": "gs://germline-testing/inputs/vcf2gg2bq.sh",
"VCF_URI": <REDACTED>,
"gdcDl.UUID": "0001801b-54b0-4551-8d7a-d66fb59429bf",
"gdcDl.DISK_SIZE": "100",
"gdcDl.GDC_TOKEN": <REDACTED>,
"samtoolsIdx.DISK_SIZE": "100",
"samtoolsIdx.BAM": <REDACTED>,
"samtoolsIdx.BAM_FILENAME": "C345.TCGA-B0-5094-11A-01D-1421-08.5_gdc_realn.bam",
"samtoolsIdx.BAI_FILENAME": "C345.TCGA-B0-5094-11A-01D-1421-08.5_gdc_realn.bam.bai",
"gatk.DISK_SIZE": "100",
"gatk.JAVASIZE": "5",
"gatk.GSTOREDIR": <REDACTED>,
"gatk.BAM": <REDACTED>,
"gatk.BAI": <REDACTED>,
"gatk.REF": "gs://germline-testing/inputs/GRCh37-lite.fa",
"gatk.REF_IDX": "gs://germline-testing/inputs/GRCh37-lite.fa.fai",
"gatk.REF_DICT": "gs://germline-testing/inputs/GRCh37-lite.dict",
"gatk.REGIONSFILE": "gs://germline-testing/inputs/regions.chr22.list",
"gatk.RUN_GERMLINE": "gs://germline-testing/inputs/run_germline.sh",
"varscan.DISK_SIZE": "100",
"varscan.JAVASIZE": "5",
"varscan.GSTOREDIR": <REDACTED>",
"varscan.BAM": <REDACTED>,
"varscan.BAI": <REDACTED>,
"varscan.REF": "gs://germline-testing/inputs/GRCh37-lite.fa",
"varscan.REF_IDX": "gs://germline-testing/inputs/GRCh37-lite.fa.fai",
"varscan.REF_DICT": "gs://germline-testing/inputs/GRCh37-lite.dict",
"varscan.REGIONSFILE": "gs://germline-testing/inputs/regions.chr22.list",
"varscan.RUN_GERMLINE": "gs://germline-testing/inputs/run_germline.sh",
"pindel.DISK_SIZE": "100",
"pindel.JAVASIZE": "5",
"pindel.GSTOREDIR": <REDACTED>,
"pindel.BAM": <REDACTED>,
"pindel.BAI": <REDACTED>,
"pindel.REF": "gs://germline-testing/inputs/GRCh37-lite.fa",
"pindel.REF_IDX": "gs://germline-testing/inputs/GRCh37-lite.fa.fai",
"pindel.REF_DICT": "gs://germline-testing/inputs/GRCh37-lite.dict",
"pindel.REGIONSFILE": "gs://germline-testing/inputs/regions.chr22.list",
"pindel.RUN_GERMLINE": "gs://germline-testing/inputs/run_germline.sh",
"combine.DISK_SIZE": "100",
"combine.JAVASIZE": "5",
"combine.GSTOREDIR": <REDACTED>,
"combine.GATK_ALL_SNP": <REDACTED>,
"combine.GATK_ALL_INDEL": <REDACTED>,
"combine.VARSCAN_ALL_SNP": <REDACTED>,
"combine.VARSCAN_ALL_INDEL": <REDACTED>,
"combine.PINDEL_ALL_INDEL": <REDACTED>,
"combine.PINDEL_ALL_INDEL_FILTERED": <REDACTED>,
"combine.RUN_GERMLINE": "gs://germline-testing/inputs/run_germline.sh",
"vcf2bq.DATASET_NAME": "germlineVC-0001801b-54b0-4551-8d7a-d66fb59429bf",
"vcf2bq.VCF_URL": "${combine.RESULTS}",
"vcf2bq.BQ_TABLE_NAME": "germlineVC-0001801b-54b0-4551-8d7a-d66fb59429bf",
"vcf2bq.BQ_DATASET_NAME": "germlineVC-0001801b-54b0-4551-8d7a-d66fb59429bf",
"vcf2bq.PROJECT_ID": "isb-cgc-data-02",
"vcf2bq.RUN_VCF2BQ": "gs://germline-testing/inputs/vcf2gg2bq.sh"
},
"outputs": {
"gdcDl.DOWNLOADED": "${GDC_DATA_DESTINATION}",
"samtoolsIdx.BAI": "${GDC_DATA_DESTINATION}/${BAI_FILENAME}",
"gatk.ALL_SNP": "${RESULTS_DESTINATION}/${UUID}/gatk/gatk.all.snp.vcf.gz",
"gatk.ALL_INDEL": "${RESULTS_DESTINATION}/${UUID}/gatk/gatk.all.indel.vcf.gz",
"varscan.ALL_SNP": "${RESULTS_DESTINATION}/${UUID}/varscan/varscan.all.snp.vcf.gz",
"varscan.ALL_INDEL": "${RESULTS_DESTINATION}/${UUID}/varscan/varscan.all.indel.vcf.gz",
"pindel.ALL_INDEL": "${RESULTS_DESTINATION}/${UUID}/pindel/pindel.all.indel.vcf.gz",
"pindel.ALL_INDEL_FILTERED": "${RESULTS_DESTINATION}/${UUID}/pindel/pindel.all.indel.filtered.vcf.gz",
"combine.RESULTS": "${RESULTS_DESTINATION}/${UUID}"
},
"logging": {
"gcsPath": "gs://germline-testing/dockerflow/logs/${workflow.element}/task.log"
},
"resources": {},
"serviceAccount": {
"email": "default",
"scopes": [
"https://www.googleapis.com/auth/genomics",
"https://www.googleapis.com/auth/compute",
"https://www.googleapis.com/auth/devstorage.full_control"
]
}
}
Sep 22, 2016 12:32:37 AM com.google.cloud.genomics.dockerflow.transform.DockerDo$StartTask processElement
INFO: Starting task
Sep 22, 2016 12:32:37 AM com.google.cloud.genomics.dockerflow.runner.TaskRunner runTask
INFO: Pipelines API request: {
"ephemeralPipeline": {
"name": "gdcDl",
"description": "Download a file from the GDC given a file UUID and copy the resulting file to GCS",
"projectId": "isb-cgc-data-02",
"inputParameters": [
{
"name": "UUID"
},
{
"name": "DISK_SIZE"
},
{
"name": "GDC_TOKEN",
"localCopy": {
"disk": "gdc-data",
"path": <REDACTED>
}
}
],
"outputParameters": [
{
"name": "DOWNLOADED",
"localCopy": {
"disk": "gdc-data",
"path": "1430729090-dockerflow"
}
}
],
"resources": {
"minimumCpuCores": "4",
"minimumRamGb": "16",
"preemptible": true,
"zones": [
"us-central1-a",
"us-central1-b",
"us-central1-c",
"us-central1-f",
"us-east1-b",
"us-east1-c",
"us-east1-d",
"us-west1-a",
"us-west1-b"
],
"disks": [
{
"name": "gdc-data",
"type": "PERSISTENT_SSD",
"mountPoint": "/gdc-data",
"sizeGb": "100"
}
]
},
"docker": {
"imageName": "b.gcr.io/isb-cgc-public-docker-images/gdc-client",
"cmd": "cd /gdc-data \u0026\u0026 gdc-client download -t $GDC_TOKEN ${UUID}"
}
},
"pipelineArgs": {
"projectId": "isb-cgc-data-02",
"inputs": {
"DISK_SIZE": "100",
"GDC_TOKEN": <REDACTED>,
"UUID": "0001801b-54b0-4551-8d7a-d66fb59429bf"
},
"outputs": {
"DOWNLOADED": <REDACTED>
},
"logging": {
"gcsPath": <REDACTED>
},
"resources": {
"zones": []
}
}
}
Sep 22, 2016 12:32:37 AM com.google.cloud.genomics.dockerflow.runner.TaskRunner callAsyncWebService
INFO: Call Pipelines API.
Exception in thread "main" com.google.cloud.dataflow.sdk.Pipeline$PipelineExecutionException: com.google.cloud.genomics.dockerflow.runner.TaskException: Error starting Docker task gdcDl. Cause: HTTP error: 401 for url https://genomics.googleapis.com/v1alpha2/pipelines:run{
"error": {
"code": 401,
"message": "The request does not have valid authentication credentials.",
"status": "UNAUTHENTICATED"
}
}
at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:186)
at com.google.cloud.genomics.dockerflow.Dockerflow.main(Dockerflow.java:230)
Caused by: com.google.cloud.genomics.dockerflow.runner.TaskException: Error starting Docker task gdcDl. Cause: HTTP error: 401 for url https://genomics.googleapis.com/v1alpha2/pipelines:run{
"error": {
"code": 401,
"message": "The request does not have valid authentication credentials.",
"status": "UNAUTHENTICATED"
}
}
at com.google.cloud.genomics.dockerflow.transform.DockerDo$StartTask.processElement(DockerDo.java:248)
Caused by: java.io.IOException: HTTP error: 401 for url https://genomics.googleapis.com/v1alpha2/pipelines:run{
"error": {
"code": 401,
"message": "The request does not have valid authentication credentials.",
"status": "UNAUTHENTICATED"
}
}
at com.google.cloud.genomics.dockerflow.util.HttpUtils.doPost(HttpUtils.java:90)
at com.google.cloud.genomics.dockerflow.runner.TaskRunner.callAsyncWebService(TaskRunner.java:169)
at com.google.cloud.genomics.dockerflow.runner.TaskRunner.runTask(TaskRunner.java:86)
at com.google.cloud.genomics.dockerflow.transform.DockerDo$StartTask.processElement(DockerDo.java:241)
at com.google.cloud.dataflow.sdk.util.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:49)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.processElement(DoFnRunnerBase.java:138)
at com.google.cloud.dataflow.sdk.transforms.ParDo.evaluateHelper(ParDo.java:1229)
at com.google.cloud.dataflow.sdk.transforms.ParDo.evaluateSingleHelper(ParDo.java:1098)
at com.google.cloud.dataflow.sdk.transforms.ParDo.access$300(ParDo.java:457)
at com.google.cloud.dataflow.sdk.transforms.ParDo$1.evaluate(ParDo.java:1084)
at com.google.cloud.dataflow.sdk.transforms.ParDo$1.evaluate(ParDo.java:1079)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.visitTransform(DirectPipelineRunner.java:858)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:219)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:215)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:215)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:215)
at com.google.cloud.dataflow.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:102)
at com.google.cloud.dataflow.sdk.Pipeline.traverseTopologically(Pipeline.java:259)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.run(DirectPipelineRunner.java:814)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:526)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:96)
at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:180)
at com.google.cloud.genomics.dockerflow.Dockerflow.main(Dockerflow.java:230)
It looks like maybe the service account info is not making it into the individual pipeline requests? I see the serviceAccount info in the workflow args above, but not in the pipelineArgs for the individual steps...
This looks like a different error from Sean's, since you're able to start the pipeline successfully. That's progress!
I think I've seen the error you got (401, UNAUTHENTICATED) with DirectPipelineRunner when my laptop went to sleep in the middle of a pipeline run and local Dataflow lost the internet access it needed for the web service calls.
Have you tried running Dataflow itself in the cloud with the default runner (either omit the `--runner` option or use `--runner=DataflowPipelineRunner`)?
If yes, can you share the command-line call, or email me privately with enough info that I can try reproducing? Thanks!
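For reference, a cloud-runner invocation would look something like the following. Treat the flag names as a sketch from the Dockerflow README rather than gospel, and double-check against `--help`:

```sh
# Sketch only: run the workflow with Dataflow executing in the cloud rather than locally.
# Flag names are my recollection of the Dockerflow README; verify with dockerflow --help.
dockerflow \
  --project=MY-PROJECT \
  --workflow-file=germline-vc.yaml \
  --workspace=gs://MY-BUCKET/dockerflow \
  --runner=DataflowPipelineRunner
```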
FYI, Dockerflow uses an access token for API calls and GCS access. The code is this:

```java
String token = com.google.api.client.googleapis.auth.oauth2.GoogleCredential
    .getApplicationDefault().getAccessToken();
```

You can test that it works with a GCS bucket you have access to, like this:

```sh
curl https://storage.googleapis.com/MY-BUCKET/MY-PATH?access_token=MY-TOKEN
```

where MY-BUCKET/MY-PATH is the GCS path (without the gs:// prefix) and MY-TOKEN is obtained with the code above.
If this ends up being a common enough problem, I can create a super-simple command-line tool that checks only the access token.
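In the meantime, here's a rough shell version of that check, assuming your Cloud SDK is recent enough to have `gcloud auth application-default print-access-token` (which returns the same application-default token as the Java snippet above):

```sh
# Fetch the application-default access token.
TOKEN=$(gcloud auth application-default print-access-token)

# Try to read an object you own, printing only the HTTP status code:
# 200 means the token works for GCS; 401/403 points at the credential.
curl -s -o /dev/null -w "%{http_code}\n" \
  "https://storage.googleapis.com/MY-BUCKET/MY-PATH?access_token=${TOKEN}"
```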
Cool, thanks for the info about the access token. I was able to get it working after running `gcloud auth login` on my instance, with and without the `--runner` option. I've actually just submitted my first job using Dataflow in the cloud -- the execution graph in the Cloud Console is pretty cool!
Great, glad that it's working!
The cool execution graph is the whole reason I wrote Dockerflow :-)
Another `gcloud auth login` did the trick for me. I agree with @vardaofthevalier, this is really cool. @jbingham, do you know the scale of this system? I have workflows with >10k tasks. Would this be expected to work?
In theory it should work. Things to know if you're running O(10k) concurrent tasks:
- The Pipelines API will queue your work if you don't have enough cores of quota.
- It's recommended to provide more zones, like "us-*", so you can spread the work out more.
- Dockerflow will abort by default if any one of the 10k individual tasks fails to complete. You can pass the flag `--abort=false` to turn this off. (I'll add it to the `--help` message; I just realized it's not documented. It's also not tested yet, so let me know if it doesn't work right.)
Otherwise, I'm looking forward to hearing how it works for you!
These are great points, @jbingham. The 10k tasks won't necessarily all be concurrent, due to dependencies, but hundreds may be running simultaneously. Is there a place where I can look at the various quotas that might impact a Dockerflow run, particularly with respect to cores, disk, and memory?
The main quotas are for Compute Engine (cores, disk, IP addresses). You can check and increase them here: https://console.cloud.google.com/compute/quotas
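If you'd rather check from the command line, something along these lines shows the same numbers (the region is just an example):

```sh
# Project-wide quotas, including global resources.
gcloud compute project-info describe --project MY-PROJECT

# Per-region quotas: CPUS, DISKS_TOTAL_GB, IN_USE_ADDRESSES, etc.
gcloud compute regions describe us-central1
```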
Perfect.
Thanks for the new project! This looks quite interesting. I wanted to give it a quick test and ran into the following problem. I have activated the Cloud Dataflow API (and the others), and I thought that would allow me to run Dockerflow workflows. What did I miss?