grailbio / reflow

A language and runtime for distributed, incremental data processing in the cloud
Apache License 2.0

cache.Transfer: unavailable: context canceled #61

Closed olgabot closed 5 years ago

olgabot commented 6 years ago

Hello! I've been trying to run this workflow for some time without success. At first I was running out of disk space; I increased the disk space, but that doesn't seem to have helped. The run needs to transfer 1000 files of tens of megabytes each, and the job seems to get stuck in the transfer. An excerpt is below.

Here is a gist with all the files associated with the run.

~/code/kmer-hashing/sourmash/compare$ reflow run ../../reflow/sourmash_compare.rf -signatures=s3://olgabot-maca/facs/sourmash_dna-only_trim=true_scaled=100 -ksize=3 -sequence_to_compare=dna -output=s3://olgabot-maca/facs/sourmash_compare/dna-only_trim=true_scaled=100_ksize=3
reflow: run ID: 6506e7ca
reflow: cache transfer flow afa76d14 state FlowTransfer {mem:1.0GiB cpu:1 disk:0B} exec image czbiohub/kmer-hashing cmd "\n\t\t/opt/conda/bin/sourmash compare \\\n            --ksize 3 \\\n            --force \\\n            --dna \\\n            --output %s \\\n            --traverse-directory \\\n            %s\n    " deps a38a9f51 error: context canceled
reflow: retrying error cache.Transfer: unavailable: context canceled
[the two lines above repeat 10 more times]
too many tries:
    cache.Transfer: unavailable: context canceled

Do you know what may be happening? Thank you! Warmest, Olga

mariusae commented 6 years ago

(I mentioned this in Gitter also, but pasting here for posterity.)

"Context canceled" is usually (always?) a downstream error. The error reporting should be better here, but you probably have an error further up in your flow (see the logs) that cascades down.

olgabot commented 6 years ago

Thanks! I increased the program's memory requirement to 16 GiB and it worked fine. There weren't any OOM errors in the logs that I could see, so it was more of a guess.