grailbio / reflow

A language and runtime for distributed, incremental data processing in the cloud
Apache License 2.0

"Initializing" takes forever #103

Open olgabot opened 5 years ago

olgabot commented 5 years ago

Hello,

I'm running this workflow:

```golang
param (
	// S3 path to 10x folder
	tenx string
	// Full s3 file location to put the sourmash signature
	output string
	// Size of kmer(s) to use
	ksizes = "21,33,51"
	// choose number of hashes as 1/scaled of input k-mers
	scaled = 0
	// Number of kmer hashes to use
	num_hashes = 1000
	// Calculate protein signature
	protein = true
	// Calculate DNA signature
	dna = true
	// Number of processes
	processes = 8
	// Name of the bam file in the tenx folder
	BAM_FILENAME = "possorted_genome_bam.bam"
	// Name of the single-column barcodes file in the tenx folder
	BARCODES = "barcodes.tsv"
)

// Instantiate the system modules "files" (system modules begin
// with $), assigning its instance to the "files" identifier. To
// view the documentation for this module, run "reflow doc
// $/files".
val files = make("$/files")
val dirs = make("$/dirs")
sourmash := make("./sourmash.rf")

// bam2fastx Docker image
val bam2fastx = "czbiohub/bam2fastx"

// Compute a minhash signature for a sample
@requires(cpu := 4, mem := 16*GiB, disk := 4*GiB)
func TenXBamToFasta(tenx dir) = {
	outdir := exec(image := bam2fastx) (output dir) {"
		bam2fastx fasta {{tenx}} --all-cells-in-one-file --output {{output}}
	"}

	val (fasta, _) = dirs.Pick(outdir, "*.fasta")

	// Return single fasta
	fasta
}

// Instantiate Go system module "strings"
val strings = make("$/strings")

@requires(cpu := 1, mem := 16*GiB)
val Main = {
	val tenx_folder = dir(tenx)
	val (bam, _) = dirs.Pick(tenx_folder, "*.bam")
	val (bai, _) = dirs.Pick(tenx_folder, "*.bai")
	val (barcodes, _) = dirs.Pick(tenx_folder, BARCODES)

	val renamed = map([(BAM_FILENAME, bam), (BAM_FILENAME + ".bai", bai), (BARCODES, barcodes)])
	val minimal_tenx_dir = dirs.Make(renamed)

	fasta := TenXBamToFasta(minimal_tenx_dir)
	reads := [fasta]
	singleton := false
	sourmash_sketch := sourmash.Compute(reads, scaled, ksizes, protein, dna, singleton)
	files.Copy(sourmash_sketch, output)
}
```

The data gets transferred just fine, but then the reflow run command claims the job is running while the reflow ps command shows it is initializing. Who is right? I've been stuck at the "initializing" phase for many hours for this file; this is just a fresh example to show the inputs.

Below is a screenshot of the output from this command:

reflow -log=debug -cache=off run -trace /home/olga/code/kmer-hashing/sourmash/maca/10x_spleen_kidney/../../../reflow/sourmash_compute_10x.rf -tenx s3://czbiohub-maca/10x_data/10X_P4_7 -output s3://olgabot-maca/10x/sourmash_compute/ksizes=21,27,33,51_num_hashes=5000/Spleen_10X_P4_7.sig -ksizes 21,27,33,51 -num_hashes 5000
[Screenshot: 2019-02-08, 8:20:07 AM]

Thank you! Warmest, Olga

olgabot commented 5 years ago

Update: this is still "initializing" ...

[Screenshot: 2019-02-12, 7:42:13 AM]

olgabot commented 5 years ago

Here's the end of that text:

2019/02/12 07:32:10 ec2cluster: pending{}
2019/02/12 07:33:10 ec2cluster: pending{}
2019/02/12 07:34:10 ec2cluster: pending{}
2019/02/12 07:35:10 ec2cluster: pending{}
2019/02/12 07:36:10 ec2cluster: pending{}
2019/02/12 07:37:10 ec2cluster: pending{}
2019/02/12 07:38:10 ec2cluster: pending{}
2019/02/12 07:39:10 ec2cluster: pending{}
2019/02/12 07:40:10 ec2cluster: pending{}
2019/02/12 07:41:10 ec2cluster: pending{}
2019/02/12 07:42:10 ec2cluster: pending{}
2019/02/12 07:43:10 ec2cluster: pending{}
2019/02/12 07:44:10 ec2cluster: pending{}
2019/02/12 07:45:10 ec2cluster: pending{}
2019/02/12 07:46:10 ec2cluster: pending{}
2019/02/12 07:47:10 ec2cluster: pending{}
2019/02/12 07:48:10 ec2cluster: pending{}
2019/02/12 07:49:10 ec2cluster: pending{}
2019/02/12 07:50:10 ec2cluster: pending{}
2019/02/12 07:51:10 ec2cluster: pending{}
2019/02/12 07:52:10 ec2cluster: pending{}
2019/02/12 07:53:10 ec2cluster: pending{}
2019/02/12 07:54:10 ec2cluster: pending{}
2019/02/12 07:55:10 ec2cluster: pending{}
ec2cluster: 1 instances: r4.xlarge:1 (<=$0.3/hr), total{mem:28.4GiB cpu:4 disk:2.4TiB intel_avx:4 intel_avx2:4}, waiting{}, pend
48cfcb94: elapsed: 96h0m, running:1, completed: 1/3
  sourmash_compute_10x.TenXBamToFasta.outdir:  exec czbiohub/bam2fastx bam2fastx fasta {{tenx}} --al..-one-file --outp  96h6m18s
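
For anyone triaging a similar debug log, the repeated `ec2cluster: pending{}` lines above can be summarized rather than read by eye. The following is a minimal Python sketch, not part of reflow: the names `pending_window` and `PENDING_LINE` are hypothetical, and the timestamp layout is assumed only from the lines pasted in this comment. It reports the first and last pending timestamp and how many such lines appeared; it makes no claim about what `pending{}` means internally.

```python
import re
from datetime import datetime

# Matches lines such as "2019/02/12 07:32:10 ec2cluster: pending{}".
# The layout is assumed from the log excerpt in this thread.
PENDING_LINE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ec2cluster: pending\{\}\s*$"
)


def pending_window(log_text):
    """Return (first, last, count) for the 'ec2cluster: pending{}' lines.

    Returns None when the log contains no such line.
    """
    stamps = []
    for line in log_text.splitlines():
        match = PENDING_LINE.match(line.strip())
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S"))
    if not stamps:
        return None
    return min(stamps), max(stamps), len(stamps)


# Three lines copied from the excerpt above, used as sample input.
sample = """\
2019/02/12 07:32:10 ec2cluster: pending{}
2019/02/12 07:33:10 ec2cluster: pending{}
2019/02/12 07:55:10 ec2cluster: pending{}
"""

first, last, count = pending_window(sample)
print(f"{count} pending reports between {first} and {last} ({last - first})")
```

Pasting the full log into `sample` gives the span over which the cluster kept reporting an empty pending set, which may be useful to include when reporting how long a run has been stuck.
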
prasadgopal commented 5 years ago

What version of reflow are you running? What command did you run? We need to make a new GitHub release because the last one (0.6.8) was back in August. If you are building your own binary, please let me know how you built it.


olgabot commented 5 years ago

There was a bug with 0.6.8 so I don't use it. This is 0.6.7:

 reflow version
0.6.7 (go1.10)