How to debug failed "runbatch" jobs

grailbio / reflow

A language and runtime for distributed, incremental data processing in the cloud

Apache License 2.0

How to debug failed "runbatch" jobs #48

Closed olgabot closed 6 years ago

olgabot commented 6 years ago

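Hello, I ran this workflow (https://github.com/czbiohub/reflow-workflows/blob/master/sourmash.rf) on this batch (https://github.com/czbiohub/kmer-hashing/tree/master/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison; 52k samples, stderr + stdout are saved there too) and the job failed on my m4.large EC2 instance. Some jobs finished (below) but most didn't.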

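[image: screen shot 2018-06-26 at 2 45 37 pm] https://user-images.githubusercontent.com/806256/41941076-c2c49878-794f-11e8-8b0b-2739209f6ac2.png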

However when I check reflow listbatch, there's still a bunch of jobs "waiting" and I'm not sure how to terminate or restart them.

 ✘  Tue 26 Jun - 21:39  ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison   origin ☊ master ✔ 6☀ 
 ubuntu@ip-172-31-42-179  reflow listbatch
A1-B002427-3_39_F-1-1_trim=false_scaled=100     12af8e26 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1000    9568d8b3 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1100    6bcd3d1a waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1200    8508d407 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1300    c38167a2 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1400    297c2a9f waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1500    6daec549 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1600    345e906f waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1700    1d2d96df waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1800    c2031915 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1900    8aefd12a waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=200     273354de waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=2000    d026b17e waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=2500    9e06336f waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=300     dee9cefc waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=3000    0efb3c07 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=3500    61df5887 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=400     818f00a6 waiting

Looking into the documentation, it seems that reflow repair is what I should use here but I don't understand what path is here. I thought it was the path to the current batch folder but none of what I tried worked:

 ✘  Tue 26 Jun - 21:40  ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison   origin ☊ master ✔ 6☀ 
 ubuntu@ip-172-31-42-179  reflow repair -help
usage: reflow repair -batch samples.csv path | repair path [args]

Repair performs cache repair by cache-assisted pseudo-evaluation of
the provided reflow program. The program (evaluated with its arguments)
is evaluated by performing logical cache lookups in place of executor
evaluation. When values are missing and are immediately computable,
they are computed. Flow nodes that are successfully computed this way
are written back to the cache with all available keys. Repair is used to 
perform forward-migration of caching scheme, or back-filling when 
evaluations strategies change (e.g., bottomup vs. topdown evaluation).

Repair accepts command line arguments as in "reflow run" or parameters
supplied via a CSV batch file as in "reflow runbatch".

Flags:
  -batch string
        batch file to process
  -getconcurrency int
        number of concurrent assoc gets (default 50)
  -help
        display subcommand help
  -writebackconcurrency int
        number of concurrent writeback threads (default 20)

 Tue 26 Jun - 21:40  ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison   origin ☊ master ✔ 6☀ 
 ubuntu@ip-172-31-42-179  reflow repair -batch samples.csv .
unknown file extension "."

 ✘  Tue 26 Jun - 21:40  ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison   origin ☊ master ✔ 6☀ 
 ubuntu@ip-172-31-42-179  reflow repair -batch samples.csv  
usage: reflow repair -batch samples.csv path | repair path [args]
Flags:
  -batch string
        batch file to process
  -getconcurrency int
        number of concurrent assoc gets (default 50)
  -help
        display subcommand help
  -writebackconcurrency int
        number of concurrent writeback threads (default 20)

 ✘  Tue 26 Jun - 21:40  ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison   origin ☊ master ✔ 6☀ 
 ubuntu@ip-172-31-42-179  reflow repair .                  
unknown file extension "."

Can you let me know how to proceed? Thank you! Warmest, Olga

prasadgopal commented 6 years ago

All you need to do is run: reflow runbatch -retry

repair is used to repair the cache: "Repair is used to perform forward-migration of caching scheme, or back-filling when evaluations strategies change (e.g., bottomup vs. topdown evaluation)."

olgabot commented 6 years ago

Thanks, I see that now! What is the difference between -retry and -reset? Is -reset removing the cache for a run?

prasadgopal commented 6 years ago

-retry will retry the samples that have failed and run the ones that never ran. -reset will reset the state of the batch from the previous runs and start fresh. This means that we will try to reevaluate all the samples in the batch. However, all intermediate results that were already computed and stored in the cache from previous runs will be reused.
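The difference between the two flags can be sketched as a small selection function. This is only an illustrative model of the behavior described above, not Reflow's implementation: the function name `samples_to_run` and the state name "failed" are assumptions (the thread itself only shows "waiting" and "done").

```python
from typing import Dict, List


def samples_to_run(states: Dict[str, str], mode: str) -> List[str]:
    """Return the sample IDs that would be evaluated under each flag.

    Illustrative model only. `states` maps a sample ID to its batch state;
    "failed" is a hypothetical state name used for the example.
    """
    if mode == "reset":
        # -reset: previous batch state is discarded, so every sample is
        # evaluated again (cached intermediate results can still be reused).
        return sorted(states)
    if mode == "retry":
        # -retry: keep finished samples, rerun failed ones and run the rest.
        return sorted(s for s, state in states.items() if state != "done")
    raise ValueError(f"unknown mode: {mode!r}")
```

Under this model, -retry on a batch with one finished sample leaves that sample alone, while -reset schedules it again and relies on the cache to avoid recomputing its intermediate results.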

olgabot commented 6 years ago

Thanks @prasadgopal! My batches are failing within a few hours, so I suspect I'm doing something wrong overall and should fix my batch setup. How can I check how quickly the batch is dying, and what's contributing to the failure?

prasadgopal commented 6 years ago

reflow info should show more info about each sample.

From a toy example I ran:

reflow listbatch
1 c513ffff done
2 fc60e67a done
3 eead713f done
4 36281369 done
5 10eb2f7b done

reflow info c513ffff
c513ffffc99c76910a7f2289142fcc60a7caba39f3fafe75c29952630a4d66bf (run)
    time:    Thu Jun 28 09:54:10 2018
    program: graph.rf
    params:
        de: 1
    phase:     Done
    alloc:     ec2-54-189-99-164.us-west-2.compute.amazonaws.com:9000/6297baca77d89e0d
    resources: {mem:3.5GiB cpu:1 disk:2.4TiB intel_avx:1}
    result:    val<>
    log:       /Users/pgopal/.reflow/runs/c513ffffc99c76910a7f2289142fcc60a7caba39f3fafe75c29952630a4d66bf.execlog
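With tens of thousands of samples, a per-state tally of the listbatch output is easier to read than the raw listing. The sketch below assumes each line has the whitespace-separated form shown above (name, short run ID, state); `tally_states` is a hypothetical helper, not part of Reflow.

```python
from collections import Counter
from typing import Iterable


def tally_states(lines: Iterable[str]) -> Counter:
    """Count batch samples per state from `reflow listbatch` style lines.

    Assumes each line is "<name> <short-id> <state>"; lines with fewer
    than three fields (blank lines, prompts) are skipped.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        counts[fields[-1]] += 1  # the state is the last column
    return counts
```

One way to use it would be to save the listing to a file (for example with `reflow listbatch > batch.txt`) and pass the open file to `tally_states`.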

olgabot commented 6 years ago

Thanks @prasadgopal! For these jobs, I get a memory error when I run reflow info on a particular job. I'm running 52k jobs on an m4.large machine, which has 8 GB of RAM. Should I be on a machine with more RAM?

 ✘  Thu 28 Jun - 18:49  ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison   origin ☊ master ✔ 4☀ 
 ubuntu@ip-172-31-42-179  reflow listbatch | head
A1-B002427-3_39_F-1-1_trim=false_scaled=100     9a2b6a27 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1000    cd1f2346 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1100    f72ab8bf waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1200    0f2ed555 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1300    baa7ca80 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1400    396fb5b8 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1500    86adda8c waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1600    7eb6ac08 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1700    c4767cd0 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1800    56ed2cf9 waiting

 Thu 28 Jun - 19:01  ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison   origin ☊ master ✔ 4☀ 
 ubuntu@ip-172-31-42-179  reflow info 9a2b6a27
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xbb8ade]

goroutine 1 [running]:
github.com/grailbio/reflow/tool.(*Cmd).printCacheInfo(0xc420418000, 0xf67200, 0xc4200af440, 0xf60980, 0xc420192500, 0x5, 0x276a2b9a, 0x0, 0x0, 0x0, ...)
        /home/ubuntu/gocode/src/github.com/grailbio/reflow/tool/info.go:202 +0xce
github.com/grailbio/reflow/tool.(*Cmd).info(0xc420418000, 0xf67180, 0xc42041a500, 0xc420090080, 0x1, 0x1)
        /home/ubuntu/gocode/src/github.com/grailbio/reflow/tool/info.go:62 +0x4f2
github.com/grailbio/reflow/tool.(*Cmd).Main(0xc420418000)
        /home/ubuntu/gocode/src/github.com/grailbio/reflow/tool/main.go:298 +0x9fb
main.main()
        /home/ubuntu/gocode/src/github.com/grailbio/reflow/cmd/reflow/main.go:62 +0x3fc

prasadgopal commented 6 years ago

The stack trace you are seeing is some bug/inconsistency in reflow code. Can you tell me if ls -l $HOME/.reflow/runs/9a2b6a27* returns something?

olgabot commented 6 years ago

Yeah it shows this:

 ✘  Thu 28 Jun - 20:23  ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison   origin ☊ master ✔ 4☀ 
 ubuntu@ip-172-31-42-179  ls -l $HOME/.reflow/runs/9a2b6a27*
-rw-rw-r-- 1 ubuntu ubuntu 0 Jun 28 18:46 /home/ubuntu/.reflow/runs/9a2b6a27a3daf4956489a6db3ce392c993d2b7b682796838b9b634c7a88eb2f1.execlog
-rwxrwxr-x 1 ubuntu ubuntu 0 Jun 28 18:43 /home/ubuntu/.reflow/runs/9a2b6a27a3daf4956489a6db3ce392c993d2b7b682796838b9b634c7a88eb2f1.lock

Is this normal?

olgabot commented 6 years ago

Also that folder has TONS of files!!!

 Thu 28 Jun - 20:23  ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison   origin ☊ master ✔ 4☀ 
 ubuntu@ip-172-31-42-179  ls -1 $HOME/.reflow/runs| wc -l
450993

prasadgopal commented 6 years ago

It is missing a .json file, which reflow info looks for to find the state of the run. I'll need to dig a little in the code to see why the .json isn't being written out.

Yes, the runs dir has a folder for each run.
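To see how many runs are in the same situation, one option is to scan the runs directory for run IDs that have no .json file. This is a sketch under an assumption: the state file is taken to be named `<run-id>.json`, by analogy with the `.execlog` and `.lock` files listed above. `runs_missing_json` is a hypothetical helper.

```python
import os
from collections import defaultdict
from typing import Dict, List, Set


def runs_missing_json(runs_dir: str) -> List[str]:
    """List run IDs in runs_dir that have files but no <run-id>.json.

    The <run-id>.json naming is an assumption based on the .execlog and
    .lock files shown in the thread.
    """
    exts_by_run: Dict[str, Set[str]] = defaultdict(set)
    # os.scandir streams entries, which matters with ~450k files.
    with os.scandir(runs_dir) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            run_id, ext = os.path.splitext(entry.name)
            exts_by_run[run_id].add(ext)
    return sorted(r for r, exts in exts_by_run.items() if ".json" not in exts)
```

Calling it with the expanded path of `$HOME/.reflow/runs` would return the affected run IDs, which could then be compared against the IDs reported by reflow listbatch.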

prasadgopal commented 6 years ago

I meant each run produces a bunch of files.

olgabot commented 6 years ago

should I remove the $HOME/.reflow/runs folder to force these jobs to restart?

olgabot commented 6 years ago

I really don't understand what's going on .. reflow is definitely running:

 Thu 28 Jun - 20:40  ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison   origin ☊ master ✔ 4☀ 
 ubuntu@ip-172-31-42-179  ps all
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
4     0  1306     1  20   0  14472  1580 -      Ss+  ttyS0      0:00 /sbin/agetty --keep-baud 115200 38400 9600 ttyS0 vt220
4     0  1307     1  20   0  14656  1428 -      Ss+  tty1       0:00 /sbin/agetty --noclear tty1 linux
0  1000  5629  4891  20   0  21288  3324 wait   Ss   pts/2      0:00 /bin/bash
0  1000  5639  5629  20   0  42864  3184 sigsus S    pts/2      0:00 zsh
0  1000  5708  5639  20   0 361732 53448 ep_pol Sl+  pts/2      0:34 /home/ubuntu/anaconda/bin/python /home/ubuntu/anaconda/bin/jupyter-notebook --ip=*
0  1000  8802  4891  20   0  21288  3412 wait   Ss   pts/3      0:00 /bin/bash
0  1000  8812  8802  20   0  45312  5336 sigsus S    pts/3      0:01 zsh
0  1000 10231  4891  20   0  21288  3308 wait   Ss   pts/1      0:00 /bin/bash
0  1000 10241 10231  20   0  45212  3452 sigsus S    pts/1      0:01 zsh
0  1000 10746 10745  20   0  21392  5196 wait   Ss   pts/0      0:00 -bash
0  1000 10760 10746  20   0  25772  2840 -      S+   pts/0      0:00 screen -x
0  1000 11702 10241  20   0 696928 95704 poll_s Sl+  pts/1      2:13 /home/ubuntu/anaconda/bin/python /home/ubuntu/anaconda/bin/ipython
0  1000 11886  4891  20   0  21288  3324 wait   Ss   pts/4      0:00 /bin/bash
0  1000 11909 11886  20   0  45404  5132 sigsus S    pts/4      0:01 zsh
0  1000 12857  8812  20   0 8023124 6951520 futex_ Sl+ pts/3    8:12 reflow runbatch -retry
0  1000 13235 11909  20   0  40536  3740 poll_s S+   pts/4      0:00 top
0  1000 13292 28963  20   0  27636  1392 -      R+   pts/5      0:00 ps all
0  1000 28953  4891  20   0  21288  3256 wait   Ss   pts/5      0:00 /bin/bash
0  1000 28963 28953  20   0  45404  4940 sigsus S    pts/5      0:01 zsh
0  1000 30611  4891  20   0  21288  3328 wait_w Ss+  pts/6      0:00 /bin/bash
0  1000 30621  4891  20   0  21288  3256 wait_w Ss+  pts/7      0:00 /bin/bash

It's using a lot of CPU:

top - 20:50:09 up 4 days,  4:42,  7 users,  load average: 1.42, 1.18, 0.66
Tasks: 139 total,   2 running, 137 sleeping,   0 stopped,   0 zombie
%Cpu(s): 86.1 us,  4.4 sy,  0.0 ni,  9.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8173840 total,   180304 free,  7330220 used,   663316 buff/cache
KiB Swap:        0 total,        0 free,        0 used.   439872 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12857 ubuntu    20   0 8023124 6.629g  15264 S 105.3 85.0   8:50.93 reflow
11702 ubuntu    20   0  696928  95704   4260 S   0.7  1.2   2:13.96 ipython
13535 ubuntu    20   0   27236   8600   5344 R   0.7  0.1   0:00.02 aws
10745 ubuntu    20   0   92788   3140   2204 S   0.3  0.0   0:00.30 sshd
13235 ubuntu    20   0   40536   3740   3112 R   0.3  0.0   0:00.27 top
    1 root      20   0   37944   3628   1692 S   0.0  0.0   0:11.48 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:01.38 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    7 root      20   0       0      0      0 S   0.0  0.0   0:16.38 rcu_sched

But there's no olgabot@localhost (reflow) instance launched in the EC2 management console (yes I'm in the right region) and all of the jobs are waiting:

 Thu 28 Jun - 20:49  ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison   origin ☊ master ✔ 4☀ 
 ubuntu@ip-172-31-42-179  reflow listbatch | grep -v waiting | head
[no output]
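The ps and top output above may already hint at memory pressure on the submitting host. As a quick worked check, the reflow process's resident set size (the RSS column of ps, in KiB) can be compared with the total memory reported by top (also in KiB). The helper name `memory_share` is made up for this example.

```python
def memory_share(rss_kib: int, total_kib: int) -> float:
    """Return a process's resident memory as a fraction of total memory."""
    if total_kib <= 0:
        raise ValueError("total_kib must be positive")
    return rss_kib / total_kib


# Values copied from the ps (RSS) and top (KiB Mem total) output above.
share = memory_share(6951520, 8173840)
print(f"reflow RSS is {share:.1%} of total memory")  # agrees with top's 85.0 %MEM
```

With no swap configured and only about 430 MiB reported as available, there is little room left for the process to grow.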

prasadgopal commented 6 years ago

Does reflow batchinfo show anything more? Is there a way I can reproduce what you are doing on my side?

olgabot commented 6 years ago

wait it seems to be working now! I removed the runs folder, the state files, and all of the logs, and am running reflow runbatch -reset -retry -gc and now I see some ubuntu@localhost (reflow) instances running (since ubuntu is the username on my EC2 machine). Adding -user olgabot makes a weird error about permissions, so is there a way to have the machines be named olgabot@localhost (reflow) instead?

olgabot commented 6 years ago

reflow batchinfo showed this:

 Thu 28 Jun - 17:21  ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison   origin ☊ master ✔ 4☀ 
 ubuntu@ip-172-31-42-179  reflow batchinfo
run A1-B002427-3_39_F-1-1_trim=false_scaled=100: 12af8e26
    log: /home/ubuntu/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison/log.A1-B002427-3_39_F-1-1_trim=false_scaled=100
run A1-B002427-3_39_F-1-1_trim=false_scaled=1000: 9568d8b3
    log: /home/ubuntu/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison/log.A1-B002427-3_39_F-1-1_trim=false_scaled=1000
run A1-B002427-3_39_F-1-1_trim=false_scaled=1100: 6bcd3d1a
    log: /home/ubuntu/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison/log.A1-B002427-3_39_F-1-1_trim=false_scaled=1100
run A1-B002427-3_39_F-1-1_trim=false_scaled=1200: 8508d407
    log: /home/ubuntu/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison/log.A1-B002427-3_39_F-1-1_trim=false_scaled=1200

olgabot commented 6 years ago

In case it's useful to anyone, here are some of the failure modes I'm getting:

This is on a t2.2xlarge by the way

job killed

This looks like I hit Ctrl-C to kill the job which makes me think it ran out of memory on the instance

reflow: run A20-B000971-3_39_F-1-1_trim=false_scaled=700: error: put fc4d046a8e5814c3 sha256:f813c2806d624b66f0fc40377191f20a3fa052c66bd4ced79d4744bfc88abcf7 execconfig extern url s3://olgabot-maca/facs/sourmash_dna-only_trim=false_scaled=700/A20-B000971-3_39_F-1-1.signature resources {}: operation not supported: zombie alloc
ec2cluster: 10 instances: c5.9xlarge:5,m4.16xlarge:5 (<=$23.7/hr), total{mem:1.5TiB cpu:500 disk:2.4TiB intel_avx:500 intel_avx2:500 intel_avx512:180}, waiting{mem:207.4TiB cpu:6638 disk:3.2TiB}, pending{mem:0B cpu:0 disk:0B}
  allocate {mem:64.0GiB cpu:2 disk:1.0GiB}[51971]:  provisioning new instance  11m43s-19m35s
[1]    6369 killed     reflow runbatch -retry -reset -gc

 ✘  Fri 29 Jun - 20:20  ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison   origin ☊ master 4☀ 2‒ 

 ubuntu@ip-172-31-42-179 

IO timeouts

reflow: ec2cluster: error while waiting for offers: offers ec2-34-220-249-100.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-34-220-249-100.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: dial tcp 34.220.249.100:9000: i/o timeout
reflow: ec2cluster: error while waiting for offers: offers ec2-18-237-239-92.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-18-237-239-92.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: dial tcp: lookup ec2-18-237-239-92.us-west-2.compute.amazonaws.com on 172.31.0.2:53: dial udp 172.31.0.2:53: i/o timeout

prasadgopal commented 6 years ago

The job most likely got killed because reflow ran out of memory. I think the 52k samples are possibly what is causing it to OOM. How much memory do you have on the host machine where you are invoking reflow?

IO timeouts should be ok. reflow will recover from those errors.

mariusae commented 6 years ago

"This is on a t2.2xlarge by the way"

Is this the instance type where the reflow process is running, or is Reflow using instances of this type? I agree with @prasadgopal -- this error seems to indicate you were killed by the kernel.

olgabot commented 6 years ago

This is the instance type where the reflow process was running. It was indeed "OOM" in the kern.log
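For anyone wanting to confirm the same thing, a small scan of the kernel log for OOM-killer messages might look like the sketch below. It assumes the usual Linux message wording ("Out of memory: Kill process ..." on older kernels, "Out of memory: Killed process ..." on newer ones); `find_oom_kills` is a hypothetical helper, and the log location (commonly /var/log/kern.log on Ubuntu) is an assumption about the host.

```python
import re
from typing import Iterable, List, Tuple

# Matches the kernel OOM-killer summary line and captures PID and command.
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (pid, command) pairs for OOM-killer events found in log lines."""
    hits: List[Tuple[int, str]] = []
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

Passing an open log file to `find_oom_kills` and looking for a `reflow` entry whose PID matches the killed job would tie the "killed" message in the shell to a kernel OOM event.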

olgabot commented 6 years ago

Running on a machine with more memory solved my problem. I'm now running reflow on-prem with a beefy 128-core, 2TB RAM machine and haven't had any problems yet :) I do think users would appreciate Reflow showing an out-of-memory error so they can solve their problem more quickly, rather than restarting a bunch of jobs on a machine that can't handle all of them like I did.

Is the memory usage a reflection of the DAG used to create the jobs? Or of caching the results in between? On the smaller machine, Reflow would run fine for about an hour and then crapped out after some jobs had finished. To me that seems inconsistent with DAG creation being the cause, and more consistent with something happening to the cached results as jobs get done.