Closed. olgabot closed this issue 6 years ago.
All you need to do is run: reflow runbatch -retry
repair is used to repair the cache: "Repair is used to perform forward-migration of caching scheme, or back-filling when evaluations strategies change (e.g., bottomup vs. topdown evaluation)."
On Tue, Jun 26, 2018 at 2:48 PM Olga Botvinnik notifications@github.com wrote:
Hello, I ran this workflow https://github.com/czbiohub/reflow-workflows/blob/master/sourmash.rf on this batch https://github.com/czbiohub/kmer-hashing/tree/master/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison (52k samples, stderr + stdout are saved there too) and the job failed on my m4.large EC2 instance. Some jobs finished (below) but most didn't.
[image: screen shot 2018-06-26 at 2 45 37 pm] https://user-images.githubusercontent.com/806256/41941076-c2c49878-794f-11e8-8b0b-2739209f6ac2.png
However when I check reflow listbatch, there's still a bunch of jobs "waiting" and I'm not sure how to terminate or restart them.
✘ Tue 26 Jun - 21:39 ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison origin ☊ master ✔ 6☀
ubuntu@ip-172-31-42-179 reflow listbatch
A1-B002427-3_39_F-1-1_trim=false_scaled=100   12af8e26  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1000  9568d8b3  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1100  6bcd3d1a  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1200  8508d407  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1300  c38167a2  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1400  297c2a9f  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1500  6daec549  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1600  345e906f  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1700  1d2d96df  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1800  c2031915  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1900  8aefd12a  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=200   273354de  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=2000  d026b17e  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=2500  9e06336f  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=300   dee9cefc  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=3000  0efb3c07  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=3500  61df5887  waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=400   818f00a6  waiting
Looking into the documentation, it seems that reflow repair is what I should use here, but I don't understand what path refers to. I thought it was the path to the current batch folder, but none of what I tried worked:
✘ Tue 26 Jun - 21:40 ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison origin ☊ master ✔ 6☀
ubuntu@ip-172-31-42-179 reflow repair -help
usage: reflow repair -batch samples.csv path | repair path [args]
Repair performs cache repair by cache-assisted pseudo-evaluation of the provided reflow program. The program (evaluated with its arguments) is evaluated by performing logical cache lookups in place of executor evaluation. When values are missing and are immediately computable, they are computed. Flow nodes that are successfully computed this way are written back to the cache with all available keys. Repair is used to perform forward-migration of caching scheme, or back-filling when evaluations strategies change (e.g., bottomup vs. topdown evaluation).
Repair accepts command line arguments as in "reflow run" or parameters supplied via a CSV batch file as in "reflow runbatch".
Flags:
  -batch string
        batch file to process
  -getconcurrency int
        number of concurrent assoc gets (default 50)
  -help
        display subcommand help
  -writebackconcurrency int
        number of concurrent writeback threads (default 20)
Tue 26 Jun - 21:40 ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison origin ☊ master ✔ 6☀
ubuntu@ip-172-31-42-179 reflow repair -batch samples.csv .
unknown file extension "."
✘ Tue 26 Jun - 21:40 ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison origin ☊ master ✔ 6☀
ubuntu@ip-172-31-42-179 reflow repair -batch samples.csv
usage: reflow repair -batch samples.csv path | repair path [args]
Flags:
  -batch string
        batch file to process
  -getconcurrency int
        number of concurrent assoc gets (default 50)
  -help
        display subcommand help
  -writebackconcurrency int
        number of concurrent writeback threads (default 20)
✘ Tue 26 Jun - 21:40 ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison origin ☊ master ✔ 6☀
ubuntu@ip-172-31-42-179 reflow repair .
unknown file extension "."
Can you let me know how to proceed? Thank you! Warmest, Olga
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/grailbio/reflow/issues/48, or mute the thread https://github.com/notifications/unsubscribe-auth/AfC0Q3rhazykyblZb_atHE2ZBd6qfi4Nks5uAqwXgaJpZM4U4uwZ .
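For readers hitting the same confusion: judging from the usage string ("reflow repair -batch samples.csv path") and the 'unknown file extension "."' error, the positional path argument appears to be the Reflow program itself (a .rf file), not the batch directory. A hedged sketch, assuming the workflow from this thread is saved locally as sourmash.rf (the invocation itself is not verified against a live cluster), plus a minimal stand-in for the extension check:

```shell
# Assumption: `path` is the .rf workflow file, so the repair invocation
# for this batch would look like (unverified):
#
#   reflow repair -batch samples.csv sourmash.rf
#
# Minimal stand-in for the extension check reflow appears to perform:
check_rf() {
  case "$1" in
    *.rf) echo "ok: $1" ;;
    *)    echo "unknown file extension \"$1\"" ;;
  esac
}
check_rf sourmash.rf   # ok: sourmash.rf
check_rf .             # unknown file extension "."
```

Passing "." fails because reflow dispatches on the file extension, which a bare directory path does not have.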
Thanks, I see that now! What is the difference between -retry and -reset? Is -reset removing the cache for a run?
-retry will retry the runs that have failed and run the ones that never ran. -reset will reset the state of the batch from the previous runs and start fresh. This means that we will try to re-evaluate all the samples in the batch, but all intermediate results that were already computed and stored in the cache from previous runs will be reused.
Thanks @prasadgopal! My batches are failing within a few hours, so it seems like I'm doing something wrong overall and should fix my batch setup. How can I check how quickly the batch is dying, and what's contributing to the failure?
Use reflow info. From a toy example I ran:

reflow listbatch
1 c513ffff done
2 fc60e67a done
3 eead713f done
4 36281369 done
5 10eb2f7b done
reflow info c513ffff
c513ffffc99c76910a7f2289142fcc60a7caba39f3fafe75c29952630a4d66bf (run)
time: Thu Jun 28 09:54:10 2018
program: graph.rf
params:
de: 1
phase: Done
alloc:
ec2-54-189-99-164.us-west-2.compute.amazonaws.com:9000/6297baca77d89e0d
resources: {mem:3.5GiB cpu:1 disk:2.4TiB intel_avx:1}
result: val<>
log:
/Users/pgopal/.reflow/runs/c513ffffc99c76910a7f2289142fcc60a7caba39f3fafe75c29952630a4d66bf.execlog
Thanks @prasadgopal! For these jobs, I'm getting a memory error when I run reflow info on a particular job. I'm running 52k jobs on an m4.large machine, which has 8 GB of RAM. Should I be on a machine with more RAM?
✘ Thu 28 Jun - 18:49 ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison origin ☊ master ✔ 4☀
ubuntu@ip-172-31-42-179 reflow listbatch | head
A1-B002427-3_39_F-1-1_trim=false_scaled=100 9a2b6a27 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1000 cd1f2346 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1100 f72ab8bf waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1200 0f2ed555 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1300 baa7ca80 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1400 396fb5b8 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1500 86adda8c waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1600 7eb6ac08 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1700 c4767cd0 waiting
A1-B002427-3_39_F-1-1_trim=false_scaled=1800 56ed2cf9 waiting
Thu 28 Jun - 19:01 ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison origin ☊ master ✔ 4☀
ubuntu@ip-172-31-42-179 reflow info 9a2b6a27
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xbb8ade]
goroutine 1 [running]:
github.com/grailbio/reflow/tool.(*Cmd).printCacheInfo(0xc420418000, 0xf67200, 0xc4200af440, 0xf60980, 0xc420192500, 0x5, 0x276a2b9a, 0x0, 0x0, 0x0, ...)
/home/ubuntu/gocode/src/github.com/grailbio/reflow/tool/info.go:202 +0xce
github.com/grailbio/reflow/tool.(*Cmd).info(0xc420418000, 0xf67180, 0xc42041a500, 0xc420090080, 0x1, 0x1)
/home/ubuntu/gocode/src/github.com/grailbio/reflow/tool/info.go:62 +0x4f2
github.com/grailbio/reflow/tool.(*Cmd).Main(0xc420418000)
/home/ubuntu/gocode/src/github.com/grailbio/reflow/tool/main.go:298 +0x9fb
main.main()
/home/ubuntu/gocode/src/github.com/grailbio/reflow/cmd/reflow/main.go:62 +0x3fc
The stack trace you are seeing is some bug/inconsistency in the reflow code. Can you tell me if ls -l $HOME/.reflow/runs/9a2b6a27* returns something?
Yeah it shows this:
✘ Thu 28 Jun - 20:23 ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison origin ☊ master ✔ 4☀
ubuntu@ip-172-31-42-179 ls -l $HOME/.reflow/runs/9a2b6a27*
-rw-rw-r-- 1 ubuntu ubuntu 0 Jun 28 18:46 /home/ubuntu/.reflow/runs/9a2b6a27a3daf4956489a6db3ce392c993d2b7b682796838b9b634c7a88eb2f1.execlog
-rwxrwxr-x 1 ubuntu ubuntu 0 Jun 28 18:43 /home/ubuntu/.reflow/runs/9a2b6a27a3daf4956489a6db3ce392c993d2b7b682796838b9b634c7a88eb2f1.lock
Is this normal?
Also that folder has TONS of files!!!
Thu 28 Jun - 20:23 ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison origin ☊ master ✔ 4☀
ubuntu@ip-172-31-42-179 ls -1 $HOME/.reflow/runs| wc -l
450993
It is missing a .json file, which reflow info looks for to find the state of the runs. I'll need to dig a little into the code to see why the .json isn't being written out.

Yes, the runs dir has a folder for each run.
I meant each run produces a bunch of files.
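To see how widespread the missing state files are, one approach (a sketch, assuming the run-directory layout shown above: an <id>.execlog, an <id>.lock, and an <id>.json state file per run) is to list every run that has an execlog but no matching .json:

```shell
# List run IDs whose .json state file never appeared.
# Assumes the $HOME/.reflow/runs layout shown earlier in this thread;
# pass a different directory as the first argument to override.
missing_state() {
  runs="${1:-$HOME/.reflow/runs}"
  for log in "$runs"/*.execlog; do
    [ -e "$log" ] || continue          # glob matched nothing
    json="${log%.execlog}.json"
    [ -e "$json" ] || basename "$log" .execlog
  done
}
missing_state   # one run ID per line (nothing if all state files exist)
```

With ~450k files in the directory this will be slow but bounded; it only stats two files per run.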
Should I remove the $HOME/.reflow/runs folder to force these jobs to restart?
I really don't understand what's going on... reflow is definitely running:
Thu 28 Jun - 20:40 ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison origin ☊ master ✔ 4☀
ubuntu@ip-172-31-42-179 ps all
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
4 0 1306 1 20 0 14472 1580 - Ss+ ttyS0 0:00 /sbin/agetty --keep-baud 115200 38400 9600 ttyS0 vt220
4 0 1307 1 20 0 14656 1428 - Ss+ tty1 0:00 /sbin/agetty --noclear tty1 linux
0 1000 5629 4891 20 0 21288 3324 wait Ss pts/2 0:00 /bin/bash
0 1000 5639 5629 20 0 42864 3184 sigsus S pts/2 0:00 zsh
0 1000 5708 5639 20 0 361732 53448 ep_pol Sl+ pts/2 0:34 /home/ubuntu/anaconda/bin/python /home/ubuntu/anaconda/bin/jupyter-notebook --ip=*
0 1000 8802 4891 20 0 21288 3412 wait Ss pts/3 0:00 /bin/bash
0 1000 8812 8802 20 0 45312 5336 sigsus S pts/3 0:01 zsh
0 1000 10231 4891 20 0 21288 3308 wait Ss pts/1 0:00 /bin/bash
0 1000 10241 10231 20 0 45212 3452 sigsus S pts/1 0:01 zsh
0 1000 10746 10745 20 0 21392 5196 wait Ss pts/0 0:00 -bash
0 1000 10760 10746 20 0 25772 2840 - S+ pts/0 0:00 screen -x
0 1000 11702 10241 20 0 696928 95704 poll_s Sl+ pts/1 2:13 /home/ubuntu/anaconda/bin/python /home/ubuntu/anaconda/bin/ipython
0 1000 11886 4891 20 0 21288 3324 wait Ss pts/4 0:00 /bin/bash
0 1000 11909 11886 20 0 45404 5132 sigsus S pts/4 0:01 zsh
0 1000 12857 8812 20 0 8023124 6951520 futex_ Sl+ pts/3 8:12 reflow runbatch -retry
0 1000 13235 11909 20 0 40536 3740 poll_s S+ pts/4 0:00 top
0 1000 13292 28963 20 0 27636 1392 - R+ pts/5 0:00 ps all
0 1000 28953 4891 20 0 21288 3256 wait Ss pts/5 0:00 /bin/bash
0 1000 28963 28953 20 0 45404 4940 sigsus S pts/5 0:01 zsh
0 1000 30611 4891 20 0 21288 3328 wait_w Ss+ pts/6 0:00 /bin/bash
0 1000 30621 4891 20 0 21288 3256 wait_w Ss+ pts/7 0:00 /bin/bash
It's using a lot of CPU:
top - 20:50:09 up 4 days, 4:42, 7 users, load average: 1.42, 1.18, 0.66
Tasks: 139 total, 2 running, 137 sleeping, 0 stopped, 0 zombie
%Cpu(s): 86.1 us, 4.4 sy, 0.0 ni, 9.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8173840 total, 180304 free, 7330220 used, 663316 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 439872 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12857 ubuntu 20 0 8023124 6.629g 15264 S 105.3 85.0 8:50.93 reflow
11702 ubuntu 20 0 696928 95704 4260 S 0.7 1.2 2:13.96 ipython
13535 ubuntu 20 0 27236 8600 5344 R 0.7 0.1 0:00.02 aws
10745 ubuntu 20 0 92788 3140 2204 S 0.3 0.0 0:00.30 sshd
13235 ubuntu 20 0 40536 3740 3112 R 0.3 0.0 0:00.27 top
1 root 20 0 37944 3628 1692 S 0.0 0.0 0:11.48 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:01.38 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
7 root 20 0 0 0 0 S 0.0 0.0 0:16.38 rcu_sched
But there's no olgabot@localhost (reflow) instance launched in the EC2 management console (yes, I'm in the right region) and all of the jobs are waiting:
Thu 28 Jun - 20:49 ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison origin ☊ master ✔ 4☀
ubuntu@ip-172-31-42-179 reflow listbatch | grep -v waiting | head
[no output]
Does reflow batchinfo show anything more? Is there a way I can reproduce what you are doing on my side?
Wait, it seems to be working now! I removed the runs folder, the state files, and all of the logs, and am running reflow runbatch -reset -retry -gc, and now I see some ubuntu@localhost (reflow) instances running (since ubuntu is the username on my EC2 machine). Adding -user olgabot gives a weird error about permissions, so is there a way to have the machines be named olgabot@localhost (reflow) instead?
reflow batchinfo showed this:
Thu 28 Jun - 17:21 ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison origin ☊ master ✔ 4☀
ubuntu@ip-172-31-42-179 reflow batchinfo
run A1-B002427-3_39_F-1-1_trim=false_scaled=100: 12af8e26
log: /home/ubuntu/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison/log.A1-B002427-3_39_F-1-1_trim=false_scaled=100
run A1-B002427-3_39_F-1-1_trim=false_scaled=1000: 9568d8b3
log: /home/ubuntu/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison/log.A1-B002427-3_39_F-1-1_trim=false_scaled=1000
run A1-B002427-3_39_F-1-1_trim=false_scaled=1100: 6bcd3d1a
log: /home/ubuntu/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison/log.A1-B002427-3_39_F-1-1_trim=false_scaled=1100
run A1-B002427-3_39_F-1-1_trim=false_scaled=1200: 8508d407
log: /home/ubuntu/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison/log.A1-B002427-3_39_F-1-1_trim=false_scaled=1200
In case it's useful to anyone, here are some of the failure modes I'm getting. This is on a t2.2xlarge, by the way.

Job killed

This looks like I hit Ctrl-C to kill the job, which makes me think it ran out of memory on the instance:
reflow: run A20-B000971-3_39_F-1-1_trim=false_scaled=700: error: put fc4d046a8e5814c3 sha256:f813c2806d624b66f0fc40377191f20a3fa052c66bd4ced79d4744bfc88abcf7 execconfig extern url s3://olgabot-maca/facs/sourmash_dna-only_trim=false_scaled=700/A20-B000971-3_39_F-1-1.signature resources {}: operation not supported: zombie alloc
ec2cluster: 10 instances: c5.9xlarge:5,m4.16xlarge:5 (<=$23.7/hr), total{mem:1.5TiB cpu:500 disk:2.4TiB intel_avx:500 intel_avx2:500 intel_avx512:180}, waiting{mem:207.4TiB cpu:6638 disk:3.2TiB}, pending{mem:0B cpu:0 disk:0B}
allocate {mem:64.0GiB cpu:2 disk:1.0GiB}[51971]: provisioning new instance 11m43s-19m35s
[1] 6369 killed reflow runbatch -retry -reset -gc
✘ Fri 29 Jun - 20:20 ~/kmer-hashing/sourmash/maca/facs_v5_1000cell_dna-only_scaled_trim_comparison origin ☊ master 4☀ 2‒
ubuntu@ip-172-31-42-179
IO timeouts

reflow: ec2cluster: error while waiting for offers: offers ec2-34-220-249-100.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-34-220-249-100.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: dial tcp 34.220.249.100:9000: i/o timeout
reflow: ec2cluster: error while waiting for offers: offers ec2-18-237-239-92.us-west-2.compute.amazonaws.com:9000: network error: Get https://ec2-18-237-239-92.us-west-2.compute.amazonaws.com:9000/v1/offers%2F: dial tcp: lookup ec2-18-237-239-92.us-west-2.compute.amazonaws.com on 172.31.0.2:53: dial udp 172.31.0.2:53: i/o timeout
The job got killed most likely because reflow ran out of memory. I think the 52k samples are possibly what's causing it to OOM. How much memory do you have on the host machine where you're invoking reflow?

IO timeouts should be OK; reflow will recover from those errors.
This is on a t2.2xlarge by the way

This is the instance type where the reflow process is running, or Reflow is using instances of this type? I agree with @prasadgopal -- this error seems to indicate you were killed by the kernel.
This is the instance type where the reflow process was running. It was indeed "OOM" in the kern.log.
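For anyone else diagnosing this: the kernel logs OOM kills, so grepping the log will usually confirm whether the driver process was the victim. A sketch (the log path and message wording vary by distro and kernel version; the sample line below is synthetic, modeled on the "[1] 6369 killed reflow runbatch" output above):

```shell
# Look for OOM-killer entries naming the reflow driver process.
# On Ubuntu the log is typically /var/log/kern.log and the message
# resembles "Out of memory: Kill process <pid> (<name>) ...".
oom_hits() {
  grep -i "out of memory" "$1" | grep -i "reflow"
}

# Demonstrate against a captured sample line rather than a live kern.log:
printf '%s\n' \
  "Jun 29 20:20:01 ip-172-31-42-179 kernel: Out of memory: Kill process 6369 (reflow) score 900 or sacrifice child" \
  > /tmp/kern.sample
oom_hits /tmp/kern.sample
```

On a real host you would run oom_hits /var/log/kern.log (or check dmesg) instead of the sample file.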
Running on a machine with more memory solved my problem. I'm now running reflow on-prem on a beefy 128-core, 2 TB RAM machine and haven't had any problems yet :) I do think users would appreciate Reflow surfacing an out-of-memory error, so they can solve the problem quickly rather than restarting a bunch of jobs on a machine that can't handle all of them, like I did.

Is the memory usage a reflection of the DAG used to create the jobs? Or of caching the results in between? Because on the smaller machine, Reflow would run fine for about an hour and then crap out after some jobs had finished, which to me is inconsistent with the cost being in DAG creation, and points instead to something happening to the cached results as jobs get done.