ncsa / Swift-T-Variant-Calling


BW Error when running 601 Node run #22

Open jacobrh91 opened 7 years ago

jacobrh91 commented 7 years ago

When running a 601-node Swift/T run analyzing 1200 samples (with two tasks running in parallel on each node), a failure occurred: a truncated alignment file, SRR1293800, was detected.

jacobrh91 commented 7 years ago

Swift/T 601 Node BW run investigation:

An error in a samtools view -c command killed a task, and then the Swift/T run itself. Job ID: 7403567.bw

The file truly is truncated; SRR1293800 is the ID.
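
For reference, a couple of quick ways to confirm a truncation like this from the command line; the samtools view -c form matches the check the pipeline runs, while samtools quickcheck only exists in newer samtools releases, so treat the second command as optional:

# Counting alignments; on the truncated file this reports a truncation error, as seen in the failed task.
samtools view -c SRR1293800.noDedups.bam

# samtools >= 1.3 can also validate the file's EOF block directly.
samtools quickcheck -v SRR1293800.noDedups.bam && echo "ok" || echo "truncated or corrupt"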

The file sizes and completion times for all of the aligned BAMs are below:

-rw------- 1 jacobrh ILL_baib 0 Aug 29 14:23 SRR1247973.noDedups.bam
-rw------- 1 jacobrh ILL_baib 0 Aug 29 14:23 SRR1248960.noDedups.bam
-rw------- 1 jacobrh ILL_baib 0 Aug 29 14:23 SRR1259003.noDedups.bam
-rw------- 1 jacobrh ILL_baib 0 Aug 29 14:23 SRR1283156.noDedups.bam
-rw------- 1 jacobrh ILL_baib 0 Aug 29 14:23 SRR1287757.noDedups.bam
-rw------- 1 jacobrh ILL_baib 0 Aug 29 14:23 SRR1291478.noDedups.bam
-rw------- 1 jacobrh ILL_baib 0 Aug 29 14:23 SRR1302319.noDedups.bam
-rw------- 1 jacobrh ILL_baib 0 Aug 29 14:23 SRR1422593.noDedups.bam
-rw------- 1 jacobrh ILL_baib 0 Aug 29 14:23 SRR1463272.noDedups.bam
-rw------- 1 jacobrh ILL_baib 0 Aug 29 14:23 SRR1533547.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.1G Aug 29 14:23 SRR1501853.noDedups.bam
-rw------- 1 jacobrh ILL_baib 128M Aug 29 14:23 SRR1280055.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.2G Aug 29 14:23 SRR1186145.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.2G Aug 29 14:23 SRR1260656.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.2G Aug 29 14:23 SRR1267708.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.2G Aug 29 14:23 SRR1284028.noDedups.bam
-rw------- 1 jacobrh ILL_baib 12G Aug 29 14:23 SRR1293800.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.2G Aug 29 14:23 SRR1422619.noDedups.bam
-rw------- 1 jacobrh ILL_baib 12M Aug 29 14:23 SRR1276009.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.3G Aug 29 14:23 SRR1186172.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.3G Aug 29 14:23 SRR1287816.noDedups.bam
-rw------- 1 jacobrh ILL_baib 14G Aug 29 14:19 SRR1275961.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.4G Aug 29 14:23 SRR1210815.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.4G Aug 29 14:23 SRR1222904.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.4G Aug 29 14:23 SRR1280559.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.4G Aug 29 14:23 SRR1313052.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.4G Aug 29 14:23 SRR1427543.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.4G Aug 29 14:23 SRR1533495.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.4G Aug 29 14:23 SRR1533543.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.5G Aug 29 14:23 SRR1287779.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.5G Aug 29 14:23 SRR1303897.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.5G Aug 29 14:23 SRR1423668.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.6G Aug 29 14:23 SRR1248980.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.6G Aug 29 14:23 SRR1250622.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.6G Aug 29 14:23 SRR1289675.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.6G Aug 29 14:23 SRR1304080.noDedups.bam
-rw------- 1 jacobrh ILL_baib 172M Aug 29 14:23 SRR1275553.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.7G Aug 29 14:23 SRR1280045.noDedups.bam
-rw------- 1 jacobrh ILL_baib 188M Aug 29 14:23 SRR1533472.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.8G Aug 29 14:23 SRR1279231.noDedups.bam
-rw------- 1 jacobrh ILL_baib 1.9G Aug 29 14:23 SRR1531508.noDedups.bam
-rw------- 1 jacobrh ILL_baib 203M Aug 29 14:23 SRR1533537.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.1G Aug 29 14:23 SRR1257620.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.1G Aug 29 14:23 SRR1501879.noDedups.bam
-rw------- 1 jacobrh ILL_baib 235M Aug 29 14:23 SRR1263110.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.3G Aug 29 14:23 SRR1303852.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.3G Aug 29 14:23 SRR1533476.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.3G Aug 29 14:23 SRR1533511.noDedups.bam
-rw------- 1 jacobrh ILL_baib 243M Aug 29 14:23 SRR1279479.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.4G Aug 29 14:23 SRR1232209.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.4G Aug 29 14:23 SRR1248235.noDedups.bam
-rw------- 1 jacobrh ILL_baib 251M Aug 29 14:23 SRR1253922.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.5G Aug 29 14:23 SRR1262881.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.5G Aug 29 14:23 SRR1501718.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.6G Aug 29 14:23 SRR1198054.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.6G Aug 29 14:23 SRR1247993.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.6G Aug 29 14:23 SRR1254226.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.6G Aug 29 14:23 SRR1275881.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.6G Aug 29 14:23 SRR1303858.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.6G Aug 29 14:23 SRR1533549.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.6G Aug 29 14:23 SRR1533573.noDedups.bam
-rw------- 1 jacobrh ILL_baib 279M Aug 29 14:23 SRR1302250.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.7G Aug 29 14:23 SRR1198018.noDedups.bam
-rw------- 1 jacobrh ILL_baib 287M Aug 29 14:23 SRR1239432.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.8G Aug 29 14:23 SRR1222847.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.8G Aug 29 14:23 SRR1275825.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.9G Aug 29 14:23 SRR1264786.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.9G Aug 29 14:23 SRR1267637.noDedups.bam
-rw------- 1 jacobrh ILL_baib 2.9G Aug 29 14:23 SRR1291474.noDedups.bam
-rw------- 1 jacobrh ILL_baib 307M Aug 29 14:23 SRR1279557.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.0G Aug 29 14:23 SRR1239395.noDedups.bam
-rw------- 1 jacobrh ILL_baib 323M Aug 29 14:23 SRR1463284.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.2G Aug 29 14:23 SRR1262876.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.3G Aug 29 14:23 SRR1267348.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.3G Aug 29 14:23 SRR1280571.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.3G Aug 29 14:23 SRR1286152.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.5G Aug 29 14:23 SRR1276134.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.6G Aug 29 14:23 SRR1248029.noDedups.bam
-rw------- 1 jacobrh ILL_baib 379M Aug 29 14:23 SRR1335748.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.7G Aug 29 14:23 SRR1271936.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.7G Aug 29 14:23 SRR1302544.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.8G Aug 29 14:23 SRR1222759.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.8G Aug 29 14:23 SRR1275659.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.8G Aug 29 14:23 SRR1279058.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.9G Aug 29 14:23 SRR1268849.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.9G Aug 29 14:23 SRR1269049.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.9G Aug 29 14:23 SRR1279696.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.9G Aug 29 14:23 SRR1284026.noDedups.bam
-rw------- 1 jacobrh ILL_baib 3.9G Aug 29 14:23 SRR1336764.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.0G Aug 29 14:23 SRR1267877.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.1G Aug 29 14:23 SRR1463308.noDedups.bam
-rw------- 1 jacobrh ILL_baib 426M Aug 29 14:23 SRR1289679.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.2G Aug 29 14:23 SRR1284149.noDedups.bam
-rw------- 1 jacobrh ILL_baib 438M Aug 29 14:23 SRR1263135.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.3G Aug 29 14:23 SRR1210921.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.3G Aug 29 14:23 SRR1275815.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.4G Aug 29 14:23 SRR1290405.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.4G Aug 29 14:23 SRR1501889.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.4G Aug 29 14:23 SRR1533496.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.5G Aug 29 14:23 SRR1262925.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.5G Aug 29 14:23 SRR1285060.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.5G Aug 29 14:23 SRR1531509.noDedups.bam
-rw------- 1 jacobrh ILL_baib 470M Aug 29 14:23 SRR1247942.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.7G Aug 29 14:23 SRR1260711.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.7G Aug 29 14:23 SRR1291375.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.8G Aug 29 14:23 SRR1262869.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.8G Aug 29 14:23 SRR1284014.noDedups.bam
-rw------- 1 jacobrh ILL_baib 4.8G Aug 29 14:23 SRR1291446.noDedups.bam
-rw------- 1 jacobrh ILL_baib 5.0G Aug 29 14:23 SRR1287763.noDedups.bam
-rw------- 1 jacobrh ILL_baib 514M Aug 29 14:23 SRR1291566.noDedups.bam
-rw------- 1 jacobrh ILL_baib 5.1G Aug 29 14:23 SRR1248137.noDedups.bam
-rw------- 1 jacobrh ILL_baib 5.1G Aug 29 14:23 SRR1284992.noDedups.bam
-rw------- 1 jacobrh ILL_baib 5.2G Aug 29 14:23 SRR1303935.noDedups.bam
-rw------- 1 jacobrh ILL_baib 5.5G Aug 29 14:23 SRR1250702.noDedups.bam
-rw------- 1 jacobrh ILL_baib 566M Aug 29 14:23 SRR1198179.noDedups.bam
-rw------- 1 jacobrh ILL_baib 605M Aug 29 14:23 SRR1222883.noDedups.bam
-rw------- 1 jacobrh ILL_baib 605M Aug 29 14:23 SRR1279148.noDedups.bam
-rw------- 1 jacobrh ILL_baib 6.0G Aug 29 14:23 SRR1302532.noDedups.bam
-rw------- 1 jacobrh ILL_baib 613M Aug 29 14:23 SRR1260627.noDedups.bam
-rw------- 1 jacobrh ILL_baib 6.1G Aug 29 14:23 SRR1280310.noDedups.bam
-rw------- 1 jacobrh ILL_baib 6.2G Aug 29 14:23 SRR1290401.noDedups.bam
-rw------- 1 jacobrh ILL_baib 637M Aug 29 14:23 SRR1232136.noDedups.bam
-rw------- 1 jacobrh ILL_baib 637M Aug 29 14:23 SRR1248972.noDedups.bam
-rw------- 1 jacobrh ILL_baib 6.3G Aug 29 14:23 SRR1533539.noDedups.bam
-rw------- 1 jacobrh ILL_baib 6.5G Aug 29 14:23 SRR1275745.noDedups.bam
-rw------- 1 jacobrh ILL_baib 661M Aug 29 14:23 SRR1271929.noDedups.bam
-rw------- 1 jacobrh ILL_baib 669M Aug 29 14:23 SRR1222749.noDedups.bam
-rw------- 1 jacobrh ILL_baib 697M Aug 29 14:23 SRR1254284.noDedups.bam
-rw------- 1 jacobrh ILL_baib 72M Aug 29 14:23 SRR1279133.noDedups.bam
-rw------- 1 jacobrh ILL_baib 7.3G Aug 29 14:23 SRR1291597.noDedups.bam
-rw------- 1 jacobrh ILL_baib 7.3G Aug 29 14:23 SRR1295897.noDedups.bam
-rw------- 1 jacobrh ILL_baib 745M Aug 29 14:23 SRR1222826.noDedups.bam
-rw------- 1 jacobrh ILL_baib 8.1G Aug 29 14:23 SRR1267732.noDedups.bam
-rw------- 1 jacobrh ILL_baib 876M Aug 29 14:23 SRR1291456.noDedups.bam
-rw------- 1 jacobrh ILL_baib 888M Aug 29 14:23 SRR1210916.noDedups.bam
-rw------- 1 jacobrh ILL_baib 888M Aug 29 14:23 SRR1291538.noDedups.bam
-rw------- 1 jacobrh ILL_baib 8.8G Aug 29 14:23 SRR1279909.noDedups.bam
-rw------- 1 jacobrh ILL_baib 9.0G Aug 29 14:23 SRR1279623.noDedups.bam
-rw------- 1 jacobrh ILL_baib 928M Aug 29 14:23 SRR1291492.noDedups.bam
-rw------- 1 jacobrh ILL_baib 96M Aug 29 14:23 SRR1289721.noDedups.bam
-rw------- 1 jacobrh ILL_baib 9.7G Aug 29 14:23 SRR1210934.noDedups.bam

It appears that the job was killed at 14:23. A tracejob on this ID confirms this: Starttime: 08/29/2017 14:02:35, Endtime: 08/29/2017 14:24:40.
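
For reference, a sketch of that check, assuming the standard Torque tracejob utility and the numeric part of the job ID quoted above:

# Scans the scheduler logs for the job and prints its lifecycle, including start and end times.
tracejob 7403567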

Only one alignment appears to have been completed: -rw------- 1 jacobrh ILL_baib 14G Aug 29 14:19 SRR1275961.noDedups.bam

This file is not truncated.

Another point to note: although we requested 601 nodes to analyze 1200 samples with 2 "programs" per node (I verified that it was written correctly in the runfile), only around 140 alignment files are present in the output folder.

jacobrh91 commented 7 years ago

Maybe I am not using enough TURBINE_SERVERS to coordinate the execution of so many tasks (the default is 1, I believe).

0.000 WORK TYPES: WORK
0.000 WORKERS: 1201 RANKS: 0 - 1200
0.000 SERVERS: 1 RANKS: 1201 - 1201
0.000 WORK WORKERS: 1201 RANKS: 0 - 1200
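
A minimal sketch of how the server count could be raised for a quick layout test; this assumes Turbine reads the count from the ADLB_SERVERS environment variable, whereas the production runs may set it through the pipeline's runfile and qsub script instead:

# Placeholder values for a small login-node test, not the 601-node production layout.
export ADLB_SERVERS=2    # reserve 2 ranks as ADLB servers instead of the default 1
export TURBINE_LOG=1     # reprint the WORKERS/SERVERS rank layout shown above
~wozniak/Public/sfw/login/swift-t/stc/bin/swift-t -n 8 VariantCalling.swift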

jacobrh91 commented 7 years ago

After running the pipeline with 610 nodes and specifying 20 ADLB_SERVERS, I found that it now failed during a call to mkdir when making the timinglog directory in the tmp location. So I added the following line to the VariantCalling.swift source file:

mkdir(strcat(variables["TMPDIR"], "/timinglogs")) =>

I retried running the pipeline and a mkdir error still occurs, but now it happens here:

CAUGHT ERROR:
app execution failed on: nid24730
 shell: Command failed with exit code: 139
    invoked from within
"c::sync_exec $stdin_src  $stdout_dst $stderr_dst $cmd {*}$args "
 command: mkdir
    while executing
"error $msg"
    (procedure "turbine_error" line 8)
    invoked from within
"turbine_error "app execution failed" on: [ c_utils::hostname ]  "\n $msg" "\n command: $cmd""
    (procedure "app_error" line 13)
    invoked from within
"app_error $tries $options $cmd {*}$args"
    (procedure "turbine::exec_external" line 32)
    invoked from within
"turbine::exec_external "mkdir" [ dict create ] "-p" ${v:AlignDir}"
    (procedure "mkdir-app-leaf1-3" line 4)
    invoked from within
"mkdir-app-leaf1-3 /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1366248/align/"
Turbine worker task error in: mkdir-app-leaf1-3 /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1366248/align/

To me this suggests that the error occurs when making the directory at /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1366248/align/

After checking manually, I found that this directory already exists. Swift/T's mkdir function is supposed to be implemented with the -p flag, which does not throw an error if the directory already exists.
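
A quick shell demonstration of that -p behavior:

mkdir -p /tmp/mkdir_p_demo/align   # creates the directory tree
mkdir -p /tmp/mkdir_p_demo/align   # repeated call: no error
echo $?                            # prints 0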

j-woz commented 7 years ago

Can you remind me which installation this is? I can fix it.

jacobrh91 commented 7 years ago

I am using the installation you set up on Blue Waters.

By typing the following:

~wozniak/Public/sfw/login/swift-t/stc/bin/swift-t -v

The following output is returned:

STC: Swift-Turbine Compiler 0.7.3
     for Turbine: 1.0.0
Using Java VM:    /opt/java/jdk1.8.0_51/bin/java
Using Turbine in: /u/sciteam/wozniak/Public/sfw/login/swift-t/turbine

Turbine 1.0.0
 in: /u/sciteam/wozniak/Public/sfw/login/swift-t/turbine
 installed from: /u/sciteam/wozniak/proj/pub/login/swift-t/turbine/code
 using MPI: /mnt/a/u/sciteam/wozniak/Public/login/mpich-3.1/lib
 using Tcl: /usr/bin/tclsh8.5
 using Python:  python2.7

I appreciate your help on this! Thank you

j-woz commented 7 years ago

I just checked and this installation is using a Swift/T recent enough to have this app definition:

app (void v) mkdir(string dirname) { "mkdir" "-p" dirname; }

It is visible here: ~wozniak/Public/sfw/login/swift-t/turbine/export/unix.swift

Can you try doing a short run with the environment variable TURBINE_LOG=1? That will show each command before it is executed and may reveal the issue. We could also make this do mkdir -vp for debugging. I can do a call if you want to do screen sharing.

Note that there is a compute installation too. The difference is the MPI library used by Swift/T.

jacobrh91 commented 7 years ago

I have been running the tests with TURBINE_LOG=1, so I should already have this info.

I've copied the preceding lines below. Since this is a run with 1200 samples being processed, there is a good chance that these lines do not correspond directly to the sample that failed (different IDs, etc.), but the commands should be the same.

   1.998 store: <29084>="/scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1451156/logs/SRR1451156_Alignment.log"
   1.859 exec: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_tmp/timinglogs/ {}
   1.906 shell: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1380081/variant/
   2.151 rule: <30348> => alignReads-async_op-eq_string 30049 30348 {file 29878 is_mapped 1} {file 13314 is_mapped 1} {22884} {10147} {10763} 29721 {file 30122 is_mapped 1} {-1}
   2.993 allocated u:threads=<39890> t:2=<38767> t:5=<40929> t:7=<39910> t:11=<39552>
   2.993 allocated t:13=<40021> t:29=<38994>
   2.993 creating reference: <-1>["CORES_PER_NODE"] <- <*38767> (string)
   2.153 shell command duration: 0.295
   1.877 allocated u:threads=<29078> t:2=<29229> t:5=<29510> t:7=<28653> t:11=<29082>
   1.877 allocated t:13=<29286> t:29=<29102>
   1.859 shell: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_tmp/timinglogs/
   1.998 rule: <29084> <29261> <28095> <28115> <11951> <30592> => bwa_mem-app-leaf1 20086.0 20086.1 13694.0 29084 29261 28095 28115 0 1 SRR1451156 {SRR1451156  ALL     BWAMEM start    1504282873
} {file 30652 is_mapped 1} {file 13694 is_mapped 1} 11951 30592 {file 29806 is_mapped 1}
   2.268 function: alignReads
   1.876 store: <7312>=void
   1.851 shell command duration: 0.296
   2.074 shell: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1345122/realign/
   1.751 shell: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1428317/realign/
   2.137 rule: <29567> => alignReads-async_op-strcat-5 29567 {5031} {file 30068 is_mapped 1}
   1.783 shell command duration: 0.467
   1.877 shell command duration: 0.329
   2.291 allocated u:threads=<31679> t:2=<31902> t:5=<32271> t:7=<31908> t:11=<31699>
   2.291 allocated t:13=<31538> t:29=<31922>
   1.770 rule: <28301> => alignReads-async_op-eq_string 28304 28301 {file 29577 is_mapped 1} {file 7920 is_mapped 1} {9761} {6219} {6468} 29537 {file 30020 is_mapped 1} {-1}
   1.170 store: <27829>="@RG\tID:SRR1396467\tLB:\tPL:ILLUMINA\tPU:SRR1396467\tSM:SRR1396467\tCN:"
   2.161 rule: <30193> => alignReads-async_op-strcat-4 30296 30193 {file 30047 is_mapped 1} {10062}
   2.161 rule: <30441> => alignReads-async_op-strcat-5 30441 {10062} {file 29495 is_mapped 1}
   2.140 exec: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1279946/variant/ {}
   1.569 shell command duration: 0.338
   1.903 exec: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1450209/align/ {}
   2.106 allocated u:threads=<29607> t:2=<30369> t:5=<29744> t:7=<30342> t:11=<30108>
   2.106 allocated t:13=<30159> t:29=<31240>
   1.789 exec: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1378648/align/ {}
   1.789 shell: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1378648/align/
   2.078 shell: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1412258/realign/
   1.899 creating reference: <-1>["CORES_PER_NODE"] <- <*30097> (string)
   1.864 creating reference: <-1>["BWAEXE"] <- <*28513> (string)
   1.864 creating reference: <-1>["BWAINDEX"] <- <*29817> (string)
   2.271 creating reference: <-1>["ALIGNERTOOL"] <- <*31136> (string)
   2.271 rule: <31136> => alignReads-async_op-eq_string 32297 31136 {file 31330 is_mapped 1} {file 22170 is_mapped 1} {22961} {18211} {18212} 32845 {file 32252 is_mapped 1} {-1}
   1.179 shell: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1276015/logs/
   1.818 shell command duration: 0.696
   2.185 exec: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1291456/variant/ {}
   1.159 allocated u:mkdirSignal=<26313> t:96:1=<26818> u:threads=<27010> t:140:1=<27792> t:143:1=<26294>
   1.159 allocated t:145:1=<26333> t:147:1=<27108> t:149:1=<26353>
   1.159 array_kv_build: <27108> 1 elems, write_decr 1
   2.186 shell: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1291456/variant/
   1.819 exec: mkdir -p /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1259071/variant/ {}
CAUGHT ERROR:
app execution failed on: nid16561
 shell: Command failed with exit code: 139
    invoked from within
"c::sync_exec $stdin_src  $stdout_dst $stderr_dst $cmd {*}$args "
 command: mkdir
    while executing
"error $msg"
    (procedure "turbine_error" line 8)
    invoked from within
"turbine_error "app execution failed" on: [ c_utils::hostname ]  "\n $msg" "\n command: $cmd""
    (procedure "app_error" line 13)
    invoked from within
"app_error $tries $options $cmd {*}$args"
    (procedure "turbine::exec_external" line 32)
    invoked from within
"turbine::exec_external "mkdir" [ dict create ] "-p" ${v:LogDir}"
    (procedure "mkdir-app-leaf1-2" line 4)
    invoked from within
"mkdir-app-leaf1-2 /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1276015/align/ /scratch/sciteam/jacobrh/purge_exempt/..."
Turbine worker task error in: mkdir-app-leaf1-2 /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1276015/align/ /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1276015/logs/ /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1276015/realign/ /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_output/SRR1276015/variant/ /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/610Nodes/610Node_tmp/timinglogs/ 29165
    invoked from within
"c::worker_loop $WORK_TYPE($mode) $keyword_args"
    (procedure "standard_worker" line 27)
    invoked from within
"standard_worker $rules $startup_cmd "
    (procedure "enter_mode_unchecked" line 5)
    invoked from within
"enter_mode_unchecked $rules $startup_cmd"
    (procedure "enter_mode" line 10)
    invoked from within
"enter_mode $rules $startup_cmd "

j-woz commented 7 years ago

Thanks for the log. I now realize that exit code 139 is signal 11 (SIGSEGV), which means mkdir crashed. Something could be going wrong in the filesystem. Could you try our app retry feature? http://swift-lang.github.io/swift-t/guide.html#_app_retry
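
For reference, the mapping can be checked directly in a shell (exit statuses above 128 mean "terminated by signal status - 128"):

echo $(( 139 - 128 ))   # prints 11
kill -l 11              # prints SEGV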

jacobrh91 commented 7 years ago

I set TURBINE_APP_RETRIES=6 and am rerunning the 1200 sample test. I'll let you know how it goes.

Thanks for your help!

jacobrh91 commented 7 years ago

So I ran the test again and got past the mkdir stages, but it now failed at a different spot. It was an app call failure, which I was experiencing earlier.

Basically, after the first major program produces output, the "checkBam" function determines whether or not that output file is empty. To do that, we use a tool called Samtools.

The call to Samtools returns an error stating that the file it is looking at is truncated, which has usually suggested to us that Swift/T is not waiting for the output file to finish being written before trying to process it.

Here is (what I think to be) the relevant section of code:

alignedbam, tmpsamtoolsLog = samtools_view_logged(vars["SAMTOOLSEXE"], alignedsam, threads, ["-u"], sampleName);

    // Verify alignment was successful
    if ( checkBam(vars, alignedbam) ) {
                /*****
                DO STUFF
                ******/

And for reference, here is the definition of the checkBam function:

(boolean success) checkBam(string vars[string], file bamFile) {                            
    // Swift/T would try to use samtools_view2 even before bamFile is read. Added explicit wait command     
    wait (bamFile) {
        int alignNum = samtools_view2(vars["SAMTOOLSEXE"], filename(bamFile));              
        success = alignNum > 0;                                         
    }
} 

Is it reasonable for us to assume that Swift/T will wait for alignedbam to be fully written before trying to use the checkBam function on it? Of course, the alignedbam file was mapped to the filesystem in an earlier section of code, but I figured that is not important for the dataflow style of execution.

jacobrh91 commented 7 years ago

Another question,

In this run we are trying to analyze 1200 samples in parallel on 610 nodes with 2 processes per node (20 processes are set to be ADLB_SERVERS).
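
(For reference: 610 nodes × 2 processes per node gives 1220 ranks; with 20 reserved as ADLB servers, that leaves 1200 worker ranks, i.e. one per sample.)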

However, I've noticed that during these test runs, only around 100 or so output files are being produced (100 samples are being worked on) by the time the workflow fails.

I am assuming that the reason we are not seeing the maximum level of parallelism is that managing all of this work requires even more than 20 ADLB_SERVERS.

My first question is: has anyone you have worked with analyzed anything at this scale before?

And secondly, is there some rough guideline that we can use to determine how many ADLB_SERVERS we should use for a given batch size?

Maybe 1 SERVER for every 50 WORKER processes or something like that?

j-woz commented 7 years ago

I think 1 server overall should be fine for your workflow. We may want to try to trace execution to see what is running when.

jacobrh91 commented 6 years ago

Update: things have moved on quite a bit since this ticket was created:

I successfully ran a test using novosort for marking duplicates on 101 nodes (with 1 ADLB_SERVER). It did not work when using Picard to mark duplicates, for reasons I never figured out.

When I had the successful run on 101 nodes, I had a 1-to-1 correspondence between # of samples and # of nodes.

I tried scaling out to 601 nodes with 1200 samples (2 samples per node), marking duplicates with novosort, and it failed at the HaplotypeCaller stage. The error message for the failing sample complained about not having enough memory (the Java heap was set to 16g, which really should be plenty).

Tried again using only 600 samples, and reran the Swift qsub-creating script. It still failed and gave the same error message.

I will try again after increasing the heap size to 32g (which should obviously be way more than enough).

j-woz commented 6 years ago

Good to hear. I have more time next week, so I can discuss this further or try to run it, etc.

jacobrh91 commented 6 years ago

So, we have tried a lot of things to get the pipeline to run with 600 nodes (and we even tried only 300 nodes) and could not get it to work. The details of what options we tried and what went wrong can be found in the test/BWtests folder in the repo.

For now, we are just going to put this problem on hold.

j-woz commented 6 years ago

I should be able to run this over the next few days. About what fraction of the tasks fail?

jacobrh91 commented 6 years ago

That is hard to say, but whenever one task fails, it kills that workflow. Even after repeated attempts, none of the runs have made it past the last stage (HaplotypeCaller). And since different samples fail on different runs, we think the problem is not simply the presence of a faulty sample.

j-woz commented 6 years ago

I have made a few changes to make it runnable by others but I am not sure how long this is going to take. Currently, it is trying to make a directory in your scratch space. Can you fix this up so others can run it?

j-woz commented 6 years ago

If the main error you are getting is about Java -Xmx, then this is definitely Java exceeding the user-specified memory limit. I would try to gather the cases that cause this to happen and run them on some other system to analyze how much RAM they require.
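
A hedged sketch of how such a case could be profiled; the jar path, reference, and input names are placeholders rather than the pipeline's actual HaplotypeCaller invocation:

# GNU time reports "Maximum resident set size", i.e. the peak RAM the process actually used.
/usr/bin/time -v java -Xmx16g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller -R reference.fa -I failing_sample.bam -o failing_sample.vcf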

lsmainzer commented 6 years ago

This is already done. We are running with the same Java parameters as we always have with the Bash workflow on BW. Additionally, these same parameters have worked on other, non-Cray systems. The problem only arises when we scale up to lots of nodes. Smaller tests with fewer samples and nodes run fine on BW.

jacobrh91 commented 6 years ago

Hey @jmjwozniak,

So our team has been testing everything in my scratch space, as we were running out of space in our project directory. The data for the 600 samples is also stored in my /scratch space.

Although I copied those sh scripts into the Swift-T directory where you were editing them, they were originally from this location: /scratch/sciteam/jacobrh/purge_exempt/Swift_testing/601Nodes

And I just realized that, due to the nature of this data, you probably don't have the permissions necessary to test this 600-node run anyway.

@lsmainzer, would it be possible to get him added to the ILL_baib group? Otherwise, there is no location we could move everything to where he would have the permissions to test this run.

Thanks, Jacob

lsmainzer commented 6 years ago

Hi, Jacob, Justin!

Justin is already a member of baib, but only Jacob and I are approved by the UIUC IRB to touch the ADSP data. There are two ways to proceed: 1) Justin makes edits and Jacob runs the workflow on the ADSP data, or 2) we create 600 copies of a small synthetic dataset and Justin tests on those.

Let me know which you would prefer. -Luda

lsmainzer commented 6 years ago

On reflection, we really do not have a lot of node-hours in that allocation, and we should be careful with debugging at this point.

I will go ahead and create the 600 copies of the small data and the respective sample information sheet; Justin can debug with that, and Jacob can do the last run with the ADSP data once it has all been proven to work.