sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net

Ray on HiSeq-2500-NA12878-demo-2x150 using titan.ccs.ornl.gov #197

Closed sebhtml closed 10 years ago

sebhtml commented 11 years ago

# Batch script to carry out computation on retrieved data
# PBS directives
#PBS -N HiSeq-2500-NA12878-demo-2x150-3
#PBS -l walltime=12:00:00 
#PBS -l nodes=313
#PBS -A LSC005
#PBS -l gres=widow1

# Launch executable

cd $PBS_O_WORKDIR

#module load PrgEnv-pgi/4.1.40
#pgi/12.10.0
#module load cray-mpich2/5.6.3

#module load lsc005/Ray/2.2.0-1

#/tmp/proj/lsc005/software/lsc005/Ray/2.2.0-1/bin/Ray \

aprun -n 5008 \
./software/lsc005/Ray/2.3.0-devel-3dd4ef5304c-1/bin/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-3 \
sebhtml commented 11 years ago

Sequences

sebhtml@titan-ext1:~/lsc005/projects/human-1-hour> cat HiSeq-2500-NA12878-demo-2x150-3/FilePartition.txt 
#File   Name    FirstSequence   LastSequence    NumberOfSequences
0   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_001.fastq.gz    0   143818692   143818693
1   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_001.fastq.gz    143818693   287637385   143818693
2   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_002.fastq.gz    287637386   437610805   149973420
3   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_002.fastq.gz    437610806   587584225   149973420
4   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_001.fastq.gz    587584226   731879531   144295306
5   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_001.fastq.gz    731879532   876174837   144295306
6   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_002.fastq.gz    876174838   1023766068  147591231
7   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_002.fastq.gz    1023766069  1171357299  147591231
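The partition table can be sanity-checked with a small sketch. The triples below are copied from the FilePartition.txt output above; the function name `check_partitions` is illustrative and not part of Ray.

```python
# (FirstSequence, LastSequence, NumberOfSequences) copied from FilePartition.txt above.
partitions = [
    (0, 143818692, 143818693),
    (143818693, 287637385, 143818693),
    (287637386, 437610805, 149973420),
    (437610806, 587584225, 149973420),
    (587584226, 731879531, 144295306),
    (731879532, 876174837, 144295306),
    (876174838, 1023766068, 147591231),
    (1023766069, 1171357299, 147591231),
]

def check_partitions(rows):
    """Return the total sequence count if the ranges are contiguous and consistent."""
    expected_first = 0
    for first, last, count in rows:
        if first != expected_first or last - first + 1 != count:
            raise ValueError(f"inconsistent row: {(first, last, count)}")
        expected_first = last + 1
    return expected_first

total_sequences = check_partitions(partitions)
print(total_sequences)
```

The eight files are paired (R1/R2), which is why the counts come in equal pairs.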
sebhtml commented 11 years ago

Each titan node has: ( https://www.olcf.ornl.gov/support/system-user-guides/titan-user-guide/ )

16 cores 32 GiB ram

NVIDIA KEPLER

313 * 16 = 5008 MPI ranks

32 GiB / 16 cores = 2 GiB per core, that is, 2 GiB per MPI rank.
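The per-node arithmetic can be written out as a short sketch. The constants are the figures quoted in this comment; the variable names are illustrative only.

```python
# Node layout quoted above: 313 nodes, 16 cores and 32 GiB of RAM per node.
NODES = 313
CORES_PER_NODE = 16
RAM_PER_NODE_GIB = 32

# One MPI rank per core.
mpi_ranks = NODES * CORES_PER_NODE
ram_per_rank_gib = RAM_PER_NODE_GIB / CORES_PER_NODE

print(f"{mpi_ranks} MPI ranks, {ram_per_rank_gib:g} GiB per rank")
```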

sebhtml commented 11 years ago

Latency is very high:

sebhtml@titan-ext1:~/lsc005/projects/human-1-hour> head HiSeq-2500-NA12878-demo-2x150-3/NetworkTest.txt 
# average and mode round trip latency in microseconds (10^-6 seconds) when requesting a reply for a message of 4000 bytes
# MessagePassingInterfaceRank   Name    ModeLatencyInMicroseconds   AverageLatencyInMicroseconds    NumberOfExchanges
# AverageForAllRanks: 299.679
# StandardDeviation: 31.685
0   nid12147    30  116 1000
1   nid12147    34  313 1000
2   nid12147    148 279 1000
3   nid12147    184 298 1000
4   nid12147    28  287 1000
5   nid12147    30  301 1000
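A minimal sketch for summarizing a NetworkTest.txt file, assuming the five-column layout shown in the header above. Only the six rows printed here are used, so the numbers it prints describe this excerpt, not the whole job.

```python
import statistics

# The six data rows shown above: rank, node, mode latency, average latency, exchanges.
network_test = """\
0   nid12147    30  116 1000
1   nid12147    34  313 1000
2   nid12147    148 279 1000
3   nid12147    184 298 1000
4   nid12147    28  287 1000
5   nid12147    30  301 1000
"""

def parse_rows(text):
    """Yield (rank, node, mode_us, average_us) for each non-comment line."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rank, node, mode_us, average_us, _exchanges = line.split()
        yield int(rank), node, int(mode_us), int(average_us)

rows = list(parse_rows(network_test))
mean_of_averages = statistics.mean(avg for _, _, _, avg in rows)
median_of_modes = statistics.median(mode for _, _, mode, _ in rows)
print(mean_of_averages, median_of_modes)
```

Comparing the mode with the average per rank is a quick way to see whether a few slow exchanges dominate the average.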
sebhtml commented 11 years ago

memory usage is at 3 GiB+ when Ray starts (?)

sebhtml@titan-ext1:~/lsc005/projects/human-1-hour> grep memory  HiSeq-2500-NA12878-demo-2x150-3.o1732882|head
Rank 77: assembler memory usage: 3251836 KiB
Rank 78: assembler memory usage: 3251836 KiB
Rank 77: assembler memory usage: 3317568 KiB
Rank 78: assembler memory usage: 3317568 KiB
Rank 63: assembler memory usage: 3251836 KiB
Rank 51: assembler memory usage: 3251836 KiB
Rank 3861: assembler memory usage: 3251836 KiB
Rank 1645: assembler memory usage: 3251836 KiB
Rank 1639: assembler memory usage: 3251836 KiB
Rank 51: assembler memory usage: 3317568 KiB
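To compare these figures with the 2 GiB per rank computed earlier, the log lines can be converted to GiB. This is a sketch that assumes the "assembler memory usage" line format shown above; the 2 GiB budget is the 32 GiB / 16 ranks figure from this thread.

```python
import re

# Three of the lines printed above.
log_lines = """\
Rank 77: assembler memory usage: 3251836 KiB
Rank 78: assembler memory usage: 3251836 KiB
Rank 77: assembler memory usage: 3317568 KiB
"""

PATTERN = re.compile(r"Rank (\d+): assembler memory usage: (\d+) KiB")
BUDGET_GIB = 2.0  # 32 GiB per node / 16 ranks per node

def usage_gib(text):
    """Return a list of (rank, GiB) pairs parsed from the job output."""
    return [(int(rank), int(kib) / 1024**2) for rank, kib in PATTERN.findall(text)]

usages = usage_gib(log_lines)
for rank, gib in usages:
    print(f"rank {rank}: {gib:.2f} GiB (budget {BUDGET_GIB:g} GiB)")
```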
sebhtml commented 11 years ago

The 5008 MPI ranks are spread over 313 distinct nodes, so every machine has 16 MPI ranks:

sebhtml@titan-ext1:~/lsc005/projects/human-1-hour> grep -v ^# HiSeq-2500-NA12878-demo-2x150-3/NetworkTest.txt | awk '{print $2}'|sort|uniq -c|wc -l
313
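The pipeline above only counts distinct node names. A sketch that also reports the number of ranks on each node is shown below; the sample input is made up for illustration, and the node name `nid00001` is hypothetical.

```python
from collections import Counter

def ranks_per_node(text):
    """Count MPI ranks per node name (second column), skipping comment lines."""
    names = (line.split()[1] for line in text.splitlines()
             if line.strip() and not line.startswith("#"))
    return Counter(names)

# Tiny made-up sample in the NetworkTest.txt layout.
sample = """\
# MessagePassingInterfaceRank   Name    ModeLatencyInMicroseconds
0   nid12147    30  116 1000
1   nid12147    34  313 1000
2   nid00001    31  290 1000
"""

counts = ranks_per_node(sample)
print(len(counts), "nodes;", dict(counts))
```

On the real file, `len(counts)` would correspond to the 313 printed above, and `set(counts.values())` would show whether every node really carries the same number of ranks.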
sebhtml commented 11 years ago

error messages:


MPICH2 ERROR [Rank 1227] [job id 3577704] [Mon Sep 16 20:34:24 2013] [c19-4c0s2n1] [nid12091] - MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.27853.kvs_3577704, err Cannot allocate memory
MPICH2 ERROR [Rank 1227] [job id 3577704] [Mon Sep 16 20:34:24 2013] [c19-4c0s2n1] [nid12091] - MPIU_nem_gni_get_hugepages(): large page stats: free 0 nr 158 nr_overcommit 16154 resv 0 surplus 158
MPICH2 ERROR [Rank 1230] [job id 3577704] [Mon Sep 16 20:34:24 2013] [c19-4c0s2n1] [nid12091] - MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.27856.kvs_3577704, err Cannot allocate memory
MPICH2 ERROR [Rank 1230] [job id 3577704] [Mon Sep 16 20:34:24 2013] [c19-4c0s2n1] [nid12091] - MPIU_nem_gni_get_hugepages(): large page stats: free 0 nr 165 nr_overcommit 16154 resv 0 surplus 165
MPICH2 ERROR [Rank 4378] [job id 3577704] [Mon Sep 16 20:34:24 2013] [c0-2c1s6n0] [nid00114] - MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.24160.kvs_3577704, err Cannot allocate memory
MPICH2 ERROR [Rank 4378] [job id 3577704] [Mon Sep 16 20:34:24 2013] [c0-2c1s6n0] [nid00114] - MPIU_nem_gni_get_hugepages(): large page stats: free 0 nr 173 nr_overcommit 16154 resv 0 surplus 173
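The numbers in these messages can be decoded with a short sketch. It assumes that the page size is the 2097152 bytes embedded in the hugetlbfs path, and that `nr` is the number of huge pages currently in the pool (an assumption based on the kernel's `nr_hugepages` naming, not confirmed by this log).

```python
import re

HUGEPAGE_BYTES = 2097152  # from "pagesize-2097152" in the hugetlbfs path

message = ("MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file "
           "/var/lib/hugetlbfs/global/pagesize-2097152/"
           "hugepagefile.MPICH.2.27853.kvs_3577704, err Cannot allocate memory")
stats = "large page stats: free 0 nr 158 nr_overcommit 16154 resv 0 surplus 158"

# Size of the failed request, expressed in huge pages.
requested = int(re.search(r"Unable to mmap (\d+) bytes", message).group(1))
pages_requested = requested // HUGEPAGE_BYTES

# Turn "free 0 nr 158 ..." into {"free": 0, "nr": 158, ...}.
fields = stats.split(":", 1)[1].split()
page_stats = {key: int(value) for key, value in zip(fields[::2], fields[1::2])}

pool_gib = page_stats["nr"] * HUGEPAGE_BYTES / 1024**3
print(pages_requested, page_stats, f"{pool_gib:.2f} GiB in pool")
```

Under these assumptions, each failure is a request for only a few huge pages at a moment when none are free.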
sebhtml commented 11 years ago

To do: report memory info at tick = 0 and add VmRSS to the -debug output.

sebhtml commented 11 years ago

Add all of these on Linux:

VmPeak: 108964 kB
VmSize: 108960 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 872 kB
VmRSS: 872 kB
VmData: 196 kB
VmStk: 140 kB
VmExe: 132 kB
VmLib: 1992 kB
VmPTE: 60 kB
VmSwap: 0 kB
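These fields come from /proc/<pid>/status on Linux. The sketch below shows one way to extract them; it is written in Python for illustration only and is not how Ray (a C++ program) implements it. The function name `parse_vm_fields` is hypothetical.

```python
def parse_vm_fields(status_text):
    """Return {"VmRSS": 872, ...} in kB from /proc/<pid>/status content."""
    fields = {}
    for line in status_text.splitlines():
        name, _, value = line.partition(":")
        if name.startswith("Vm"):
            fields[name] = int(value.split()[0])
    return fields

# A short excerpt in the same format as the real file.
sample = "Name:\tcat\nVmPeak:\t  108964 kB\nVmHWM:\t     872 kB\nVmRSS:\t     872 kB\n"
vm = parse_vm_fields(sample)
print(vm)

# On Linux, the live values for the current process could be read with:
#   parse_vm_fields(open("/proc/self/status").read())
```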

sebhtml commented 11 years ago

To build it:

module purge
module load PrgEnv-intel/4.1.40
module load cray-mpich2/5.6.3
make MPICXX=CC CXXFLAGS="-xHOST -O3 -static" -j 4 HAVE_LIBZ=y clean
make MPICXX=CC CXXFLAGS="-xHOST -O3 -static" -j 4 HAVE_LIBZ=y

sebhtml commented 11 years ago

iteration 4:

sebhtml@titan-ext3:/tmp/proj/lsc005/projects/human-1-hour> cat HiSeq-2500-NA12878-demo-2x150-4.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-4
#PBS -l walltime=3:00:00 
#PBS -l nodes=313
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 313 * 8 * 2 = 5008

aprun -n 5008 -S 8 \
./software/lsc005/Ray/c610ae8670e1627bc41a64bbde18ac8f658b131f-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-4 \

sebhtml@titan-ext3:/tmp/proj/lsc005/projects/human-1-hour> qsub HiSeq-2500-NA12878-demo-2x150-4.sh
1742436

sebhtml@titan-ext3:/tmp/proj/lsc005/projects/human-1-hour> showq | grep 1742436
1742436             sebhtml       Idle  5008     3:00:00  Thu Sep 26 15:53:42
sebhtml commented 11 years ago

Needs -debug:

sebhtml@titan-ext3:/tmp/proj/lsc005/projects/human-1-hour> cat HiSeq-2500-NA12878-demo-2x150-4.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-4
#PBS -l walltime=3:00:00 
#PBS -l nodes=313
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 313 * 8 * 2 = 5008

aprun -n 5008 -S 8 \
./software/lsc005/Ray/c610ae8670e1627bc41a64bbde18ac8f658b131f-1/Ray \
-debug \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-4 \

sebhtml@titan-ext3:/tmp/proj/lsc005/projects/human-1-hour> qsub HiSeq-2500-NA12878-demo-2x150-4.sh
1742711
sebhtml commented 11 years ago

iteration 5:

Carlos P. Sosa told me to use aprun -n 5008 -N 16

titan> cat HiSeq-2500-NA12878-demo-2x150-5.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-5
#PBS -l walltime=12:00:00 
#PBS -l nodes=313
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 313 * 8 * 2 = 5008

aprun -n 5008 -N 16 \
./software/lsc005/Ray/c610ae8670e1627bc41a64bbde18ac8f658b131f-1/Ray \
-debug \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-5 \

titan> qsub HiSeq-2500-NA12878-demo-2x150-5.sh
1747928
sebhtml commented 10 years ago

With -debug:

titan> cat HiSeq-2500-NA12878-demo-2x150-7.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-7
#PBS -l walltime=12:00:00 
#PBS -l nodes=313
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 313 * 8 * 2 = 5008
#-debug \

aprun -n 5008 -N 16 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-debug \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-7 \

titan> qsub HiSeq-2500-NA12878-demo-2x150-7.sh
1763643
sebhtml commented 10 years ago

In 4 hours, Ray loads the data, builds the graph, computes the libraries and traverses the graph.

titan> cat HiSeq-2500-NA12878-demo-2x150-8/ElapsedTime.txt 
#Step   Date    Elapsed time    Since Beginning
Network testing 2013-10-26T22:03:57 12 seconds  12 seconds
Counting sequences to assemble  2013-10-26T22:20:12 16 minutes, 15 seconds  16 minutes, 27 seconds
Sequence loading    2013-10-26T23:28:44 1 hours, 8 minutes, 32 seconds  1 hours, 24 minutes, 59 seconds
K-mer counting  2013-10-26T23:34:33 5 minutes, 49 seconds   1 hours, 30 minutes, 48 seconds
Coverage distribution analysis  2013-10-26T23:34:40 7 seconds   1 hours, 30 minutes, 55 seconds
Graph construction  2013-10-26T23:44:11 9 minutes, 31 seconds   1 hours, 40 minutes, 26 seconds
Null edge purging   2013-10-26T23:46:01 1 minutes, 50 seconds   1 hours, 42 minutes, 16 seconds
Selection of optimal read markers   2013-10-27T00:03:12 17 minutes, 11 seconds  1 hours, 59 minutes, 27 seconds
Detection of assembly seeds 2013-10-27T00:09:52 6 minutes, 40 seconds   2 hours, 6 minutes, 7 seconds
Estimation of outer distances for paired reads  2013-10-27T00:11:58 2 minutes, 6 seconds    2 hours, 8 minutes, 13 seconds
Bidirectional extension of seeds    2013-10-27T02:07:27 1 hours, 55 minutes, 29 seconds 4 hours, 3 minutes, 42 seconds
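To see where the time goes, the "Elapsed time" column can be converted to seconds. This is a sketch with the step names and durations copied from the table above; `to_seconds` is an illustrative helper, not part of Ray.

```python
import re

UNIT_SECONDS = {"hours": 3600, "minutes": 60, "seconds": 1}

def to_seconds(duration):
    """Convert e.g. '1 hours, 8 minutes, 32 seconds' to a number of seconds."""
    return sum(int(n) * UNIT_SECONDS[unit]
               for n, unit in re.findall(r"(\d+) (hours|minutes|seconds)", duration))

# (step, elapsed time) pairs copied from ElapsedTime.txt above.
steps = [
    ("Network testing", "12 seconds"),
    ("Counting sequences to assemble", "16 minutes, 15 seconds"),
    ("Sequence loading", "1 hours, 8 minutes, 32 seconds"),
    ("K-mer counting", "5 minutes, 49 seconds"),
    ("Coverage distribution analysis", "7 seconds"),
    ("Graph construction", "9 minutes, 31 seconds"),
    ("Null edge purging", "1 minutes, 50 seconds"),
    ("Selection of optimal read markers", "17 minutes, 11 seconds"),
    ("Detection of assembly seeds", "6 minutes, 40 seconds"),
    ("Estimation of outer distances for paired reads", "2 minutes, 6 seconds"),
    ("Bidirectional extension of seeds", "1 hours, 55 minutes, 29 seconds"),
]

total = sum(to_seconds(d) for _, d in steps)
slowest = sorted(steps, key=lambda s: to_seconds(s[1]), reverse=True)[:2]
for name, d in slowest:
    print(f"{name}: {100 * to_seconds(d) / total:.0f}% of {total} s")
```

The per-step times add up to the "Since Beginning" value of the last row (4 hours, 3 minutes, 42 seconds), with seed extension and sequence loading as the two largest steps.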

As expected, the merging must be improved.

titan> tail HiSeq-2500-NA12878-demo-2x150-8/NumberOfSequences.txt 
    FilePath: HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_002.fastq.gz
    NumberOfSequences: 147591231
    FirstSequence: 1023766069
    LastSequence: 1171357299

Summary
    NumberOfSequences: 1171357300
    FirstSequence: 0
    LastSequence: 1171357299

Let's do a bigger job now !!!

sebhtml commented 10 years ago

Script for the incomplete 4-hour run:

titan> cat HiSeq-2500-NA12878-demo-2x150-8.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-8
#PBS -l walltime=12:00:00 
#PBS -l nodes=313
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 313 * 8 * 2 = 5008
# 313 * 8 * 1 = 2504
#-debug \

aprun -n 2504 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-8 \
sebhtml commented 10 years ago

with 3750 nodes, I can run 24 hours !

https://www.olcf.ornl.gov/kb_articles/titan-scheduling-policy/

accounting: 3750 * 30 * 24 = 2700000 (we don't have enough for this)

we have 250000 for this fall

Let's try with 3750 nodes, 8 ranks per node, with 30000 ranks.
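The accounting can be sketched as follows, reading the accounting line as nodes times a charge factor of 30 per node-hour times hours. The factor 30 is taken as-is from that line; the variable names are illustrative.

```python
# Figures quoted above.
NODES = 3750
CHARGE_PER_NODE_HOUR = 30   # factor used in the accounting line
HOURS = 24
ALLOCATION = 250000         # "we have 250000 for this fall"
RANKS_PER_NODE = 8

cost = NODES * CHARGE_PER_NODE_HOUR * HOURS
ranks = NODES * RANKS_PER_NODE
affordable_hours = ALLOCATION / (NODES * CHARGE_PER_NODE_HOUR)

print(cost, ranks, f"{affordable_hours:.2f} h affordable at this size")
```

With this reading, a full 24-hour run at 3750 nodes costs more than ten times the allocation, so only a run of a couple of hours would fit.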

sebhtml commented 10 years ago
titan> cat HiSeq-2500-NA12878-demo-2x150-9.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-9
#PBS -l walltime=00:12:00:00 
#PBS -l nodes=626
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 626 * 8 = 5008

aprun -n 5008 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-9 \

titan> qsub HiSeq-2500-NA12878-demo-2x150-9.sh
1769459
sebhtml commented 10 years ago

job -9 vanished, that's strange:

titan> showq | grep boisv
titan> ls|grep HiSeq-2500-NA12878-demo-2x150-9
HiSeq-2500-NA12878-demo-2x150-9.sh

Let's resubmit as -10:

titan> cat HiSeq-2500-NA12878-demo-2x150-10.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-10
#PBS -l walltime=00:12:00:00
#PBS -l nodes=626
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 626 * 8 = 5008

aprun -n 5008 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-10 \

titan> qsub HiSeq-2500-NA12878-demo-2x150-10.sh
1778289

sebhtml commented 10 years ago

Waiting time:

Hi Jacques,

My jobs have been waiting in the queue for 18 and 8 days, respectively.

titan> showq | grep sebht
1769459 sebhtml Idle 10016 12:00:00 Mon Oct 28 14:19:19
1778289 sebhtml Idle 10016 12:00:00 Thu Nov 7 11:17:21

titan> checkjob 1769459|head
job 1769459

AName: HiSeq-2500-NA12878-demo-2x150-9
State: Idle
Creds: user:sebhtml group:sebhtml account:LSC005 class:batch qos:bin0
WallTime: 00:00:00 of 12:00:00
BecameEligible: Fri Nov 15 12:27:27
SubmitTime: Mon Oct 28 14:19:19
(Time Queued Total: 18:00:01:21 Eligible: 17:21:30:02)

titan> checkjob 1778289|head
job 1778289

AName: HiSeq-2500-NA12878-demo-2x150-10
State: Idle
Creds: user:sebhtml group:sebhtml account:LSC005 class:batch qos:bin0
WallTime: 00:00:00 of 12:00:00
BecameEligible: Fri Nov 15 12:27:27
SubmitTime: Thu Nov 7 11:17:21
(Time Queued Total: 8:02:03:34 Eligible: 7:23:39:24)

macmanes commented 10 years ago

Trillian (UNH Cray XE6, http://trillian-use.sr.unh.edu/index.php/Main_Page) does not like this make command:

make MPICXX=CC CXXFLAGS="-xHOST -O3 -static" -j 4 HAVE_LIBZ=y

It complains that -xHOST is an invalid command line flag.

Did you ever solve the latency issue? I see high latency here, too.

sebhtml commented 10 years ago

Which compiler are you using? I think -xHOST is an Intel compiler flag.

sebhtml commented 10 years ago

Update for jobs -9 and -10:

Hi Jacques,

Regarding titan:

My best shot so far:

"In 4 hours, Ray loads the data, builds the graph, computes the libraries and traverses the graph." (2013-10-28)

Last 2 jobs

However, my last 2 jobs both failed (I increased the number of cores and this highlighted the same problem in the caching subsystem of nodes).

Job: HiSeq-2500-NA12878-demo-2x150-9 # 1769459

titan> cat HiSeq-2500-NA12878-demo-2x150-9.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-9
#PBS -l walltime=00:12:00:00
#PBS -l nodes=626
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 626 * 8 = 5008

aprun -n 5008 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-9 \

Like last time, this is a problem with cached content in the VFS layer of Lustre.

MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.16799.kvs_3928360, err Cannot allocate memory

Job: HiSeq-2500-NA12878-demo-2x150-10 # 1778289

titan> cat HiSeq-2500-NA12878-demo-2x150-10.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-10
#PBS -l walltime=00:12:00:00
#PBS -l nodes=626
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 626 * 8 = 5008

aprun -n 5008 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-10 \

Same here:

MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.23067.kvs_3999472, err Cannot allocate memory

What support says about this

The ticket with ORNL people is "Re: [CCS #177295] MPICH on titan uses a lot of memory (?)".

The last response I got was from 2013-10-21:

Thanks Sebastien,

My gut tells me you're running out of memory per core. Hugepage is busting and the max size is 2GB.

MPIU_nem_gni_get_hugepages(): large page stats: free 0 nr 211 nr_overcommit 16154 resv 0 surplus 211

The network is just the one to complain about it, but not necessarily the cause.

Have you tried lowering the number of MPI processes to 8/node?

FF

(I am already at 8, I think the problem is buggy caching, not memory usage by Ray).

The issue is that cached pages in the VFS waste memory.

See below the /proc/meminfo:

[Rank 3758] Cat of /proc/meminfo
[Rank 3755]: MemTotal: 33084652 kB
[Rank 3755]: MemFree: 3984520 kB
[Rank 3755]: Buffers: 0 kB
[Rank 3755]: Cached: 22332700 kB **
[Rank 3755]: SwapCached: 0 kB
[Rank 3755]: Active: 12556068 kB
[Rank 3755]: Inactive: 12527116 kB
[Rank 3758]: MemTotal: 33084652 kB
[Rank 3755]: Active(anon): 2637848 kB
[Rank 3758]: MemFree: 3984892 kB
[Rank 3755]: Inactive(anon): 168920 kB
[Rank 3758]: Buffers: 0 kB
[Rank 3755]: Active(file): 9918220 kB
[Rank 3758]: Cached: 22332700 kB
[Rank 3755]: Inactive(file): 12358196 kB

That's roughly 22 gigabytes (Cached: 22332700 kB, about two thirds of the 33084652 kB MemTotal) held as cache by the operating system.
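The cache share can be recomputed from the /proc/meminfo values above. The sketch below uses a subset of the fields reported by rank 3755; `parse_meminfo` is an illustrative helper name.

```python
# Subset of the /proc/meminfo fields reported by rank 3755 above.
meminfo = """\
MemTotal: 33084652 kB
MemFree: 3984520 kB
Cached: 22332700 kB
Active(file): 9918220 kB
Inactive(file): 12358196 kB
"""

def parse_meminfo(text):
    """Return {"MemTotal": 33084652, ...} with values in kB."""
    values = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, rest = line.partition(":")
        values[key.strip()] = int(rest.split()[0])
    return values

info = parse_meminfo(meminfo)
cached_fraction = info["Cached"] / info["MemTotal"]
file_pages_kb = info["Active(file)"] + info["Inactive(file)"]
print(f"cached: {info['Cached'] / 1024**2:.1f} GiB ({cached_fraction:.0%} of MemTotal)")
print(f"file-backed pages: {file_pages_kb} kB")
```

The sum of Active(file) and Inactive(file) is close to the Cached value, which is consistent with the cache being file data rather than memory allocated by Ray.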

Ticket: https://github.com/sebhtml/ray/issues/197

Séb

sebhtml commented 10 years ago

Support said to try out the new storage:

https://www.olcf.ornl.gov/kb_articles/atlas-transition/

sebhtml commented 10 years ago

moving files to atlas.

titan> mv /tmp/proj/lsc005/* /lustre/atlas/proj-shared/lsc005/

sebhtml commented 10 years ago

job with atlas on titan:

titan> pwd
/lustre/atlas/proj-shared/lsc005/projects/human-1-hour
titan> cat HiSeq-2500-NA12878-demo-2x150-11.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-11
#PBS -l walltime=00:12:00:00
#PBS -l nodes=626
#PBS -A LSC005

cd $PBS_O_WORKDIR

# 626 * 8 = 5008

aprun -n 5008 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-11 \

titan> qsub HiSeq-2500-NA12878-demo-2x150-11.sh
1833464

titan> showq | grep 1833464
1833464 sebhtml Idle 10016 12:00:00 Mon Jan 6 11:43:18

I think this will start in about a month.

sebhtml commented 10 years ago

on titan: #228

sebhtml commented 10 years ago

-11 failed because of a faulty symlink...

sebhtml commented 10 years ago

Hi Jacques,

For my Titan job, it seems that it started after the decommissioning of Spider, which was on 27 Jan 2014 I think.

There was a faulty symbolic link, although my data was on Atlas.

titan> pwd
/ccs/home/sebhtml/lsc005-atlas/projects/human-1-hour
titan> cat HiSeq-2500-NA12878-demo-2x150-11.e1833464
aprun: file ./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray not found
aprun: Exiting due to errors. Application aborted

titan> readlink software
lsc005/software/
titan> readlink lsc005
/tmp/proj/lsc005
titan> file /tmp/proj/lsc005
/tmp/proj/lsc005: cannot open `/tmp/proj/lsc005' (No such file or directory)

titan> file ./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), for GNU/Linux 2.6.4, statically linked, not stripped
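A check like the one below, run before submitting, could catch a dangling link of this kind. It is a sketch that rebuilds a similar chain of links in a scratch directory; all paths in it are illustrative, not the real Titan paths.

```python
import os
import tempfile

def is_broken_symlink(path):
    """True if `path` is a symbolic link whose target does not exist."""
    return os.path.islink(path) and not os.path.exists(path)

# Reproduce the situation in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    gone = os.path.join(scratch, "old-filesystem")   # never created
    link = os.path.join(scratch, "lsc005")
    os.symlink(gone, link)                           # link to a vanished directory
    chained = os.path.join(scratch, "software")
    os.symlink(os.path.join(link, "software"), chained)  # link through the dead link

    results = {name: is_broken_symlink(os.path.join(scratch, name))
               for name in ("lsc005", "software")}

print(results)
```

`os.path.exists` follows symbolic links, so a link that goes through another dead link is reported as broken as well.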

sebhtml commented 10 years ago

-12

titan> vim HiSeq-2500-NA12878-demo-2x150-12.sh
titan> qsub HiSeq-2500-NA12878-demo-2x150-12.sh
1863329
titan> pwd
/ccs/home/sebhtml/lsc005/projects/human-1-hour

sebhtml commented 10 years ago

Corrupted files on Titan (Atlas FS):

7 out of 8 fastq files vanished (strange). This is what is left of it:

Files on Titan (there was a problem with the file system):

titan> ls -lh HiSeq-2500-NA12878-demo-2x150/*gz
-rw------- 1 sebhtml lsc005 684M 2013-12-23 12:30 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_001.fastq.gz

What it is supposed to be:

[boisver1@ip03-mp2 data]$ ls -lh HiSeq-2500-NA12878-demo-2x150/*gz
-rw-rwxr-- 1 boisver1 corbeil 18G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_001.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 19G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_002.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 19G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_001.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 19G Nov 22 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_002.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 18G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_001.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 18G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_002.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 19G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_001.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 19G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_002.fastq.gz

oh well...

sebhtml commented 10 years ago

Pulling from Sherbrooke to get the data again.

rsync -avzPL

/mnt/scratch_mp2/corbeil/corbeil_group/nne-790-ab/data/HiSeq-2500-NA12878-demo-2x150

sebhtml commented 10 years ago

Data on Titan:

titan> ls /lustre/atlas/proj-shared/lsc005/projects/human-1-hour/HiSeq-2500-NA12878-demo-2x150/ -lh
total 145G
-rw-rwxr-- 1 sebhtml sebhtml 946 2012-11-21 15:16 11
-rw-rwxr-- 1 sebhtml sebhtml 1009 2012-11-21 15:16 12
-rw-rwxr-- 1 sebhtml sebhtml 328 2012-11-22 18:44 Counts
-rw-rwxr-- 1 sebhtml sebhtml 291 2012-11-22 13:52 Get.sh
-rw-rwxr-- 1 sebhtml sebhtml 889 2012-11-20 00:38 RawFiles.txt
-rw-r--r-- 1 sebhtml sebhtml 14 2012-11-21 13:10 README
-rw-rwxr-- 1 sebhtml sebhtml 584 2012-11-22 14:06 sha1sum.txt
-rw-rwxr-- 1 sebhtml sebhtml 18G 2012-11-21 15:17 sorted_S1_L001_R1_001.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 523 2012-11-22 13:52 sorted_S1_L001_R1_001.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 19G 2012-11-21 16:44 sorted_S1_L001_R1_002.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 523 2012-11-22 13:52 sorted_S1_L001_R1_002.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 19G 2012-11-21 15:17 sorted_S1_L001_R2_001.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 602 2012-11-22 13:52 sorted_S1_L001_R2_001.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 19G 2012-11-22 11:22 sorted_S1_L001_R2_002.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 523 2012-11-22 13:52 sorted_S1_L001_R2_002.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 18G 2012-11-21 16:38 sorted_S1_L002_R1_001.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 602 2012-11-22 13:52 sorted_S1_L002_R1_001.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 18G 2012-11-21 16:07 sorted_S1_L002_R1_002.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 523 2012-11-22 13:52 sorted_S1_L002_R1_002.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 19G 2012-11-21 19:16 sorted_S1_L002_R2_001.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 523 2012-11-22 13:52 sorted_S1_L002_R2_001.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 19G 2012-11-21 15:29 sorted_S1_L002_R2_002.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 523 2012-11-22 13:52 sorted_S1_L002_R2_002.fastq.gz.log
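Since the directory contains a sha1sum.txt file, the transferred files could be verified against it to catch truncated or missing files. The sketch below assumes the manifest uses the usual `sha1sum` output format (hash, whitespace, file name), which this thread does not confirm; the demo files are throwaway examples.

```python
import hashlib
import os
import tempfile

def sha1_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that multi-gigabyte files do not have to fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(manifest_path):
    """Return the names of files that are missing or whose SHA-1 differs from the manifest."""
    base = os.path.dirname(manifest_path)
    bad = []
    with open(manifest_path) as manifest:
        for line in manifest:
            if not line.strip():
                continue
            expected, name = line.split(maxsplit=1)
            path = os.path.join(base, name.strip().lstrip("*"))
            if not os.path.exists(path) or sha1_of(path) != expected:
                bad.append(os.path.basename(path))
    return bad

# Self-contained demo: one intact file and one file that has vanished.
with tempfile.TemporaryDirectory() as scratch:
    with open(os.path.join(scratch, "reads.fastq.gz"), "wb") as f:
        f.write(b"ACGT\n")
    good_hash = hashlib.sha1(b"ACGT\n").hexdigest()
    with open(os.path.join(scratch, "sha1sum.txt"), "w") as f:
        f.write(f"{good_hash}  reads.fastq.gz\n")
        f.write(f"{good_hash}  vanished.fastq.gz\n")
    failures = verify(os.path.join(scratch, "sha1sum.txt"))

print(failures)
```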

sebhtml commented 10 years ago

new executable /lustre/atlas/proj-shared/lsc005/software/lsc005/Ray/53a80be6905565c7f791d069f9a1bf2e82ea8132-1/Ray

sebhtml commented 10 years ago

-13

titan> pwd
/ccs/home/sebhtml/lsc005/projects/human-1-hour
titan> cat HiSeq-2500-NA12878-demo-2x150-13.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-13
#PBS -l walltime=00:12:00:00
#PBS -l nodes=626
#PBS -A LSC005

cd $PBS_O_WORKDIR

# 626 * 8 = 5008

#./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \

aprun -n 5008 \
./software/lsc005/Ray/53a80be6905565c7f791d069f9a1bf2e82ea8132-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-13 \

titan> qsub HiSeq-2500-NA12878-demo-2x150-13.sh
1867708

titan> showq | grep sebhtml
1867708 sebhtml Idle 10016 12:00:00 Wed Feb 12 16:39:47

sebhtml commented 10 years ago

For job HiSeq-2500-NA12878-demo-2x150-13 (Atlas)

MPICH2 ERROR [Rank 4] [job id 4468454] [Wed Feb 12 19:16:49 2014] [c6-4c2s3n3] [nid02823] - MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.2794.kvs_4468454, err Cannot allocate memory

titan> grep Cached HiSeq-2500-NA12878-demo-2x150-13.e1867708|head -n1
[Rank 4]: Cached: 14777104 kB

sebhtml commented 10 years ago

This project is finished.