Original comment by zack...@gmail.com
on 25 Jul 2012 at 7:14
Hi Zayed,
I'm running NovoAlign on our bisulfite data and experiencing some weirdness. On
certain lanes, it seems to run forever, CPU shows full load on all cores, so it
must be doing something:
# License file:
/home/uec-00/shared/production/software/novoalign/default/novoalign.lic
# Licensed to Norris_Cancer_Center_USC_
# novoalign -d
/home/uec-00/shared/production/genomes/hg19_rCRSchrm/hg19_rCRSchrm.novoindex.bis
-f s_6_sequence.1.nocontam.txt -F ILMFQ -b 2 -o SAM -k -t 240 -H
# Starting at Mon Jul 9 16:04:14 2012
# Interpreting input files as Illumina FASTQ, Casava Pipeline 1.3 to 1.7.
# Index Build Version: 2.8
# Methylation Mode ON - For bisulphite treated sequences.
# Hash length: 17
# Step size: 2
=>> PBS: job killed: walltime 720004 exceeded limit 720000
(after 1 week of running job was automatically killed)
resource consumption was as follows:
cput=2395:14:30,mem=22166744kb,vmem=26463040kb,walltime=200:00:05
(cput is cumulative across all 12 cores, so nearly 100% load)
compare with another lane:
/home/uec-00/shared/production/software/novoalign/default/novoalign -d
/home/uec-00/shared/production/genomes/hg19_rCRSchrm/hg19_rCRSchrm.novoindex.bis
-f s_1_1_sequence.1.txt s_1_2_sequence.1.txt -F ILMFQ -b 2 -o SAM -k -t 240 -H
> s_1_1_sequence.1.txt.bam.sam
# novoalign (V2.08.01 - Build Mar 14 2012 @ 10:25:47 - A short read aligner
with qualities.
# (C) 2008,2009,2010,2011 NovoCraft Technologies Sdn Bhd.
# License file:
/home/uec-00/shared/production/software/novoalign/default/novoalign.lic
# Licensed to Norris_Cancer_Center_USC_
# novoalign -d
/home/uec-00/shared/production/genomes/hg19_rCRSchrm/hg19_rCRSchrm.novoindex.bis
-f s_1_1_sequence.1.txt s_1_2_sequence.1.txt -F ILMFQ -b 2 -o SAM -k -t 240 -H
# Starting at Mon Jul 9 18:16:28 2012
# Interpreting input files as Illumina FASTQ, Casava Pipeline 1.3 to 1.7.
# Index Build Version: 2.8
# Methylation Mode ON - For bisulphite treated sequences.
# Hash length: 17
# Step size: 2
# Paired Reads: 124883957
# Pairs Aligned: 111496890
# Read Sequences: 249767914
# Aligned: 232294769
# Unique Alignment: 220953099
# Gapped Alignment: 11143484
# Quality Filter: 1908353
# Homopolymer Filter: 127093
# Elapsed Time: 270726.531 (sec.)
# CPU Time: 53803.8 (min.)
which finished in about 75 hours using a 12 core machine.
My second question is: can we bring the 75 hours down? Also, my node class of 12-core, 24GB-memory machines is in short supply; is there a way to bring memory usage under 15GB, in which case I could use a lower-spec server?
I am aware of the MPI version, which would split the problem across multiple machines, but the problem I'm having is resource utilization, not a lack of parallelization. I need novoalign to consume far fewer resources given what's available to me.
thanks,
Zack Ramjan
USC Epigenome Center
Original comment by zack...@gmail.com
on 26 Jul 2012 at 6:34
Hi Zack,
Bisulphite mode is always a lot slower than normal mode, but it shouldn't take forever.
One problem might be your -t setting: 240 seems quite high for single-end reads. Your other run had the same -t 240 setting but was paired-end, and paired-end is also faster than single-end.
How long are the reads?
With a high threshold Novoalign will go up to 8 mismatches per read and 15bp indels, not counting C/T mismatches. That can be a lot for 100bp reads. Also, if there's a problem aligning reads, say contamination or adapter sequence on the reads, then Novoalign might be trying to align quite a lot of reads at -t 240, and Novoalign tends to slow down a lot when many reads are failing to align.
Have a look at the reads to see if there is adapter sequence; if so, you can trim it with the -a option.
Consider lowering the threshold to, say, -t 150 or similar. Perhaps test a few thousand reads at different -t values to see the effect on sensitivity.
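One way to set up such a spot check (a minimal sketch; `head_fastq` is a hypothetical helper, not part of Novoalign) is to pull the first few thousand records out of the FASTQ and align only that subset at each candidate -t:

```python
import itertools

def head_fastq(src_path, dst_path, n_reads):
    """Copy the first n_reads records (4 lines each) of a FASTQ file."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in itertools.islice(src, n_reads * 4):
            dst.write(line)
```

The resulting subset file can then be fed to novoalign repeatedly with different -t values to compare sensitivity and run time before committing to a full-lane run.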
If a large proportion of reads are not being mapped, I'd double-check that -b2 is the correct mode; maybe try -b 4 -u 8 and see what happens.
For the general case there are a few things that might help performance.
First would be reducing the -t threshold. This can cause the loss of a few alignments, and you may miss alignments near large deletions or clusters of SNPs, but this is often not important for Bi-Seq; if you are doing a lot of it I suggest experimenting.
You can also try different k & s settings for the index; this can change performance and memory use quite a bit. Here are some test results from another user, using 4000 reads on an insect genome:
Hash length (k)  Step size (s)  Elapsed Time (sec.)  CPU Time (min.)
15               2              30.846               9.0
16               1              24.690               5.2
16               2              22.576               6.2
16               3              28.485               9.0
17               1              26.782               4.6
17               2              18.736               5.4
17               3              20.504               6.8
18               1              15.595               4.1
19               1               7.78                3.5
The 18/1 and 19/1 indexes reduced run time quite dramatically, but those indexes are large. Lower k and higher s reduce index size, and in the above tests a 16/2 index wasn't much slower than a 17/2. These tests were done on 4K reads; I think they should be repeated on 100K reads.
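The trade-off in the table can be summarized with a quick calculation (a sketch using the timings above verbatim; `speedup_vs` is a hypothetical helper, and 17/2 is used as the baseline since it matches the indexes in this thread):

```python
# (k, s) -> (elapsed seconds, CPU minutes), copied from the test table above
timings = {
    (15, 2): (30.846, 9.0),
    (16, 1): (24.690, 5.2),
    (16, 2): (22.576, 6.2),
    (16, 3): (28.485, 9.0),
    (17, 1): (26.782, 4.6),
    (17, 2): (18.736, 5.4),
    (17, 3): (20.504, 6.8),
    (18, 1): (15.595, 4.1),
    (19, 1): (7.78, 3.5),
}

def speedup_vs(baseline, timings):
    """Elapsed-time speedup of every index relative to a baseline (k, s)."""
    base_elapsed = timings[baseline][0]
    return {ks: base_elapsed / t[0] for ks, t in timings.items()}

ratios = speedup_vs((17, 2), timings)
fastest = max(ratios, key=ratios.get)  # 19/1 in this test set
```

On these numbers the 19/1 index is over twice as fast as 17/2 in elapsed time, which quantifies the "quite dramatically" claim above; whether that holds at 100K reads still needs checking.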
22G for a job using a 10G index is very high; usually we get by with 2-3G more than the index. It's possible that some reads hit very high copy number repeats, producing millions of alignment locations. Adding the -e option can help stop this, and it may matter more for single-end reads. Try -e 10000: this basically stops the alignment process for a read once 10,000 alignments with the same alignment score have been found.
If this doesn't help then I'd be happy to look at reads.
Kind Regards,
Colin
Original comment by zack...@gmail.com
on 27 Jul 2012 at 6:27
Currently rerunning using Colin's suggestions.
Original comment by zack...@gmail.com
on 27 Jul 2012 at 8:15
Zack , could you try pulling out a small FASTQ subsection of the "fast" and
"slow" samples, to see if we can replicate difference using a smaller file for
testing (and maybe we can send to Colin). I definitely think we should use
their adapter trimming as suggested, I didn't realize we weren't.
Original comment by benb...@gmail.com
on 27 Jul 2012 at 8:18
Ben: yes, small subsets are being run in research/zack. Adapter trimming is now on.
Original comment by zack...@gmail.com
on 27 Jul 2012 at 8:31
Yaping: I just ran tests with different -t thresholds on a 1M-read single-end sample, and I believe this is the major problem. The appropriate -t depends mostly on the combination of read length and genome size: novoalign treats 100bp PE as a 200bp read length, while the problematic lane's reads are 100bp SE, which are treated as 100bp reads. -t 240 is too high for 100SE reads according to their -t calculation formula: 3*(read_length/2-20).
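Plugging numbers into that quoted formula (a minimal sketch; `recommended_t` is a hypothetical helper, and the pair-doubling behaviour is as described above) reproduces both values seen in this thread, 90 for 100SE and 240 for 100PE:

```python
def recommended_t(read_length, paired=False):
    """Threshold from the formula quoted above: 3*(read_length/2 - 20).

    Novoalign treats a pair as one read of combined length, so a 100bp
    paired-end run behaves like a single 200bp read.
    """
    effective = read_length * 2 if paired else read_length
    return int(3 * (effective / 2 - 20))
```

So a -t 240 that is reasonable for 100PE is far above the formula's 90 for 100SE, which is consistent with the slow single-end lane.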
I tried different index k&s parameters when building the genome index before, then tested alignment on 1M paired-end reads; the time cost did not change much.
Original comment by lyping1...@gmail.com
on 27 Jul 2012 at 11:33
Hi Zack,
here is my test using novoalign on 1M single-end reads (100 SE) at different -t thresholds. It looks like once -t goes above 100, the CPU time cost increases exponentially, but the read mapping percentage barely increases. Maybe we should use -t 90 as the criterion for 100 SE reads. Do we have 50SE/PE bisulfite-seq reads? For different read lengths we may need to run the test separately.
t-threshold  mapping_percentage  CPU-time_cost (min)
30           83.07%              64
50           85.59%              83
70           87.71%              101
90           88.69%              148
100          88.90%              188
120          89.39%              306
150          89.82%              540
170          90.02%              698
190          90.22%              858
220          NA                  not finished
240          NA                  not finished
For the plot, the x-axis is the -t threshold and the y-axis is CPU time (minutes) or read mapping percentage.
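The diminishing returns are easy to quantify from the table (a sketch over the finished runs only, using the table's numbers verbatim; `marginal_cost` is a hypothetical helper):

```python
# (-t, mapped %, CPU minutes) from the table above; unfinished runs omitted
results = [
    (30, 83.07, 64),
    (50, 85.59, 83),
    (70, 87.71, 101),
    (90, 88.69, 148),
    (100, 88.90, 188),
    (120, 89.39, 306),
    (150, 89.82, 540),
    (170, 90.02, 698),
    (190, 90.22, 858),
]

def marginal_cost(results):
    """CPU minutes spent per extra percentage point mapped, step by step."""
    out = []
    for (_, m0, c0), (t1, m1, c1) in zip(results, results[1:]):
        out.append((t1, (c1 - c0) / (m1 - m0)))
    return out
```

On these numbers the step to -t 50 costs under 10 CPU minutes per extra percent mapped, while the step to -t 190 costs roughly a hundred times that, supporting -t 90 as a sensible cut-off for 100 SE reads.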
Original comment by lyping1...@gmail.com
on 28 Jul 2012 at 2:04
Attachments:
This is helpful, Yaping; wow, the cost gets very expensive. So in my perl wrapper I will probably have to set -t dynamically based on SE/PE and cycle count. It would have been nice if they set -t dynamically using their own recommended formula, but we can hack it into the perl script.
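A dynamic -t in a wrapper could look like this (a Python sketch rather than the actual perl wrapper; `novoalign_args` is a hypothetical helper, the surrounding flags mirror the commands earlier in the thread, and the formula is the one Yaping quoted):

```python
def novoalign_args(index, fastqs, read_length, paired):
    """Build a novoalign argument list with -t computed from the formula
    quoted earlier in the thread, 3*(read_length/2 - 20), doubling the
    effective length for paired-end runs."""
    effective = read_length * 2 if paired else read_length
    t = int(3 * (effective / 2 - 20))
    return ["novoalign", "-d", index, "-f", *fastqs,
            "-F", "ILMFQ", "-b", "2", "-o", "SAM", "-k", "-H",
            "-t", str(t)]
```

For example, a 100SE lane would get -t 90 and a 100PE lane -t 240, instead of a fixed -t for both.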
Original comment by zack...@gmail.com
on 30 Jul 2012 at 6:12
B02WJABXX Lane 7
-t 150
101770400 + 0 in total (QC-passed reads + QC-failed reads)
4706708 + 0 duplicates
85804027 + 0 mapped (84.31%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
cput=359:36:05,mem=18804824kb,vmem=21492556kb,walltime=31:26:01
-t 240
101770400 + 0 in total (QC-passed reads + QC-failed reads)
6130659 + 0 duplicates
87646145 + 0 mapped (86.12%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
cput=710:04:06,mem=19180536kb,vmem=21211824kb,walltime=60:48:25
Good news: for lane 7, walltime is cut down to 30h from 60h.
Bad news: lane 6 never finished with -t 240 (auto-killed at 200h); with -t 150 it is still running after 70h.
Original comment by zack...@gmail.com
on 30 Jul 2012 at 6:25
with bsmap:
-t 150
101770400 + 0 in total (QC-passed reads + QC-failed reads)
4706708 + 0 duplicates
85804027 + 0 mapped (84.31%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
cput=359:36:05,mem=18804824kb,vmem=21492556kb,walltime=31:26:01
-t 240
101770400 + 0 in total (QC-passed reads + QC-failed reads)
6130659 + 0 duplicates
87646145 + 0 mapped (86.12%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
cput=710:04:06,mem=19180536kb,vmem=21211824kb,walltime=60:48:25
bsmap
101506814 + 263586 in total (QC-passed reads + QC-failed reads)
5774871 + 0 duplicates
96636932 + 0 mapped (95.20%:0.00%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
cput=68:38:26,mem=14346176kb,vmem=15409440kb,walltime=07:58:35
Original comment by zack...@gmail.com
on 30 Jul 2012 at 6:45
Example data has been made available to them.
Original comment by zack...@gmail.com
on 8 Aug 2012 at 8:47
Has there been a response from Novocraft?
Original comment by benb...@gmail.com
on 28 Aug 2012 at 8:42
Yes, I've been talking to them about the issue and how to fix it. This morning,
I was sent an internal release (binary executable) that contains fixes. If all
goes well, these changes will make it into the official version.
Original comment by zack...@gmail.com
on 28 Aug 2012 at 9:13
Original comment by zack...@gmail.com
on 6 Nov 2012 at 10:37
Issue 284 has been merged into this issue.
Original comment by zack...@gmail.com
on 7 Nov 2012 at 7:55
This seems to be a dead topic. We continue to build upon bsmap and never managed to get novoalign running fast enough. Closing.
Original comment by zack...@gmail.com
on 3 Apr 2013 at 11:30
Original issue reported on code.google.com by
zack...@gmail.com
on 25 Jul 2012 at 6:13