uec / Issue.Tracker

Automatically exported from code.google.com/p/usc-epigenome-center

contact novocraft and ask why slow #283

Closed by GoogleCodeExporter 8 years ago

GoogleCodeExporter commented 8 years ago
running for days on bis 

Original issue reported on code.google.com by zack...@gmail.com on 25 Jul 2012 at 6:13

GoogleCodeExporter commented 8 years ago

Original comment by zack...@gmail.com on 25 Jul 2012 at 7:14

GoogleCodeExporter commented 8 years ago
Hi Zayed, 

I'm running NovoAlign on our bisulfite data and experiencing some weirdness. On 
certain lanes it seems to run forever; CPU shows full load on all cores, so it 
must be doing something:

# License file: 
/home/uec-00/shared/production/software/novoalign/default/novoalign.lic
# Licensed to Norris_Cancer_Center_USC_
#  novoalign -d 
/home/uec-00/shared/production/genomes/hg19_rCRSchrm/hg19_rCRSchrm.novoindex.bis
 -f s_6_sequence.1.nocontam.txt -F ILMFQ -b 2 -o SAM -k -t 240 -H 
# Starting at Mon Jul  9 16:04:14 2012
# Interpreting input files as Illumina FASTQ, Casava Pipeline 1.3 to 1.7.
# Index Build Version: 2.8
# Methylation Mode ON - For bisulphite treated sequences.
# Hash length: 17
# Step size: 2
=>> PBS: job killed: walltime 720004 exceeded limit 720000
(after 1 week of running, the job was automatically killed)

resource consumption was as follows:

 cput=2395:14:30,mem=22166744kb,vmem=26463040kb,walltime=200:00:05

(cput is cumulative across all 12 cores, so nearly 100% load)

compare with another lane:

/home/uec-00/shared/production/software/novoalign/default/novoalign -d 
/home/uec-00/shared/production/genomes/hg19_rCRSchrm/hg19_rCRSchrm.novoindex.bis
 -f s_1_1_sequence.1.txt s_1_2_sequence.1.txt -F ILMFQ -b 2 -o SAM -k -t 240 -H 
> s_1_1_sequence.1.txt.bam.sam
# novoalign (V2.08.01 - Build Mar 14 2012 @ 10:25:47 - A short read aligner 
with qualities.
# (C) 2008,2009,2010,2011 NovoCraft Technologies Sdn Bhd.
# License file: 
/home/uec-00/shared/production/software/novoalign/default/novoalign.lic
# Licensed to Norris_Cancer_Center_USC_
#  novoalign -d 
/home/uec-00/shared/production/genomes/hg19_rCRSchrm/hg19_rCRSchrm.novoindex.bis
 -f s_1_1_sequence.1.txt s_1_2_sequence.1.txt -F ILMFQ -b 2 -o SAM -k -t 240 -H 
# Starting at Mon Jul  9 18:16:28 2012
# Interpreting input files as Illumina FASTQ, Casava Pipeline 1.3 to 1.7.
# Index Build Version: 2.8
# Methylation Mode ON - For bisulphite treated sequences.
# Hash length: 17
# Step size: 2
#       Paired Reads: 124883957
#      Pairs Aligned: 111496890
#     Read Sequences: 249767914
#            Aligned: 232294769
#   Unique Alignment: 220953099
#   Gapped Alignment: 11143484
#     Quality Filter:  1908353
# Homopolymer Filter:   127093
#       Elapsed Time: 270726.531 (sec.)
#           CPU Time: 53803.8 (min.)

which finished in about 75 hours using a 12 core machine.

A second question is: can we bring the 75 hours down? Also, my node class of 12-core, 24G-mem machines is in short supply; is there a way to bring the memory usage under 15G, in which case I could use a lower-spec server?

I am aware of the MPI version, which would split the problem across multiple 
machines, but the problem I'm having is resource utilization, not a lack of 
parallelization. I need novoalign to consume far fewer resources given what's 
available to me.

thanks,

Zack Ramjan
USC Epigenome Center

Original comment by zack...@gmail.com on 26 Jul 2012 at 6:34

GoogleCodeExporter commented 8 years ago
Hi Zack,

Bisulphite mode is always a lot slower than NA mode but it shouldn't take 
forever.

One problem might be your -t setting: 240 seems quite high for single-end 
reads. Your other run had the same -t 240 setting but was paired end, and 
paired end is also faster than single end.

How long are the reads? 

With a high threshold Novoalign will go up to 8 mismatches per read and 15bp 
indels, not counting C/T mismatches. This can be a lot for 100bp reads. Also, if 
there's a problem aligning reads, say contamination or maybe adapter 
sequence on the reads, then Novoalign might be trying to align at -t 240 on 
quite a lot of reads. Novoalign tends to slow down a lot if many reads are 
failing to align.

Have a look at the reads to see if there is adapter sequence; if so, you can 
trim it with the -a option.

Consider lowering the threshold to, say, -t 150 or similar. Perhaps test a few K 
reads at different -t settings to see the effect on sensitivity.
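
For concreteness, a minimal sketch of that test (not from the thread; the file 
names, the ~100K-read subset size, and the assumption that novoalign writes its 
per-run report to stderr are all illustrative):

# Subsample the slow lane (FASTQ = 4 lines per record) and sweep -t.
head -n 400000 s_6_sequence.1.nocontam.txt > subset.fq

for t in 90 120 150 180 240; do
    # Same flags as the production run; add -a here to test
    # adapter trimming as suggested above.
    novoalign -d hg19_rCRSchrm.novoindex.bis -f subset.fq \
        -F ILMFQ -b 2 -o SAM -k -t $t -H \
        > subset.t$t.sam 2> subset.t$t.log
done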

If a large proportion of reads are not being mapped, I'd double-check that -b 2 
is the correct mode; maybe try -b 4 -u 8 and see what happens.

For the general case there are a few things that might help performance. 
First would be reducing the -t threshold; this can cause the loss of a few 
alignments, and you may miss alignments near large deletions or clusters of 
SNPs. This is often not important for Bi-Seq, and if you are doing a lot I 
suggest experimenting.
You can also try different k & s settings for the index; this can change 
performance and memory usage quite a bit. Here are some test results from another user:
Using 4000 reads on an insect genome:

Hash length   Step size   Elapsed Time (sec.)   CPU Time (min.)
15            2           30.846                9.0
16            1           24.690                5.2
16            2           22.576                6.2
16            3           28.485                9.0
17            1           26.782                4.6
17            2           18.736                5.4
17            3           20.504                6.8
18            1           15.595                4.1
19            1           7.78                  3.5

The 18/1 & 19/1 indexes reduced run time quite dramatically, but those indexes 
are large. Lower k and higher s reduce index sizes, and in the above tests a 16/2 
index wasn't much slower than a 17/2. These tests were done on 4K reads; I think 
they need to be redone on 100K reads.
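
A minimal sketch of rebuilding the index with different k/s values. It assumes 
novoindex takes -k (hash length), -s (step size), and -b (bisulphite index) 
options, and the FASTA path is illustrative; check novoindex usage for your 
version:

# Hypothetical 16/2 bisulphite index build (flag names and paths assumed):
novoindex -b -k 16 -s 2 \
  hg19_rCRSchrm.k16s2.novoindex.bis hg19_rCRSchrm.fa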

22G for a job using a 10G index is very high; usually we get by with 2-3G more 
than the index. It's possible that some reads hit very-high-copy-number repeats, 
producing millions of alignment locations. Adding the -e option can help stop 
this, and it may be more important for single-end reads: try -e 10000, which 
basically stops the alignment process once 10,000 alignments with the 
same alignment score have been found.
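
Applied to the slow lane, that just means appending the flag to the existing 
command (paths shortened and output redirect assumed for illustration):

novoalign -d hg19_rCRSchrm.novoindex.bis \
  -f s_6_sequence.1.nocontam.txt -F ILMFQ -b 2 -o SAM -k -t 240 -H \
  -e 10000 > s_6_sequence.1.sam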

If this doesn't help then I'd be happy to look at reads.

Kind Regards,
Colin

Original comment by zack...@gmail.com on 27 Jul 2012 at 6:27

GoogleCodeExporter commented 8 years ago
Currently rerunning using Colin's suggestions.

Original comment by zack...@gmail.com on 27 Jul 2012 at 8:15

GoogleCodeExporter commented 8 years ago
Zack, could you try pulling out a small FASTQ subsection of the "fast" and 
"slow" samples, to see if we can replicate the difference using a smaller file 
for testing (and maybe we can send it to Colin)? I definitely think we should 
use their adapter trimming as suggested; I didn't realize we weren't.

Original comment by benb...@gmail.com on 27 Jul 2012 at 8:18

GoogleCodeExporter commented 8 years ago
Ben: yes, small subsets are being run in research/zack. Adapter trimming is now 
on.

Original comment by zack...@gmail.com on 27 Jul 2012 at 8:31

GoogleCodeExporter commented 8 years ago
Yaping: I just ran a 1M-read single-end sample at different -t thresholds. I 
believe this is the major problem: -t mostly depends on the combination of 
read length and genome size. Novoalign counts 100PE as 200bp of read length, 
while this time the problematic lane's reads are 100SE, which is treated as a 
100bp read. -t 240 is too high for 100SE reads according to their -t 
calculation formula: 3*(read_length/2-20).
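
Plugging both cases into that formula shows the gap:

3 * (read_length/2 - 20)
100PE (counted as 200bp): 3 * (200/2 - 20) = 3 * 80 = 240
100SE:                    3 * (100/2 - 20) = 3 * 30 = 90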

I tried different index k & s parameters when building the genome index before, 
then tested the alignment on 1M paired-end reads; it looks like the time cost 
didn't change much...

Original comment by lyping1...@gmail.com on 27 Jul 2012 at 11:33

GoogleCodeExporter commented 8 years ago
Hi Zack,
here is my test running novoalign on 1M single-end reads (100 SE) at different 
-t thresholds. It looks like once -t goes above 100, the CPU time cost increases 
exponentially, but the read mapping percentage does not increase much. Maybe we 
should use -t 90 as the criterion for 100 SE reads. Do we have 50SE/PE reads for 
bisulfite-seq? For different read lengths, we may need to run the test 
separately.

-t threshold   Mapping %   CPU time (min)
30             83.07%      64
50             85.59%      83
70             87.71%      101
90             88.69%      148
100            88.90%      188
120            89.39%      306
150            89.82%      540
170            90.02%      698
190            90.22%      858
220            NA          not finished
240            NA          not finished

In the attached plot, the x-axis is the -t threshold and the y-axis is CPU time 
(minutes) or read mapping percentage.

Original comment by lyping1...@gmail.com on 28 Jul 2012 at 2:04

Attachments:

GoogleCodeExporter commented 8 years ago
This is helpful, Yaping. Wow, the cost gets very expensive... So in my Perl 
wrapper, I will probably have to set -t dynamically based upon PE/SE and cycle 
count. It would have been nice if they set t dynamically using their own 
recommended formula, but we can hack it into the Perl script, as sketched below.
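
A minimal sketch of that logic (shown as shell for brevity; the same few lines 
would go into the Perl wrapper), using the 3*(read_length/2-20) formula from 
Yaping's comment with paired end counted as double the read length:

# Hypothetical wrapper logic: derive -t from cycles and SE/PE mode.
readlen=100   # cycles per read, from the lane config
paired=0      # 1 for paired end, 0 for single end
efflen=$(( paired ? readlen * 2 : readlen ))
t=$(( 3 * (efflen / 2 - 20) ))
echo "using novoalign -t $t"   # 90 for 100SE, 240 for 100PE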

Original comment by zack...@gmail.com on 30 Jul 2012 at 6:12

GoogleCodeExporter commented 8 years ago
B02WJABXX Lane 7

-t 150
101770400 + 0 in total (QC-passed reads + QC-failed reads)
4706708 + 0 duplicates
85804027 + 0 mapped (84.31%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
cput=359:36:05,mem=18804824kb,vmem=21492556kb,walltime=31:26:01

-t 240 
101770400 + 0 in total (QC-passed reads + QC-failed reads)
6130659 + 0 duplicates
87646145 + 0 mapped (86.12%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
cput=710:04:06,mem=19180536kb,vmem=21211824kb,walltime=60:48:25

Good news: for lane 7, walltime was cut down from ~60h to ~30h.

BAD NEWS: lane 6 never finished with -t 240 (autokilled @200h); with -t 150 it 
is still running after 70h.

Original comment by zack...@gmail.com on 30 Jul 2012 at 6:25

GoogleCodeExporter commented 8 years ago
With bsmap (novoalign -t 150 and -t 240 stats repeated from the previous comment for comparison):

-t 150
101770400 + 0 in total (QC-passed reads + QC-failed reads)
4706708 + 0 duplicates
85804027 + 0 mapped (84.31%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
cput=359:36:05,mem=18804824kb,vmem=21492556kb,walltime=31:26:01

-t 240 
101770400 + 0 in total (QC-passed reads + QC-failed reads)
6130659 + 0 duplicates
87646145 + 0 mapped (86.12%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
cput=710:04:06,mem=19180536kb,vmem=21211824kb,walltime=60:48:25

bsmap
101506814 + 263586 in total (QC-passed reads + QC-failed reads)
5774871 + 0 duplicates
96636932 + 0 mapped (95.20%:0.00%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
cput=68:38:26,mem=14346176kb,vmem=15409440kb,walltime=07:58:35

Original comment by zack...@gmail.com on 30 Jul 2012 at 6:45

GoogleCodeExporter commented 8 years ago
Example data has been made available to them.

Original comment by zack...@gmail.com on 8 Aug 2012 at 8:47

GoogleCodeExporter commented 8 years ago
Has there been a response from novocraft?

Original comment by benb...@gmail.com on 28 Aug 2012 at 8:42

GoogleCodeExporter commented 8 years ago
Yes, I've been talking to them about the issue and how to fix it. This morning, 
I was sent an internal release (binary executable) that contains fixes. If all 
goes well, these changes will make it into the official version. 

Original comment by zack...@gmail.com on 28 Aug 2012 at 9:13

GoogleCodeExporter commented 8 years ago

Original comment by zack...@gmail.com on 6 Nov 2012 at 10:37

GoogleCodeExporter commented 8 years ago
Issue 284 has been merged into this issue.

Original comment by zack...@gmail.com on 7 Nov 2012 at 7:55

GoogleCodeExporter commented 8 years ago
This seems to be a dead topic. We continue to build upon bsmap and never 
managed to get novoalign running fast enough. Closing.

Original comment by zack...@gmail.com on 3 Apr 2013 at 11:30