MosaikAligner: Run out the allocated position memory

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Build genome and reads .DAT files
2. Make jump databases
3. Attempt to align ~5M reads to the genome

What is the expected output? What do you see instead?

I expect to see a BAM file with valid alignments

What version of the product are you using? On what operating system?

2.1.33 on Mac OS X v10.6.8, 8GB RAM

Please provide any additional information below.

Here are the commands and resulting outputs:

$ MosaikBuild -fr dm3_ambig.fa -oa dm3_ambig.dat

------------------------------------------------------------------------------Mo
saikBuild 2.1.33                                                  2011-11-08
Michael Stromberg & Wan-Ping Lee  Marth Lab, Boston College Biology Department
------------------------------------------------------------------------------

- converting ../Dmel/dm3_ambig.fa to a reference sequence archive.

- parsing reference sequences:
ref seqs: 5 (1.11 ref seqs/s)

- writing reference sequences:
100%[=======================================================]      3.33 ref 
seqs/s        in  1 s  

- calculating MD5 checksums:
100%[=======================================================]      5.00 ref 
seqs/s        in  1 s  

- writing reference sequence index:
100%[=======================================================]      5.00 ref 
seqs/s        in  1 s  

- creating concatenated reference sequence:
100%[=======================================================]      5.00 ref 
seqs/s        in  1 s  

- writing concatenated reference sequence...        finished.
- creating concatenated 2-bit reference sequence... finished.
- writing concatenated 2-bit reference sequence...  finished.
- writing masking vector...                         finished.

MosaikBuild CPU time: 7.840 s, wall time: 11.572 s

$ MosaikJump -ia dm3_ambig.dat -out dm3_ambig_15 -hs 15

------------------------------------------------------------------------------
MosaikJump 2.1.33                                                   2011-11-08
Michael Stromberg & Wan-Ping Lee  Marth Lab, Boston College Biology Department
------------------------------------------------------------------------------

- retrieving reference sequence... finished.

- hashing reference sequence:
100%[=========================================================] 3,697,325 
hashes/s        in 32 s  

- serializing final sorting vector... finished.

- writing jump positions database:
100%[=================================================] 204,738.6 hash 
positions/s       in 09:40  

- serializing jump keys database (17 blocks):
blocks: 17 (0.1689 blocks/s)

- cleaning up temp files...finished.

MosaikJump CPU time: 612.656 s, wall time: 820.335 s

$ MosaikBuild -fr chr2L_single_bp50_error0_equal.fa -out 
chr2L_single_bp50_error0_equal.dat -st illumina -assignQual 60

------------------------------------------------------------------------------
MosaikBuild 2.1.33                                                  2011-11-08
Michael Stromberg & Wan-Ping Lee  Marth Lab, Boston College Biology Department
------------------------------------------------------------------------------

- setting read group ID to: ZGO3OS1OIK2
- setting sample name to: unknown
- setting sequencing technology to: illumina

- parsing FASTA files:
reads: 4,890,996 (126,503.3 reads/s)

Filtering statistics:
============================================
# reads written:           4890996
# bases written:         244549800

MosaikBuild CPU time: 31.436 s, wall time: 39.814 s

$ MosaikAligner -in chr2L_single_bp50_error0_equal.dat -out 
chr2L_single_bp50_error0_equal.dm3_ambig.dat -ia dm3_ambig.dat -hs 15 -mm 0 -m 
unique -j dm3_ambig_15 -p 2 -annse 2.1.26.se.100.005.ann -annpe 
2.1.26.pe.100.0065.ann

------------------------------------------------------------------------------
MosaikAligner 2.1.33                                                2011-11-08
Michael Stromberg & Wan-Ping Lee  Marth Lab, Boston College Biology Department
------------------------------------------------------------------------------

- Using the following alignment algorithm: all positions
- Using the following alignment mode: only aligning unique reads
- Using a maximum mismatch threshold of 0
- Using a hash size of 15
- Using 2 processors
- Using a Smith-Waterman bandwidth of 1
- Using an alignment candidate threshold of 23bp.
- Setting hash position threshold to 200
- Using a jump database for hashing. Storing keys & positions in memory.
- loading reference sequence... finished.
- loading jump key database into memory... finished.
- loading jump positions database into memory... ERROR: Run out the allocated 
position memory.

Original issue reported on code.google.com by krai...@gmail.com on 15 Nov 2011 at 8:20

GoogleCodeExporter commented 8 years ago

Hi there,

May I know that how many base pairs of the genome?

Original comment by WanPing....@gmail.com on 22 Nov 2011 at 3:25

GoogleCodeExporter commented 8 years ago

The genome is a modified version of the original D. melanogaster reference dm3 
with positions edited and replaced with IUPAC ambiguity codes from known SNP 
locations. It is also a subset of the chromosomes (2L,2R,3L,3R,X), and in total 
is 119,029,689 bp (I would simply attach it but it exceeds the 10MB limit).

Thank you very much for looking into this! Let me know if you'd like me to send 
you both input files (the reference and the sequences I am attempting to align).

-Kraig

Original comment by krai...@gmail.com on 22 Nov 2011 at 3:46

GoogleCodeExporter commented 8 years ago

Hi Kraig,

I did a very simple test for memory usage by using human chr13 since the size 
of it (119,029,689 bps) is close to yours.

memory usage:
with jump db: 5894M
without jump db: 6508M

It seems that both of them should work on 8G RAM machine.

May I know the file sizes of dm3_ambig.dat, dm3_ambig_15_positions.jmp, and 
dm3_ambig_keys.jmp?

Original comment by WanPing....@gmail.com on 22 Nov 2011 at 9:59

Changed state: Started

GoogleCodeExporter commented 8 years ago

dm3_ambig.dat: 163MB
dm3_ambig_15_positions.jmp: 814M
dm3_ambig_keys.jmp: 5GB

That seems to be quite on par with the memory usage. I tried using the option 
-pd, and the new output (and error) I get is the following:

------------------------------------------------------------------------------
MosaikAligner 2.1.33                                                2011-11-08
Michael Stromberg & Wan-Ping Lee  Marth Lab, Boston College Biology Department
------------------------------------------------------------------------------

- Using the following alignment algorithm: all positions
- Using the following alignment mode: only aligning unique reads
- Using a maximum mismatch threshold of 0
- Using a hash size of 15
- Using 2 processors
- Using a Smith-Waterman bandwidth of 1
- Using an alignment candidate threshold of 23bp.
- Setting hash position threshold to 200
- Using a jump database for hashing. Storing keys in memory.
- loading reference sequence... finished.
- loading jump key database into memory... finished.
finished.

Aligning read library (7651826):
 0% [                                      ]                                  |ERROR: A position (47489640703) was specified that is larger than the jump positions database (853649916).

Original comment by krai...@gmail.com on 22 Nov 2011 at 10:42

GoogleCodeExporter commented 8 years ago

I agree; it seems so.
chr13.dat: 129M
chr13_15_positions.jmp: 630M
chr13_15_keys.jmp: 5.0G
So, I think that the memory usage of dm3_ambig may exceed or be very close to 
the limitation.

Do you mind to try with smaller hash size, saying 14? Its keys.jmp is smaller.

For -pd, to be honest, I didn't maintain it for a long time; we're planning to 
replace it by adopting other database libraries, e.g. Kyoto cabinet or leveldb.

Original comment by WanPing....@gmail.com on 23 Nov 2011 at 4:49

GoogleCodeExporter commented 8 years ago

Great, that seems to have worked. I now have several alignment files 
(*.dat.bam, *.dat.multiple.bam) and a *.stat file. 

One interesting observation is that, of the ~7M reads that I attempted to 
align, only half of them aligned successfully (excluding ~8K mapping to 
multiple locations). This is interesting because the reads I am aligning have 
been generated from transcribed regions of the genome that I am aligning them 
to. These 50bp reads were generated with no errors in the following manner: 
given the position of a constitutive exon, generate random 50bp reads in this 
region. The reads are unambiguous, unlike the genome they are being aligned to. 

Can you think of a reason why reads generated from known transcribed regions of 
a genome only have a 50/50 chance of aligning back to that same genome where 
they came from? I've also excluded reads from exon-exon junctions, so that 
shouldn't be the cause of the low mapping percentage.

Any help on this matter would be greatly appreciated!

-Kraig

Original comment by krai...@gmail.com on 23 Nov 2011 at 6:43

GoogleCodeExporter commented 8 years ago

I think IUPAC in references causes this problem since only 'A', 'C', 'G', and 
'T' will be hashed.

Original comment by WanPing....@gmail.com on 23 Nov 2011 at 7:14

GoogleCodeExporter commented 8 years ago

Hmm, that's interesting, because the original MOSAIK documentation says that it 
can handle ambiguity codes in the reference, "i.e. if the reference genome has 
an 'M' allele, a read position with an 'A' or 'C' at that location will be 
scored as a match".

Also, of the half that did align, half of those belong to the alternate allele 
(the reason I'm aligning the generated reads to an ambiguous genome is because 
I've generated them from two versions of the original reference, so as to 
simulate reads from different alleles), so I'm pretty sure the IUPAC codes are 
being hashed, otherwise these would've been kicked out due to the maximum 
mismatch threshold of 0.

Original comment by krai...@gmail.com on 23 Nov 2011 at 7:27

GoogleCodeExporter commented 8 years ago

The case I'm worrying about is like
ref: AGTMCMGT
     ||| | ||
read:AGTCCAGT, with hash size = 4.
We'll hash nothing for the ref since we only hash ACGT. Therefore, the read 
will hit nothing in the hash table.

Let's see the other example.
ref: AGTMCAGT
     ||| ||||
read:AGTCCAGT, with hash size = 4. The read will hit a hash, CAGT, and then 
apply Smith-Waterman (SW) to get the alignment. Since (C,M) is not considered a 
mismatch in SW, the program doesn't filter the alignment out. I think that the 
reason why we can see those half aligned reads.

Original comment by WanPing....@gmail.com on 23 Nov 2011 at 7:59

GoogleCodeExporter commented 8 years ago

To resolve this problem, MosaikJump should take IUPAC into account.

Original comment by WanPing....@gmail.com on 23 Nov 2011 at 8:01

GoogleCodeExporter commented 8 years ago

Ah I see. Is that something planned for the future? I think this would be a 
really useful feature. 

Please let me know, otherwise a work-around might be decreasing my hash size, 
which may solve the lack of alignments to SNP-dense regions. Is that a correct 
assumption?

-Kraig

Original comment by krai...@gmail.com on 23 Nov 2011 at 9:18

GoogleCodeExporter commented 8 years ago

Hi Kraig,

I just push MosaikJump with IUPAC feature. To enable it, please use -iupac for 
MosaikJump. And, please let me know how does it work?

//Wan-Ping

Original comment by WanPing....@gmail.com on 28 Nov 2011 at 7:35

GoogleCodeExporter commented 8 years ago

Hello Wan-Ping,

I don't see a new version of MOSAIK in the downloads section, but just in case 
the date didn't get updated, I downloaded the latest version, compiled, and 
tried the following command:

MosaikJump -ia ../Dmel/dm3_ambig.dat -out ../Dmel/dm3_ambig_14 -hs 14 -iupac

The -iupac was not recognized by version 2.1.33. Is there something I am doing 
wrong?

Thanks!

-Kraig

Original comment by krai...@gmail.com on 28 Nov 2011 at 8:17

GoogleCodeExporter commented 8 years ago

Hi Kraig,

I didn't wrap and put it on downloads; I pushed it on git repo. Do you need it 
on downloads?

Original comment by WanPing....@gmail.com on 28 Nov 2011 at 8:21

GoogleCodeExporter commented 8 years ago

Nope, I just downloaded git for Mac, no problem. However, after attempting to 
realign the reads using the new jump database, not a single read aligned:

------------------------------------------------------------------------------
MosaikAligner 2.1.50                                                2011-11-22
Michael Stromberg & Wan-Ping Lee  Marth Lab, Boston College Biology Department
------------------------------------------------------------------------------

- Using the following alignment algorithm: all positions
- Using the following alignment mode: only aligning unique reads
- Using a maximum mismatch threshold of 0
- Using a hash size of 14
- Using 2 processors
- Using a Smith-Waterman bandwidth of 1
- Using an alignment candidate threshold of 23bp.
- Setting hash position threshold to 200
- Using a jump database for hashing. Storing keys & positions in memory.
- loading reference sequence... finished.
- loading jump key database into memory... finished.
- loading jump positions database into memory... finished.

Aligning read library (7651826):
100%[=====================================]  21,182.6 reads/s       in 06:01  

- cleaning up temp files...finished.

Alignment statistics (mates):
================================================
# filtered out(F):             7651826 (100.0 %)
================================================
total aligned:                       0 (  0.0 %)
total:                         7651826

MosaikAligner CPU time: 703.031 s, wall time: 385.561 s

Original comment by krai...@gmail.com on 28 Nov 2011 at 10:31

GoogleCodeExporter commented 8 years ago

Hi Kriag,

I saw and fixed that bug. Could you please try it again?

Original comment by WanPing....@gmail.com on 29 Nov 2011 at 4:46

GoogleCodeExporter commented 8 years ago

Thanks Wan-Ping!

The alignment worked this time, but I am still seeing less than desirable 
alignment statistics, given that these reads were generated from the same 
reference to which they're being aligned (my guess now is that the SW algorithm 
is performing worse in SNP-dense regions, and that is what is filtering half of 
the reads out). Are there any other obvious parameters you can see in the 
output that you think might be responsible? Here is the output:

------------------------------------------------------------------------------
MosaikAligner 2.1.52                                                2011-11-28
Michael Stromberg & Wan-Ping Lee  Marth Lab, Boston College Biology Department
------------------------------------------------------------------------------

- Using the following alignment algorithm: all positions
- Using the following alignment mode: only aligning unique reads
- Using a maximum mismatch threshold of 0
- Using a hash size of 14
- Using 2 processors
- Using a Smith-Waterman bandwidth of 1
- Using an alignment candidate threshold of 23bp.
- Setting hash position threshold to 200
- Using a jump database for hashing. Storing keys & positions in memory.
- loading reference sequence... finished.
- loading jump key database into memory... finished.
- loading jump positions database into memory... finished.

Aligning read library (7651826):
100%[=====================================]  15,993.7 reads/s       in 07:58  

- cleaning up temp files...finished.

Alignment statistics (mates):
================================================
# filtered out(F):             3588112 ( 46.9 %)
# uniquely aligned mates(U):   4055485 ( 53.0 %)
# multiply aligned mates(M):      8229 (  0.1 %)
================================================
total aligned:                 4063714 ( 53.1 %)
total:                         7651826

MosaikAligner CPU time: 829.610 s, wall time: 535.344 s

Original comment by krai...@gmail.com on 29 Nov 2011 at 7:44

GoogleCodeExporter commented 8 years ago

I may try smaller -act; currently it is 23 (- Using an alignment candidate 
threshold of 23bp.)
The minimal number for -act is the hash size that is 14 in this example.

Original comment by WanPing....@gmail.com on 29 Nov 2011 at 8:51

GoogleCodeExporter commented 8 years ago

I tried using the minimal -act of 14, and that increased the number of uniquely 
mapped reads by 0.1% (~ 6k reads). Still not quite optimal, considering the 
reads were generated directly, without modeling error rate, from the reference 
they are being aligned to. When I generated the reads, I included metadata in 
the read name that allows me to confirm their location origin (i.e. 
>HWI_2L_CG5869_16302527_16302577_1_refPos implies that the read originated from 
the left arm of chromosome 2 and is from gene CG5869, with a base-span 
corresponding to BED format from 16302527 to 16302577). I'm going to do a 
couple more tests to see if I can pinpoint exactly what might be going on here, 
so I will keep you posted. 

Thanks for all your help!

-Kraig

Original comment by krai...@gmail.com on 5 Dec 2011 at 3:38

GoogleCodeExporter commented 8 years ago

Hi Kraig,

Thanks for the feedback.
I'm thinking using smaller hash size, ~7, and the same number for -act.

Original comment by WanPing....@gmail.com on 5 Dec 2011 at 4:07

guanchangge / mosaik-aligner

MosaikAligner: Run out the allocated position memory #114