Closed svm-zhang closed 6 years ago
Hi Simo, Mashmap should be applicable for the problem you described above.
Do check the identity threshold (--perc_identity, --pi) parameter; you may want to adjust it according to your requirements. Increasing it (as well as seg. length) would lower down the runtime and memory usage. As you said, splitting database into chunks is also a valid option.
I will expect program to have high RAM and runtime for the thresholds you provided.
Let me know if you have any questions. Will be curious to know how it goes.
One more thing that may be helpful for you is to include the representative genome (corresponding to your assembly) in the database as well. This would help improve the specificity of the method for correct portions of your assembly.
Hi Jain,
Thanks very much for the tip! I just tried with higher --pi (95%) and the memory usage was dramatically reduced.
Regarding your latest comment, what did you mean by "representative genome"? You meant a related reference genome for the species I am assembling, right?
cheers, Simo
Right
Thanks.
I get another question: if I'd like to decontaminate at reads level, I should go with Mash instead of MashMap right?
Simo
It depends, I think Mash will help you find the source of contamination (e.g., which bacteria contribute to your assembly), but for finding which read is contaminated, you may want to use a read mapper.
Mash doesn't look at reads, only k-mer level analysis so the screen command can tell you what genomes are contained in your read set but not what reads are contributing to these genomes. MashMap gives you read level information so you could use Mash to find a set of candidate genomes to map to and then MashMap to map the reads to see which reads are contaminating.
This makes a lot of sense. Thanks very much!
Hi,
I've got the similar questions as svm-zhang. I downloaded all the bacteria genomes in NCBI and mapped the PacBio reads to remove those contaminated. I came across two problems and the version I used was v2.0.
$path2mashmap -r $contaminants -q third_all.fasta -o mashmap.out
Start time is 2018/07/02--19:10
Reference = [/home/niuyw/RefData/NCBI_contaminants/contaminants.fa] Query = [third_all.fasta] Kmer size = 16 Window size = 90 Segment length = 5000 (read split allowed) Alphabet = DNA Percentage identity threshold = 85% Mapping output file = mashmap.out Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none) Execution threads = 1
INFO, skch::Sketch::build, minimizers picked from reference = 985588968 INFO, skch::Sketch::index, unique minimizers = 58991923 INFO, skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 14684157) ... (88712, 1) INFO, skch::Sketch::computeFreqHist, With threshold 0.001%, ignore minimizers occurring >= 3592 times during lookup. INFO, skch::main, Time spent computing the reference index: 4258.82 sec INFO, skch::Map::mapQuery, [count of mapped reads, reads qualified for mapping, total input reads] = [463569, 4080228, 6633142] INFO, skch::main, Time spent mapping the query : 18524.8 sec INFO, skch::main, mapping results saved in : mashmap.out Finish time is 2018/07/03--01:30
But when I adjusted the parameter -s
and --pi
to filter the contaminanted reads more strictly. It failed and seemed to be a RAM problem.
$path2mashmap -r $contaminants -t 8 -q third_all.fasta -s 500 --pi 70 -o mashmap1.out
>>>>>>>>>>>>>>>>>>
Reference = [/home/niuyw/RefData/NCBI_contaminants/contaminants.fa]
Query = [third_all.fasta]
Kmer size = 16
Window size = 1
Segment length = 500 (read split allowed)
Alphabet = DNA
Percentage identity threshold = 70%
Mapping output file = mashmap1.out
Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
Execution threads = 8
>>>>>>>>>>>>>>>>>>
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
My questions are:
--perc_identity
)?
$ sort -t ' ' -k10,10n mashmap.out |head -3
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/100026/37064_49118 12054 7054 12053 + kraken:taxid|336810|NZ_CP021172.1 218034 75901 80900 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/100308/48537_58156 9619 4619 9618 - kraken:taxid|880070|NC_015914.1 6221273 4110216 4115215 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/103161/48125_53357 5232 232 5231 + kraken:taxid|336810|NZ_CP021172.1 218034 75901 80900 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/104812/38481_50150 11669 0 4999 - kraken:taxid|877455|NC_015216.1 2583753 1814585 1819584 80.8247
Thank you in advance!
Hi @YiweiNiu , -s 500 --pi 70
are probably too strict for Mashmap. If you want to go for strict parameters than default, you can try -s 2500 --pi 85
or -s 2500 --pi 80
; that should be good enough for your application. More strict the parameters are, higher is the resource requirement (both memory and time).
Also, 10th column conveys the estimated identity from k-mer hits, which can be less than the threshold. These are outputted to ensure high sensitivity.
Thank you for your reply!
Could you please explain that more? Because there are ~10% reads whose length is less than 1kb, and ~40% less than 5kb. That's why I wanted to lower the -s
parameter.
And, you forgot my second qustion :) When I checked the 10th column of the output, I found the minimum figure was about 80%. Shouldn't it be 85% (the default of --perc_identity
)?
Try -s 500 --pi 85
and see if you get output.
For the second question, please see me previous response again, I had edited it later.
I tryied -s 500 --pi 85
and it ran successfully.
The PacBio reads were from several insects (mixed individuals to get enough DNA). And the $contaminants database contains about 3w sequences, which I merged archaea, bacteria, fungi, protozoa and viral sequences from NCBI. So it's very huge.
Hello,
Thanks very much for MashMap!
I have an assembly that is contaminated with lots of bacterial sequences, and I'd like to try MashMap to tease them out. I downloaded all complete RefSeq bacterial genomes, and put them in a single FASTA file. My assembly is highly fragmented, including 300k+ contigs. When I ran MashMap as below,
mashmap -r ${refseq} -q ${assembly} -s 500 -t ${NT} -o ${out}
the program consumes large amount of RAM (>= 200GB). My question is:
Any advice would be appreciated!
cheers, Simo