Closed yeli7068 closed 7 years ago
Hi,
At the moment I don't have "art" setted-up. I will look into the problem more in detail and get back to you soon. But I have a couple of questions:
Hi,
Thanks for your patience.
Art can generate reads at given X depth for each reference. That is to say, the command line above can generate difference number of reads at given depth.
i am outside of work. Can I give you the results two days later?
Cheers
Hi,
Please check the attachment as you needed. We also used some samples sequenced and confirmed by qPCR. And the results seemed not so good. Could you give me some suggestions?
Thanks for your help in advance.
Cheers
Hi,
Thanks for uploading the file. I have looked at it.
There are 5 references in the "AB-5K" database under the species Serratia marcescens and they have the following results each according to the raw output file. As you can see the NoOfUniqueReads (total ~ 14K) is much lower in number than the NoOfReads (total ~ 2637K) in general. SLIMM tries to increase the NoOfUniqueReads by discarding some of the genomes from consideration. That is what you see described as NoOfUniqueReads2 (total ~ 15K).
When SLIMM reports taxonomic profiles (e.g.) at species level it uses mainly NoOfUniqueReads2 under that species plus shared reads (but shared by only the genomes under that species). and the coverage is reported based on that. for Serratia marcescens we have 26k such reads. Which means there are additional 11K shared reads only among the 5 genomes of Serratia marcescens.
The coverage you see in the last column of the report is calculated as
(NoOfReads (uniquely appointed to a clade)) * averageReadLength/AverageGenomeLength (of participating genomes)
26K * 150 / 5274k = 0.73947
So you wouldn't find the same coverage as you simulated. I recommend using only the relative abundance information.
Hi,
Thanks for your response. I tried to follow your logic to test the E.coli O157. The situation of E.coli was very similar with Serratia marcescens/ As E. coli was species-level where E.coli O157 was sub-speiecs.
Here we go, the results reported by SLIMM is
4 Escherichia coli 562 644243 26.3492 6 18.238
The depth of E.coli was 18.238.
The results from the raw output related with E.coli was followings filtered by NoOfUniqueReads2 > 0 to save the space to show:
The coverage you see in the last column of the report is calculated as
(NoOfReads (uniquely appointed to a clade)) * averageReadLength/AverageGenomeLength (of participating genomes)
44k * 150 / 5M = 1.32
So the results is not the same as final reported.
Do you have any suggestions?
Cheers
44k * 150 / 5M = 1.32 You didn't follow my logic right. You used 44k which is only the sum of unique reads by E.coli. In my previous example, I didn't use only the sum which is 15K, but I used 15K + 11K other reads that are not unique to the genomes but are unique to the species Serratia marcescens .
In the case of E.coli you have a total of 644k reads from which 44k of them are unique to a single genome . The remaining 600K are shared by multiple shared by 1 or more genomes of E.coli.
so your coverage for E.coli will be 644K*150/5000K = 19.32
Hi,
Thanks. Could you tell the how did you tell the reads only shared by the specific species according to the table?
and the coverage is reported based on that. for Serratia marcescens we have 26k such reads.
The remaining 600K are shared by multiple shared by 1 or more genomes of E.coli.
`Thanks
That information is not solely based on the raw output table. For each reads, I know where they mapped to. And based on that SLIMM checks if a read belongs to a given species or not. You find the numbers inside the species level report under noOfReads column. There it is reported 26K for Serratia marcescens and 644k for E.coli
All right. So the so-called species-specific reads we can not get.
How to make the judgement of the Serratia marcescens is there? Set a cutoff for depth? This simulation is very ideal which cannot happen at real case. The reason I chose these 5 bacteria because I sequenced the samples mixed with the DNA of these bacteria. KRAKEN is good as it displayed all the positive results with fair abundance. The results from SLIMM, however, tell me nothing there, 0 species detected. I am surprised with this 0 species detected as I read your paper. So I carry this data exp out to check SLIMM.
The results were not happy so I started this discussion with you. Now you tell me a real ridiculous result:
if a read belongs to a given species or not. You find the numbers inside the species level report under noOfReads column. There it is reported 26K for Serratia marcescens and 644k for E.coli
Let`s translate it. SLIMM is saying E.coli is so specific. No other species under Escherichia share the reads from E.coli. The genomes under Escherichia were divergent. In addition, Serratia marcescens was so conserved. Lot of reads from Serratia marcescens were conserved.
Another issue will be raised. How does SLIMM make the species-specific reads? SLIMM may take many many reasons into consideration to make the judgement which reads is specific at relative rank. I think one of these reasons tell the reads specific or not based on the database used. Do you think there is a chance that the slimm-specific reads are not specific if the subject database is NT or other comprehensive collection not only include bacteria?
The final reason we are taking the discussions here because I think SLIMM has the potential to be great.
In your particular simulated test:
If you want a more detailed information and haven't done it already, I recommend reading the paper.
I am closing the issue and locking the conversation. You may contact me in private.
Hi,
Congratulations on your works, I just tested SLIMM and I found it was partially working as I wished.
I downloaded the 5 sequences as references to generate simulation read with art. Then these reads were subjected to AB-5K pre-formatted database as you provided. The alignment result was analysis by SLIMM.
Reference:
I think the coverage should be like depth-coverage. Then the value for each species tested should be 20. But coverage for Serratia marcescens were almost 0. Anyway the good news is what I want is reported.
Results (sorted by coverage): 3 Streptococcus pneumoniae 1313 287886 11.7744 1 19.9843 9 Enterococcus faecium 1352 399132 16.3243 1 18.4821 4 Escherichia coli 562 644243 26.3492 6 18.238 7 Staphylococcus epidermidis 1282 304714 12.4626 4 17.6741 5 Serratia marcescens 615 25625 1.04805 4 0.714683 6 Serratia sp. FS14 1327989 3657 0.149569 1 0.104488 1 Escherichia fergusonii 564 3336 0.136441 1 0.102832 2 Serratia sp. SCBI 488142 3179 0.130019 1 0.093465 8 Citrobacter rodentium 67825 1362 0.0557051 1 0.0375254
Code:
simulation
art_illumina -ss HSXt -sam -i candidate.fasta -p -l 150 -f 20 -m 200 -s 50 -o candidate_paired
alignment with bowtie2
bowtie2 -x /home/liyang/test_tool/slimm-0.2.2/AB_5K_indexed_ref_genomes_bowtie2/AB_5K -1 candidate_paired1.fq -2 candidate_paired2.fq -q --no-unal --mm -p 10 -k 60 2>candidate_bowtie2.log | samtools view -bS - > candidate_bowtie2.bam
Analysis with SLIMM
slimm -m /home/liyang/test_tool/slimm-0.2.2/slimmDB_5K/ -o ./ candidate_bowtie2.bam
Could you give me some suggestions?
Cheers!