seqan / slimm

Species Level Identification of Microbes from Metagenomes
Other
27 stars 3 forks source link

Simulation tests failed! #14

Closed yeli7068 closed 7 years ago

yeli7068 commented 7 years ago

Hi,

Congratulations on your works, I just tested SLIMM and I found it was partially working as I wished.

I downloaded the 5 sequences as references to generate simulation read with art. Then these reads were subjected to AB-5K pre-formatted database as you provided. The alignment result was analysis by SLIMM.

Reference:

NC_003028.3 Streptococcus pneumoniae TIGR4, complete genome NC_021994.1 Enterococcus faecium Aus0085, complete genome NZ_CP015831.1 Escherichia coli O157 strain 644-PT8, complete genome NZ_CP020463.1 Staphylococcus epidermidis strain 1457 chromosome, complete genome NZ_CP016032.1 Serratia marcescens strain U36365, complete genome

I think the coverage should be like depth-coverage. Then the value for each species tested should be 20. But coverage for Serratia marcescens were almost 0. Anyway the good news is what I want is reported.

Results (sorted by coverage): 3 Streptococcus pneumoniae 1313 287886 11.7744 1 19.9843 9 Enterococcus faecium 1352 399132 16.3243 1 18.4821 4 Escherichia coli 562 644243 26.3492 6 18.238 7 Staphylococcus epidermidis 1282 304714 12.4626 4 17.6741 5 Serratia marcescens 615 25625 1.04805 4 0.714683 6 Serratia sp. FS14 1327989 3657 0.149569 1 0.104488 1 Escherichia fergusonii 564 3336 0.136441 1 0.102832 2 Serratia sp. SCBI 488142 3179 0.130019 1 0.093465 8 Citrobacter rodentium 67825 1362 0.0557051 1 0.0375254

Code:

simulation

art_illumina -ss HSXt -sam -i candidate.fasta -p -l 150 -f 20 -m 200 -s 50 -o candidate_paired

alignment with bowtie2

bowtie2 -x /home/liyang/test_tool/slimm-0.2.2/AB_5K_indexed_ref_genomes_bowtie2/AB_5K -1 candidate_paired1.fq -2 candidate_paired2.fq -q --no-unal --mm -p 10 -k 60 2>candidate_bowtie2.log | samtools view -bS - > candidate_bowtie2.bam

Analysis with SLIMM

slimm -m /home/liyang/test_tool/slimm-0.2.2/slimmDB_5K/ -o ./ candidate_bowtie2.bam

Could you give me some suggestions?

Cheers!

temehi commented 7 years ago

Hi,

At the moment I don't have "art" setted-up. I will look into the problem more in detail and get back to you soon. But I have a couple of questions:

  1. Did you have 5 (Pair) simulated read files for each species first and then combined them?
  2. Could you please run SLIMM again on the same BAM file with -or (--output-raw) option and attach the "candidate_bowtie2.tsv" file?
yeli7068 commented 7 years ago

Hi,

Thanks for your patience.

Art can generate reads at given X depth for each reference. That is to say, the command line above can generate difference number of reads at given depth.

i am outside of work. Can I give you the results two days later?

Cheers

yeli7068 commented 7 years ago

Hi,

Please check the attachment as you needed. We also used some samples sequenced and confirmed by qPCR. And the results seemed not so good. Could you give me some suggestions?

Thanks for your help in advance.

Cheers

candidate_bowtie2.zip

temehi commented 7 years ago

Hi,

Thanks for uploading the file. I have looked at it.

There are 5 references in the "AB-5K" database under the species Serratia marcescens and they have the following results each according to the raw output file. image As you can see the NoOfUniqueReads (total ~ 14K) is much lower in number than the NoOfReads (total ~ 2637K) in general. SLIMM tries to increase the NoOfUniqueReads by discarding some of the genomes from consideration. That is what you see described as NoOfUniqueReads2 (total ~ 15K).

When SLIMM reports taxonomic profiles (e.g.) at species level it uses mainly NoOfUniqueReads2 under that species plus shared reads (but shared by only the genomes under that species). and the coverage is reported based on that. for Serratia marcescens we have 26k such reads. Which means there are additional 11K shared reads only among the 5 genomes of Serratia marcescens.

The coverage you see in the last column of the report is calculated as

(NoOfReads (uniquely appointed to a clade)) * averageReadLength/AverageGenomeLength (of participating genomes)

26K * 150 / 5274k = 0.73947

So you wouldn't find the same coverage as you simulated. I recommend using only the relative abundance information.

yeli7068 commented 7 years ago

Hi,

Thanks for your response. I tried to follow your logic to test the E.coli O157. The situation of E.coli was very similar with Serratia marcescens/ As E. coli was species-level where E.coli O157 was sub-speiecs.

Here we go, the results reported by SLIMM is

4 Escherichia coli 562 644243 26.3492 6 18.238

The depth of E.coli was 18.238.

The results from the raw output related with E.coli was followings filtered by NoOfUniqueReads2 > 0 to save the space to show:

image

The coverage you see in the last column of the report is calculated as

(NoOfReads (uniquely appointed to a clade)) * averageReadLength/AverageGenomeLength (of participating genomes)

44k * 150 / 5M = 1.32

So the results is not the same as final reported.

Do you have any suggestions?

Cheers

temehi commented 7 years ago

44k * 150 / 5M = 1.32 You didn't follow my logic right. You used 44k which is only the sum of unique reads by E.coli. In my previous example, I didn't use only the sum which is 15K, but I used 15K + 11K other reads that are not unique to the genomes but are unique to the species Serratia marcescens .

In the case of E.coli you have a total of 644k reads from which 44k of them are unique to a single genome . The remaining 600K are shared by multiple shared by 1 or more genomes of E.coli.

so your coverage for E.coli will be 644K*150/5000K = 19.32

yeli7068 commented 7 years ago

Hi,

Thanks. Could you tell the how did you tell the reads only shared by the specific species according to the table?

and the coverage is reported based on that. for Serratia marcescens we have 26k such reads.

The remaining 600K are shared by multiple shared by 1 or more genomes of E.coli.

`Thanks

temehi commented 7 years ago

That information is not solely based on the raw output table. For each reads, I know where they mapped to. And based on that SLIMM checks if a read belongs to a given species or not. You find the numbers inside the species level report under noOfReads column. There it is reported 26K for Serratia marcescens and 644k for E.coli

yeli7068 commented 7 years ago

All right. So the so-called species-specific reads we can not get.

How to make the judgement of the Serratia marcescens is there? Set a cutoff for depth? This simulation is very ideal which cannot happen at real case. The reason I chose these 5 bacteria because I sequenced the samples mixed with the DNA of these bacteria. KRAKEN is good as it displayed all the positive results with fair abundance. The results from SLIMM, however, tell me nothing there, 0 species detected. I am surprised with this 0 species detected as I read your paper. So I carry this data exp out to check SLIMM.

The results were not happy so I started this discussion with you. Now you tell me a real ridiculous result:

if a read belongs to a given species or not. You find the numbers inside the species level report under noOfReads column. There it is reported 26K for Serratia marcescens and 644k for E.coli

Let`s translate it. SLIMM is saying E.coli is so specific. No other species under Escherichia share the reads from E.coli. The genomes under Escherichia were divergent. In addition, Serratia marcescens was so conserved. Lot of reads from Serratia marcescens were conserved.

Another issue will be raised. How does SLIMM make the species-specific reads? SLIMM may take many many reasons into consideration to make the judgement which reads is specific at relative rank. I think one of these reasons tell the reads specific or not based on the database used. Do you think there is a chance that the slimm-specific reads are not specific if the subject database is NT or other comprehensive collection not only include bacteria?

The final reason we are taking the discussions here because I think SLIMM has the potential to be great.

temehi commented 7 years ago

In your particular simulated test:

If you want a more detailed information and haven't done it already, I recommend reading the paper.

I am closing the issue and locking the conversation. You may contact me in private.