shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License

Two questions related to the Fig1 in the paper #48

Open caojy-sys opened 2 months ago

caojy-sys commented 2 months ago

Hi! I have two questions about Fig 1 in the paper. The first is about the indexing part. In Block 1, the number of columns is n-1. What exactly does this n-1 refer to? Does the n-1 (number of columns) change depending on different blocks? The second is about the filtering step in the Profiling part. Why should KMCP have three filtering steps? Why not just use the second filtering step (the most rigorous round), so that KMCP needs only one round of filtering? Those are my two questions. I hope you can reply when you have time. Thank you!

shenwei356 commented 2 months ago

In Block 1, the number of columns is n-1. What exactly does this n-1 refer to?

Does the n-1 (number of columns) change depending on different blocks?

Yes. The size of bloom filters is determined by the expected false-positive rate and the length of the largest sequence in a block.

Why not just use the second filtering step (the most rigorous round), so that KMCP needs only one round of filtering?

caojy-sys commented 2 months ago

Sorry for the misunderstanding regarding "rows" and "columns," thank you so much!

caojy-sys commented 2 months ago

Sorry, I'm still confused about a few things. In the three-round filtering step, why does KMCP need the first and third rounds when the second round is the strictest one?

Next, what's the difference between Block 1 and Block B? Why should the lengths of R1-c1 and R4-c1 be different? Are Block 1 and Block B independent of each other? Can we see Block 1 as a matrix?

shenwei356 commented 2 months ago

In the three-round filtering step, why does KMCP need the first and third rounds when the second round is the strictest one?

After a round of filtering, some reads that matched multiple species are left with fewer candidate species (sometimes only one, which makes them uniquely matched reads), so another round is needed to recompute the detected reference genomes from the updated statistics. There's no need for many rounds, though; two or three are enough. As with the EM algorithm, each round of filtering improves the result, but the gains soon plateau.

In R3, the grey diamond criterion is not used because it is slow, and in my observation it is not needed there.
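To make the "new statistics" point concrete, here is a toy sketch in Python. It is not KMCP's actual filtering logic: the function names, the single criterion (minimum number of uniquely matched reads) and the thresholds are all made up for illustration. It only shows how a looser first round can turn ambiguous reads into unique ones, so that a strict round applied afterwards keeps a genome that the strict round alone would have dropped.

```python
from collections import Counter

def run_rounds(read_hits, thresholds):
    """Toy multi-round filter (hypothetical, not KMCP's real criteria).

    read_hits:  dict mapping read ID -> set of candidate reference names.
    thresholds: minimum number of uniquely matched reads required per round.
    """
    hits = {read: set(refs) for read, refs in read_hits.items()}
    for min_unique in thresholds:
        # Count reads that currently match exactly one reference.
        unique = Counter(next(iter(refs)) for refs in hits.values() if len(refs) == 1)
        kept = {ref for ref, n in unique.items() if n >= min_unique}
        # Drop filtered-out references; reads left with no candidates are discarded.
        hits = {read: refs & kept for read, refs in hits.items() if refs & kept}
    return hits

def detected(hits):
    """References that still have at least one read assigned."""
    return sorted(set().union(*hits.values()))

reads = {
    "r1": {"A"}, "r2": {"A", "C"}, "r3": {"A", "C"},
    "r4": {"B"}, "r5": {"B"}, "r6": {"B"},
}
strict_only = detected(run_rounds(reads, [3]))     # one strict round
two_rounds = detected(run_rounds(reads, [1, 3]))   # loose round, then strict round
```

In this made-up input, reference A starts with a single unique read, so the strict round alone discards it. After the loose round removes C (which has no unique reads), r2 and r3 become unique to A and the strict round then keeps it.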

Next, what's the difference between Block 1 and Block B?

Each block contains data from different genome chunks. In the figure, chunks of a genome are next to each other for simplicity. Actually, k-mers of all chunks are sorted, and divided into multiple blocks. You can also read the COBS paper.
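A minimal sketch of that grouping, in Python. It assumes chunks are ordered by their number of k-mers before being split, which is my reading of the COBS-style layout rather than something stated in this thread; `make_blocks`, `block_size` and the chunk names are hypothetical.

```python
def make_blocks(kmer_counts, block_size):
    """Group genome chunks into blocks (illustrative only).

    kmer_counts: dict mapping chunk name -> number of k-mers in that chunk.
    block_size:  hypothetical number of chunks per block.
    """
    # Assumption: order chunks by k-mer count so similar-sized chunks share a block.
    ordered = sorted(kmer_counts.items(), key=lambda item: item[1])
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]

counts = {"g1.c1": 100, "g1.c2": 900, "g2.c1": 120, "g2.c2": 950}
blocks = make_blocks(counts, block_size=2)
```

With these made-up counts, the two chunks of genome g1 end up in different blocks, which is the point the figure simplifies away.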

Why should the lengths of R1-c1 and R4-c1 be different?

They might be different or the same. The size of the bloom filters in a block is determined by the FPR and by the largest number of k-mers among the chunks in that block.
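As a rough illustration of that dependency, the sketch below uses the textbook bloom filter approximation, FPR ≈ (1 - e^(-h·n/m))^h, solved for the number of bits m. Whether KMCP uses exactly this formula or rounds the result is an assumption on my part; the function name and parameters are hypothetical.

```python
import math

def bloom_filter_bits(n_kmers, fpr, n_hashes=1):
    """Bits needed to hold n_kmers elements at the target false-positive rate.

    Standard approximation, not taken from the KMCP source code.
    """
    if not 0 < fpr < 1:
        raise ValueError("fpr must be between 0 and 1")
    if n_kmers <= 0 or n_hashes <= 0:
        raise ValueError("n_kmers and n_hashes must be positive")
    return math.ceil(-n_hashes * n_kmers / math.log(1 - fpr ** (1 / n_hashes)))

# All filters in a block share one size, driven by the block's largest chunk.
block_a = [100, 120]   # made-up k-mer counts of the chunks in one block
block_b = [900, 950]   # made-up k-mer counts of the chunks in another block
bits_a = bloom_filter_bits(max(block_a), fpr=0.3)
bits_b = bloom_filter_bits(max(block_b), fpr=0.3)
```

Under this approximation, two blocks get the same filter length only when their largest chunks hold the same number of k-mers and the same FPR is requested.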

Are Block 1 and Block B independent of each other?

Yes.

Can we see Block 1 as a matrix?

It is.
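A toy picture of the matrix view, in Python. The orientation (one row per hash position, one column per reference chunk) follows the COBS design as I understand it and may not match the figure's labelling; the CRC32 hash, the helper names and the single hash function are stand-ins for illustration, not KMCP's implementation.

```python
import zlib

def row_of(kmer, n_rows):
    """Map a k-mer to a row index (CRC32 is a stand-in hash)."""
    return zlib.crc32(kmer.encode()) % n_rows

def build_block(references, n_rows):
    """Build a bit matrix: references is a list of k-mer sets, one per column."""
    matrix = [[0] * len(references) for _ in range(n_rows)]
    for col, kmers in enumerate(references):
        for kmer in kmers:
            matrix[row_of(kmer, n_rows)][col] = 1
    return matrix

def query(matrix, kmer):
    """Read one row; each set bit marks a column that may contain the k-mer."""
    row = matrix[row_of(kmer, len(matrix))]
    return [col for col, bit in enumerate(row) if bit]

refs = [{"ACGT", "CGTA"}, {"ACGT", "TTTT"}]
block = build_block(refs, n_rows=64)
columns_with_acgt = query(block, "ACGT")
```

Looking up a k-mer touches a single row and answers for every column of the block at once. A set bit can still be a false positive, which is what the FPR setting controls.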