ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

Coverage ? #67

ptranvan closed 5 years ago

ptranvan commented 5 years ago

I am looking for coverage information for each contig. Do you display it somewhere?

ruanjue commented 5 years ago

You can get the edge coverage by counting the reads in each E block of the <prefix>.ctg.lay file.

cjchen5 commented 5 years ago

Hi Jue, do you mean that for, e.g., 'E 0 N162023 - N164002 -', I need to count how many 'S' lines come under it before the next 'E' (e.g., 'E 1280 N164002 - N170250 -')? If there are 40 'S' lines between 'E 0 N162023 - N164002 -' and 'E 1280 N164002 - N170250 -', does that mean the coverage of 'E 0 N162023 - N164002 -' is 40X? Thanks.

ruanjue commented 5 years ago

Yes, it is 40X.

E starts an EDGE block, which contains many S lines. The read coverage of the edge is the number of S lines in its block.
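The counting rule above can be sketched in a few lines of Python. This is only an illustration, not part of wtdbg2: the function name `edge_read_counts`, the contig header, and the read names `read_a` to `read_d` are made up for the example, while the two E lines are the ones quoted in the question.

```python
def edge_read_counts(lines):
    """Count the S lines under each E line of a .ctg.lay stream.

    Returns a list of (contig_name, edge_offset, read_count) tuples.
    """
    counts = []
    contig = None
    in_edge = False
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith(">"):
            # Contig header, e.g. ">ctg1 nodes=... len=..."
            contig = fields[0][1:]
            in_edge = False
        elif fields[0] == "E":
            # A new EDGE block starts; its second field is the offset.
            counts.append([contig, int(fields[1]), 0])
            in_edge = True
        elif fields[0] == "S" and in_edge:
            # Each S line is one read supporting the current edge.
            counts[-1][2] += 1
    return [tuple(c) for c in counts]


# Hypothetical input: header and S lines are placeholders.
sample = [
    ">ctg1 nodes=3 len=4096",
    "E 0 N162023 - N164002 -",
    "S read_a + 0 1280",
    "S read_b - 256 1280",
    "S read_c + 512 1536",
    "E 1280 N164002 - N170250 -",
    "S read_d + 0 1280",
]
print(edge_read_counts(sample))
```

With 40 S lines in the first block instead of 3, the first tuple would report a count of 40, which is the 40X discussed above.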

cjchen5 commented 5 years ago

Thank you!

cjchen5 commented 5 years ago

Hi Jue, could you tell me what '365f4a72-3bc1-4374-a99a-70bde482212c' and '15872' mean in 'S 365f4a72-3bc1-4374-a99a-70bde482212c - 15872 2304'? And after getting the coverage of each edge as you described, what should I do to get the coverage of each contig? Should I average the coverage of all edges in a contig? Thanks again.

ruanjue commented 5 years ago

Please have a look at https://github.com/ruanjue/wtdbg2/blob/master/README-ori.md#output

>ctg(\d+) nodes=(\d+) len=(\d+)
E <OFFSET> <NODE1> <STRAND1> <NODE2> <STRAND2>
S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> <REG_SEQ>
S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> <REG_SEQ>
S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> <REG_SEQ>
...
E ...
...

cjchen5 commented 5 years ago

Thank you! Could you please tell me whether my understanding is right?

  1. Are nodes the number of reads in this contig, and is an EDGE one overlap in the contig?
  2. Is 0 in <OFFSET> the first position of that contig, and is 0 in <REG_OFFSET> the first position of that contig or the first position of that read?
  3. After getting the coverage of each edge as you described, what should I do to get the coverage per contig? Should I average the coverage of all edges in a contig?

ruanjue commented 5 years ago

1. NODE and EDGE are defined in the FBG (see https://www.biorxiv.org/content/10.1101/530972v1). Or you can simply think of them as in a DBG.
2. The 0 for NODE is the estimated offset in that contig; the 0 for REG_OFFSET is the offset on that read.
3. ctg_acg_cov = sum(edge_coverage * edge_len) / sum(edge_len). Please note that edge lengths vary.
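As a rough illustration of the length-weighted formula in point 3, here is a small Python sketch. The function name `weighted_contig_coverage` and the two (coverage, length) pairs are hypothetical; how edge_len is obtained from the file is not settled here, so the lengths are simply passed in.

```python
def weighted_contig_coverage(edges):
    """Length-weighted mean coverage over (edge_coverage, edge_len) pairs.

    Implements sum(edge_coverage * edge_len) / sum(edge_len).
    """
    edges = list(edges)
    total_len = sum(length for _, length in edges)
    if total_len == 0:
        raise ValueError("total edge length is zero")
    return sum(cov * length for cov, length in edges) / total_len


# Hypothetical edges: a short edge at 40X and a longer edge at 20X.
edges = [(40, 1280), (20, 2560)]
print(weighted_contig_coverage(edges))
```

The weighting matters because the edge lengths differ: the plain mean of 40 and 20 is 30, whereas the weighted value is pulled toward the longer 20X edge.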

cjchen5 commented 5 years ago

Thank you! I notice that within some EDGEs the reads have different lengths. How do I get edge_len for these edges? Is E1_len = E2 - E1, or should I take the average length of all S lines? Or may I directly compute ctg_acg_cov = sum(REG_LENGTH) / contig_len?

ruanjue commented 5 years ago

ctg_acg_cov = sum(REG_LENGTH) / contig_len looks like the better one to calculate.
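A minimal Python sketch of sum(REG_LENGTH) / contig_len, assuming the layout format quoted earlier in this thread (contig length in the `len=` field of the header, REG_LENGTH as the fifth whitespace-separated field of each S line). The function name `contig_coverage` and the sample lines are made up for illustration.

```python
import re

# Header line of the form ">ctg1 nodes=29052 len=29319936"
HEADER = re.compile(r">(\S+) nodes=(\d+) len=(\d+)")


def contig_coverage(lines):
    """Return {contig_name: sum(REG_LENGTH) / contig_len}."""
    contig_len = {}
    reg_sum = {}
    contig = None
    for line in lines:
        match = HEADER.match(line)
        if match:
            contig = match.group(1)
            contig_len[contig] = int(match.group(3))
            reg_sum[contig] = 0
            continue
        fields = line.split()
        # S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> [<REG_SEQ>]
        if contig is not None and len(fields) >= 5 and fields[0] == "S":
            reg_sum[contig] += int(fields[4])
    return {
        name: reg_sum[name] / contig_len[name]
        for name in contig_len
        if contig_len[name] > 0
    }


# Hypothetical contig of 4096 bp with four read regions.
sample = [
    ">ctg1 nodes=3 len=4096",
    "E 0 N1 + N2 +",
    "S read_a + 0 2048",
    "S read_b + 256 1792",
    "E 1792 N2 + N3 -",
    "S read_c - 3584 1792",
    "S read_d + 512 2560",
]
print(contig_coverage(sample))
```

For a real assembly the same function can be given an open file object for the layout file, since it only iterates over lines.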

cjchen5 commented 5 years ago

Cool. Thank you so much!

cjchen5 commented 5 years ago

Hi Jue, here are some lines from my .ctg.lay file.

>ctg1 nodes=29052 len=29319936
E 0 N181398 + N174401 +
S 7ea0e8a2-1efa-4019-b5df-02f4d8477d07 + 0 2048
S 402b6461-4a60-46c8-9cb2-f89013f0f68e + 256 1792
S a1d638b2-01af-4a80-85df-eba51b7a5936 - 3584 1792
E 1792 N174401 + N145000 -

I'm confused about why the length of the first EDGE is 1792 while one of its reads has length 2048. Could you please help explain this? In this case, the two methods we mentioned for calculating the coverage per contig will give different results. Thanks.

ruanjue commented 5 years ago

Besides the high indel rate in sequencing, wtdbg2 uses a kmer-bin-mapping method to align reads, which counts sequence in bins (256 bp). 1792 + 256 = 2048, so those fragments differ by 1 bin. If you look further, you will find larger length differences, due to diploidy or repeats.

That is OK. The coverage should NOT be treated as an exact value.

Jue
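To make the bin arithmetic concrete, the short Python sketch below converts the lengths from the example into 256 bp bins. The helper `to_bins` is made up for illustration, and taking the next E offset (1792) as the edge length follows the question's reading rather than anything confirmed here.

```python
BIN_SIZE = 256  # bin size in bp, as stated in the reply above


def to_bins(length_bp, bin_size=BIN_SIZE):
    """Express a length in whole bins."""
    return length_bp // bin_size


# REG_LENGTH values of the three S lines in the quoted E block.
reg_lengths = [2048, 1792, 1792]
# Assumption: edge length taken as the offset of the next E line.
edge_len = 1792

for length in reg_lengths:
    diff = to_bins(length) - to_bins(edge_len)
    print(length, to_bins(length), diff)
```

The first region spans 8 bins and the other two span 7, so the 2048 bp region differs from the 1792 bp edge by exactly one bin.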

cjchen5 commented 5 years ago

Thank you! So ctg_acg_cov = sum(REG_LENGTH) / contig_len does not seem good for getting the coverage? In this case, ctg_acg_cov = sum(edge_coverage * edge_len) / contig_len is the better one to calculate, right?

ruanjue commented 5 years ago

No, I think sum(reg_len) / ctg_len is easier and better.

cjchen5 commented 5 years ago

But if edge_len is 1792 and the S length is 2048, does that mean 256 bases were not used when it became an EDGE? Or do you mean that, because of the high indel rate in sequencing, diploidy, or repeats, these 256 bases are in the middle of that S when it turns into an EDGE, so that the 256 bases also contribute to the EDGE and we still need to count that bin length when we calculate the coverage? Thanks.

ruanjue commented 5 years ago

When you look at the alignments of long noisy reads, you will find many indels; the same is true within an edge.