Closed ptranvan closed 5 years ago
Please find the edge coverage by counting the reads in each E block of the <prefix>.ctg.lay file.
Hi Jue, do you mean that, for example, for 'E 0 N162023 - N164002 -' I need to count how many 'S' lines are under it before the next 'E' (e.g., 'E 1280 N164002 - N170250 -')? If there are 40 'S' lines between 'E 0 N162023 - N164002 -' and 'E 1280 N164002 - N170250 -', does that mean the coverage of 'E 0 N162023 - N164002 -' is 40X? Thanks.
Yes, it is 40X.
E starts an EDGE block, which contains many S lines; the read coverage of the edge is the number of S lines in its block.
Thank you!
Hi Jue, could you tell me what '365f4a72-3bc1-4374-a99a-70bde482212c' and '15872' mean in 'S 365f4a72-3bc1-4374-a99a-70bde482212c - 15872 2304'? And after getting the coverage of each edge as you describe, what should I do to get the coverage of each contig? Take the average of the edge coverages within a contig? Thanks again.
Please have a look at https://github.com/ruanjue/wtdbg2/blob/master/README-ori.md#output
>ctg(\d+) nodes=(\d+) len=(\d+)
E <OFFSET> <NODE1> <STRAND1> <NODE2> <STRAND2>
S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> <REG_SEQ>
S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> <REG_SEQ>
S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> <REG_SEQ>
...
E ...
...
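The counting described earlier in the thread (one E block, count its S lines) can be sketched as below. This is only an illustration, not part of wtdbg2: the function name `edge_read_counts` is made up here, and the parser assumes whitespace-separated fields and a header line that starts with `>` as in the layout above.

```python
def edge_read_counts(lay_lines):
    """Count S lines per E block.

    Returns {contig_name: [(edge_offset, n_reads), ...]}.
    Assumes the layout shown above: a '>' header, then E blocks with S lines.
    """
    counts = {}
    contig = None
    for line in lay_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith(">"):
            # New contig header, e.g. ">ctg1 nodes=... len=..."
            contig = fields[0][1:]
            counts[contig] = []
        elif fields[0] == "E" and contig is not None:
            # New edge block: remember its offset, start with zero reads
            counts[contig].append([int(fields[1]), 0])
        elif fields[0] == "S" and contig is not None and counts[contig]:
            # A read segment belongs to the most recent E block
            counts[contig][-1][1] += 1
    return {c: [tuple(e) for e in edges] for c, edges in counts.items()}


sample = [
    ">ctg1 nodes=29052 len=29319936",
    "E 0 N181398 + N174401 +",
    "S 7ea0e8a2-1efa-4019-b5df-02f4d8477d07 + 0 2048",
    "S 402b6461-4a60-46c8-9cb2-f89013f0f68e + 256 1792",
    "S a1d638b2-01af-4a80-85df-eba51b7a5936 - 3584 1792",
    "E 1792 N174401 + N145000 -",
]
print(edge_read_counts(sample))
```

For a real file, the lines could be streamed with `open(path)` instead of the in-memory `sample` list.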
Thank you! Could you please tell me whether my understanding is right? 1. Is nodes the number of reads in this contig, and is each EDGE one overlap in the contig?
1. NODE and EDGE are defined in FBG (see https://www.biorxiv.org/content/10.1101/530972v1). Or you can simply think of them as in a DBG.
2. For the NODE, 0 is its estimated offset in that contig; for REG_OFFSET, 0 is the offset on that read.
3. ctg_acg_cov = sum(edge_coverage * edge_len) / sum(edge_len). Please note that the edge lengths vary.
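A minimal sketch of the length-weighted formula in point 3, assuming the per-edge coverage and length have already been extracted. The function name and the example numbers are hypothetical, chosen only to show why the weighting matters.

```python
def weighted_contig_coverage(edges):
    """Length-weighted mean coverage.

    edges: iterable of (edge_coverage, edge_len) pairs.
    Implements sum(edge_coverage * edge_len) / sum(edge_len).
    """
    edges = list(edges)  # allow generators, since we iterate twice
    total_len = sum(length for _, length in edges)
    if total_len == 0:
        return 0.0
    return sum(cov * length for cov, length in edges) / total_len


# Hypothetical edges: a short edge at 40X and a longer edge at 10X.
example_edges = [(40, 1280), (10, 2560)]
print(weighted_contig_coverage(example_edges))  # 20.0
```

A plain average of the two edge coverages would give 25, while the weighted value is 20, because the longer edge counts for more.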
Thank you!
I notice that some EDGEs have reads of different lengths. How do I get edge_len for these edges? E1_len=E2
ctg_acg_cov = sum(REG_LENGTH) / contig_len looks like a better way to calculate it.
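As a rough illustration of sum(REG_LENGTH) / contig_len, the sketch below sums the fifth field of every S line and divides by the `len=` value from the contig header. The function name `contig_coverage` is made up, and the tiny contig in the example is hypothetical (its `len` is chosen so the result is a round number).

```python
def contig_coverage(lay_lines):
    """Return {contig_name: sum(REG_LENGTH) / contig_len}.

    Assumes '>' headers carrying a len=<int> attribute, and S lines laid out as
    S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> [<REG_SEQ>].
    """
    lengths = {}
    reg_sums = {}
    contig = None
    for line in lay_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith(">"):
            contig = fields[0][1:]
            # Parse key=value attributes such as nodes=... and len=...
            attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
            lengths[contig] = int(attrs["len"])
            reg_sums[contig] = 0
        elif fields[0] == "S" and contig is not None:
            reg_sums[contig] += int(fields[4])  # REG_LENGTH
    return {c: reg_sums[c] / lengths[c] for c in lengths if lengths[c] > 0}


# Hypothetical miniature contig: 2048 + 1792 + 1792 = 5632 bp of read segments.
toy = [
    ">ctgX nodes=2 len=2816",
    "E 0 N1 + N2 +",
    "S readA + 0 2048",
    "S readB + 256 1792",
    "S readC - 3584 1792",
]
print(contig_coverage(toy))  # {'ctgX': 2.0}
```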
Cool. Thank you so much!
Hi Jue,
Here are some lines from my .ctg.lay file:
ctg1 nodes=29052 len=29319936
E 0 N181398 + N174401 +
S 7ea0e8a2-1efa-4019-b5df-02f4d8477d07 + 0 2048
S 402b6461-4a60-46c8-9cb2-f89013f0f68e + 256 1792
S a1d638b2-01af-4a80-85df-eba51b7a5936 - 3584 1792
E 1792 N174401 + N145000 -
I'm confused about why the length of the first EDGE is 1792 when one of the reads has length 2048. Could you please help explain this? In this case, the two methods we discussed for calculating the coverage per contig will give different results. Thanks.
Besides the high in/del rate in sequencing, wtdbg2 uses a kmer-bin-mapping method to align reads, which counts sequence in bins (256 bp). 1792 + 256 = 2048, so those fragments differ by 1 bin. If you look further, you will find larger length differences, due to diploidy or repeats.
That is OK. The coverage should NOT be treated as an exact value.
Jue
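The bin arithmetic in this reply can be checked with a few lines. This is only a sketch: the helper name `to_bins` is made up, and the 256 bp bin size is taken from the reply above.

```python
BIN_SIZE = 256  # bin size in bp, as stated in the reply above


def to_bins(length_bp):
    """Convert a length in bp to a whole number of bins."""
    bins, remainder = divmod(length_bp, BIN_SIZE)
    if remainder:
        raise ValueError(f"{length_bp} is not a multiple of {BIN_SIZE}")
    return bins


edge_bins = to_bins(1792)  # edge length from the example
read_bins = to_bins(2048)  # REG_LENGTH of the first S line
print(edge_bins, read_bins, read_bins - edge_bins)  # 7 8 1
```

So the 2048 bp segment spans 8 bins against the edge's 7, a difference of one bin.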
Thank you! So ctg_acg_cov = sum(REG_LENGTH) / contig_len does not seem like a good way to get the coverage? In this case, ctg_acg_cov = sum(edge_coverage * edge_len) / contig_len is the better calculation, right?
No, I think sum(reg_len) / ctg_len is easier and better.
But if edge_len is 1792 and the S length is 2048, does that mean 256 bases were not used when it became an EDGE? Or do you mean that, because of the high in/del rate in sequencing, diploidy, or repeats, these 256 bases sit in the middle of that S when it is turned into an EDGE, so they also contribute to the EDGE and we still need to count that bin length when we calculate the coverage? Thanks.
When you look at alignments of long noisy reads, you will find many in/dels; the same is true within an edge.
I am looking for coverage information for each contig. Do you display it somewhere?