lbcb-sci / raven

De novo genome assembler for long uncorrected reads
MIT License

Shorter segments than overlaps in GFA #45

Closed by AntoineHo 3 years ago

AntoineHo commented 3 years ago

Hello,

While looking at the GFA output of raven, I noticed that some links have longer overlaps than the segment size allows. For instance:

S  ch227_read12486_template_pass_FAH31515     LN:i:15968       RC:i:1
S  ch96_read20376_template_pass_FAH42885      LN:i:5840        RC:i:1
L  ch227_read12486_template_pass_FAH31515     ch96_read20376_template_pass_FAH42885       -       6422M

Is this expected?

Cheers

rvaser commented 3 years ago

Hi, this is probably due to indels, as the overlap is calculated from minimizers rather than from an alignment. You can send me the raven.cereal file so we can be sure.

Best regards, Robert

AntoineHo commented 3 years ago

Thank you for your quick reply, here is the file: https://we.tl/t-o5kc0TBYQJ

Best regards, Antoine

rvaser commented 3 years ago

Seems alright to me: the edge pair has an overlap of length 5816.

AntoineHo commented 3 years ago

OK, so the overlap reported in the GFA is overestimated. I wanted to merge some paths in the graph, but I will simply ignore the links whose reported overlap is longer than one of the two segments.
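
For anyone who wants to do the same filtering, here is a minimal sketch of how such links could be flagged. It assumes a standard tab-separated GFA1 file (S lines with a sequence column, which may be `*`, and L lines with both orientations and a single `<n>M` overlap), which is an assumption on my side and not something confirmed in this thread. The function names `segment_lengths` and `suspicious_links` are hypothetical and not part of raven.

```python
import re


def segment_lengths(gfa_lines):
    """Map segment name -> length, taken from the LN:i: tag or the sequence."""
    lengths = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "S" or len(fields) < 3:
            continue
        name, seq = fields[1], fields[2]
        # Prefer the explicit LN:i: tag; fall back to the sequence length.
        ln = next((int(t[5:]) for t in fields[3:] if t.startswith("LN:i:")), None)
        if ln is None and seq != "*":
            ln = len(seq)
        if ln is not None:
            lengths[name] = ln
    return lengths


def suspicious_links(gfa_lines):
    """Return (from, to, overlap) for links whose overlap exceeds a segment."""
    lengths = segment_lengths(gfa_lines)
    flagged = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "L" or len(fields) < 6:
            continue
        src, dst, cigar = fields[1], fields[3], fields[5]
        # Assumption: the overlap is a single "<n>M" operation.
        m = re.fullmatch(r"(\d+)M", cigar)
        if m is None or src not in lengths or dst not in lengths:
            continue
        overlap = int(m.group(1))
        if overlap > min(lengths[src], lengths[dst]):
            flagged.append((src, dst, overlap))
    return flagged
```

With the lengths from my example (15968 and 5840) and an overlap of 6422M, the link is flagged because 6422 is larger than 5840.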

rvaser commented 3 years ago

The safest thing to do is to find suffix-prefix overlaps for the given links and join accordingly.
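
As a rough illustration of what that could look like, below is a minimal sketch of a suffix-prefix (overlap) alignment using plain dynamic programming. This is not how raven computes overlaps. The scoring values and the names `suffix_prefix_overlap` and `join` are assumptions chosen for the example. It runs in quadratic time, so for real segments a dedicated aligner would likely be the better choice. Links with a reverse orientation would also need the segment reverse-complemented first, which is not handled here.

```python
def suffix_prefix_overlap(a, b, match=1, mismatch=-1, gap=-1):
    """Best-scoring alignment of a suffix of `a` with a prefix of `b`.

    Returns (score, a_overlap_len, b_overlap_len).
    """
    n, m = len(a), len(b)
    # Row 0: prefix of b aligned against nothing, starting at a[0].
    prev_score = [j * gap for j in range(m + 1)]
    prev_start = [0] * (m + 1)
    for i in range(1, n + 1):
        # Column 0 is free: the overlap may start at any position in a.
        cur_score = [0] * (m + 1)
        cur_start = [i] * (m + 1)
        for j in range(1, m + 1):
            diag = prev_score[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev_score[j] + gap        # a[i-1] aligned to a gap
            left = cur_score[j - 1] + gap   # b[j-1] aligned to a gap
            best = max(diag, up, left)
            cur_score[j] = best
            # Carry along where in `a` the best alignment started.
            if best == diag:
                cur_start[j] = prev_start[j - 1]
            elif best == up:
                cur_start[j] = prev_start[j]
            else:
                cur_start[j] = cur_start[j - 1]
        prev_score, prev_start = cur_score, cur_start
    # The alignment must reach the end of a; it may end anywhere in b.
    j_best = max(range(m + 1), key=lambda j: prev_score[j])
    return prev_score[j_best], n - prev_start[j_best], j_best


def join(a, b):
    """Concatenate a and b, dropping the overlapping prefix of b."""
    _, _, b_len = suffix_prefix_overlap(a, b)
    return a + b[b_len:]
```

For example, `join("ACGTTGCA", "TGCAGGT")` finds the shared `TGCA` and returns `ACGTTGCAGGT`. Because of indels, the overlap length on the first segment and on the second segment can differ, which is why the function returns both.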

AntoineHo commented 3 years ago

OK, thanks. I will realign the segments of the links where the reported overlap is larger than the segment.