lbcb-sci / raven

De novo genome assembler for long uncorrected reads
MIT License
202 stars 21 forks source link

modify FindValidRegion to handle edge case #43

Closed AustinHartman closed 3 years ago

AustinHartman commented 3 years ago

Hi,

I noticed the function FindValidRegions seems to not pick up the longest valid region if the region continues up to the final position in data. For example, if data = [0, 5, 6, 5, 3, 5, 6, 6, 5] and coverage = 4, the current function would set begin=1 and end=4 whereas this proposed change would set begin=5 and end=9.

I ran the same E. coli data on a baseline build and a build using the modification here, and didn't observe any changes in the final assembly, but perhaps there are datasets where this edge case would apply.

Let me know what you think.

rvaser commented 3 years ago

Hi, it is not an edge case, but the intended implementation. Pile::FindValidRegion searches for the longest region with coverage above 4.

Best regards, Robert

AustinHartman commented 3 years ago

Thanks for clarifying the intended implementation. Under the current implementation I don't think Pile::FindValidRegion finds the longest region with coverage above 4 when the longest region includes the last value in data_. Is this a bug?

rvaser commented 3 years ago

In the example above, the reason why end!=9 is because of the value 3 at position 4. The first and last value in data_ should always equal to 0 (I am +-1 begin/end of each overlap at https://github.com/lbcb-sci/raven/blob/master/src/pile.cpp#L42-L43).

AustinHartman commented 3 years ago

I see - thanks for pointing me to those lines. Still, I don't understand how the implementation would find the longest region in an example such as data=[0, 5, 0, 5, 5, 5, 5] with coverage=4. I would expected (begin, end) to be (3, 7) (or (3, 6) if the final position should always equal 0). It seems the current implementation would return (1, 2) since the region must be looped over completely before updating begin and end.

rvaser commented 3 years ago

data_ will always look like [0, x, x, x, ..., x, x, 0]. Your example would be [0, 5, 0, 5, 5, 5, 5, 0], and begin,end would equal to (3,7). In the case where the last value would be bigger than 4, the implementation would not properly find the last end. But as I stated, the array will always begin with 0 and end with 0.

AustinHartman commented 3 years ago

Ah I see now, thank you for taking the time to explain

rvaser commented 3 years ago

No worries :)