mehrdadbakhtiari / adVNTR

A tool for genotyping Variable Number Tandem Repeats (VNTR) from sequence data
http://advntr.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
41 stars 15 forks source link

Fix sequence slicing bug #58

Closed Jong-hun-Park closed 2 years ago

Jong-hun-Park commented 2 years ago

Description

Bug 1:

without full_length=True option, get_reference_positions() function doesn't output the soft clipped positions making the length of the list different from the read length. That caused the wrong indexing of the read sequence.

Solution:

Use full_length option.

Bug 2.

When deletions occur in the read, the current algorithm may get sequences out of the boundary (+-100 bp flanking regions) because the read_region_start and read_region_end can be the same.

For example, reference: ---|left flanking region|TR region|right flanking region|--- ---a read: ---|DDDDDDDDDDDDDDDDDDDDDD*----------------|---

here, 'D' shows deleted bases against the reference sequence. This read has no sequence covering left flanking and the TR region. Only partially covers the right flanking region. But, this read will be recruited because it has alignment before the left flanking region, and after the right flanking region. In this case, the region_start and region_end will be the same at the index of "*" mark, and the final sequence sliced is seq[*: *+100], which is even out of HMM model boundary.

Solution

Count how many bases (in HMM model) are covered by the read (in the TR region and flanking region), and filter the read if it doesn't have enough (10 bp for each flank) flanking regions. Also, when slicing the read, only get the bp that has been covered.