tfwillems / HipSTR

Genotype and phase short tandem repeats using Illumina whole-genome sequencing data
GNU General Public License v2.0
94 stars 31 forks source link

Question: FLANK_ASSEMBLY_CYCLIC #38

Closed nh13 closed 6 years ago

nh13 commented 7 years ago

First of all, thank-you for all your help with all my questions and issues.

Is there any place (ex. wiki) that defines the various filters (ex. FLANK_ASSEMBLY_CYCLIC)? In my specific example, HipSTR is preferentially filtering all samples with a specific allele, causing allelic dropout. I'd be happy to share a test case if that helps.

tfwillems commented 7 years ago

@nh13

Unfortunately not at the moment. I need to update the README on the main page to explain the following filters:

  1. FLANK_ASSEMBLY_CYCLIC
  2. FLANK_ASSEMBLY_INDEL
  3. FLANK_INDEL_FRAC
  4. NO_READS

For background about FLANK_ASSEMBLY_CYCLIC, HipSTR first aligns reads to all candidate STR haplotypes assuming that the sequences to the left/right of the repeat (i.e. the flanks) match the reference genome. It then uses these initial alignments to i) extract the sequences aligned to the flanks and ii) build separate De Bruijn graphs for the left/right flanks to identify potential polymorphisms there. If any candidate polymorphisms are found, it realigns reads to candidate haplotypes that incorporate these flanking polymorphisms. This may seem rather convoluted compared to just building De Bruijn graphs right away, but we chose to avoid this approach as the long repeats are problematic for assembly.

When assembling the flanks, we iteratively increase the kmer from 10-15 until no cycles are present in the assembly graph. FLANK_ASSEMBLY_CYCLIC filtering arises if at a kmer length of 15, the graph still contains a cycle.

This filter typically arises if 1) there really is a long repeat right next to the STR or 2) a read with a long expansion failed the initial alignment and the repeat was incorrectly positioned in the flank. This repeat in turn induced a cycle in the flank' assembly graph.

It seems like either of these may be responsible in your case. Happy to take a look if you can send me some example data

nh13 commented 7 years ago

Thank-you for the clear explanation; I have sent you an interesting example.

Another idea that came to mind was if HipSTR could annotate which filter was applied for each sample, so I can easily associate which filter was applied (ex. FLANK_ASSEMBLY_CYCLIC) to which sample, for example in the FORMAT field. There could even be a --debug flag in case you don't want to bloat the VCF. I find knowing when any filters are applied warrants a deeper look on my data.

tfwillems commented 7 years ago

@nh13 I've added a command line option --output-filters that adds a FILTER FORMAT field to each individual call. For filtered genotypes, it'll indicate the reason, while unfiltered genotypes will be labeled with PASS. You can find the relevant changes in commit 6719ed809ab762579a79f54a8431fa3c16580b6e

tfwillems commented 6 years ago

I finally got around to adding a description of these filters in commit ca06f553d8cdf9ce0082ed1be7762f15755e3512