paulhager / smart-phase

A comprehensive and intelligent clinical phasing tool
GNU General Public License v3.0

Possible length issue with indels in TSV file #8

Closed mattjmeier closed 3 years ago

mattjmeier commented 3 years ago

Hello,

I wanted to alert you to another potential file-writing issue. I was reading some of the TSVs into R and could not figure out for the life of me why it wasn't happy with the number of columns, as everything looked fine. I think it is another edge case, in which only two columns were written, possibly because the length of an indel took up the whole line? See below for the problematic example. I think it only happened in one of my ~50 individuals.

All the best! Matt

1-1247563-1248563       chr1-1248091-GTGGGCAGCCCTGGGAGGCTGGACTGAGGGAGGCTGGACTTCCCACTCAGGCCTACACGCAGGAAAA-G
1-1378594-1379594       chr1-1378635-C-T        chr1-1378792-C-T        1       1.0

paulhager commented 3 years ago

Hmmmm. The size of the indel shouldn't make a difference. Can you double-check whether the BED file range that variant is in has other variants within it, i.e. whether there are other variants in 1247563-1248563 on chr1? I think this is intended behavior: since there is only one variant in the interval, there are no variant combinations to be made, so the variant is just printed as-is in the output.
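The check suggested above can be sketched in a few lines of Python. This is only an illustration, not SmartPhase code: the helper names (`parse_interval`, `parse_variant`, `variants_in_interval`) are hypothetical, the label layouts are inferred from the two example rows in this thread, and the interval is assumed to behave like a BED range (0-based start, so a 1-based variant position counts as inside when `start < pos <= end`).

```python
def strip_chr(chrom):
    """Normalise 'chr1' and '1' to the same contig name."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def parse_interval(label):
    """Split an interval label such as '1-1247563-1248563'."""
    chrom, start, end = label.split("-")
    return strip_chr(chrom), int(start), int(end)


def parse_variant(label):
    """Split a variant label such as 'chr1-1378635-C-T'.

    Assumes the contig name itself contains no hyphen.
    """
    chrom, pos, ref, alt = label.split("-", 3)
    return strip_chr(chrom), int(pos), ref, alt


def variants_in_interval(interval_label, variant_labels):
    """Return the variant labels whose position falls inside the interval."""
    chrom, start, end = parse_interval(interval_label)
    hits = []
    for label in variant_labels:
        v_chrom, pos, _ref, _alt = parse_variant(label)
        # Assumed BED-style range: 0-based start, 1-based variant position.
        if v_chrom == chrom and start < pos <= end:
            hits.append(label)
    return hits


# Variant labels taken from the two rows posted in this thread.
candidates = [
    "chr1-1248091-GTGGGCAGCCCTGGGAGGCTGGACTGAGGGAGGCTGGACTTCCCACTCAGGCCTACACGCAGGAAAA-G",
    "chr1-1378635-C-T",
    "chr1-1378792-C-T",
]
hits = variants_in_interval("1-1247563-1248563", candidates)
print(len(hits))
```

If only one label comes back for an interval, there is no pair to phase, which would match the two-column row. In practice the candidate list would come from the VCF rather than from the TSV itself.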

paulhager commented 3 years ago

Closing this issue as I assume SmartPhase is working as intended. If I was wrong in my reasoning, feel free to reopen.

mattjmeier commented 3 years ago

Hi Paul, I assume you're correct. I haven't looked into the actual data yet, but I will let you know if I find anything unusual.

For downstream analyses it might be useful to include "NA" or some such value when a column is blank, only because blank columns make it difficult to clean the data afterwards. The reason I was doing this is so I could import the probability score and merge it with the genotype information from the VCF. Another useful alternative would be to have the VCF output include a third tag, such as "phasing probability" or something.
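Until something like that exists, a workaround on the reading side is to pad short rows before importing them. The sketch below assumes the TSV is tab-delimited and that a full row has five fields, as in the second example row above. The width, the `"NA"` filler and the function name `pad_rows` are illustrative choices, not anything SmartPhase defines.

```python
import csv
import io

# Assumption: a complete row has five tab-separated fields,
# matching the full example row posted earlier in this thread.
EXPECTED_COLUMNS = 5


def pad_rows(lines, width=EXPECTED_COLUMNS, fill="NA"):
    """Read tab-separated lines and pad short rows with a filler value."""
    padded = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        # Rows that already have `width` or more fields are left unchanged.
        padded.append(row + [fill] * (width - len(row)))
    return padded


# Shortened stand-in rows; the REF allele here is abbreviated for readability.
sample = io.StringIO(
    "1-1247563-1248563\tchr1-1248091-GTGG-G\n"
    "1-1378594-1379594\tchr1-1378635-C-T\tchr1-1378792-C-T\t1\t1.0\n"
)
rows = pad_rows(sample)
for row in rows:
    print("\t".join(row))
```

Every row then has the same number of fields, so the result can be written back out and read as a regular table.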

paulhager commented 3 years ago

Makes sense! I'll try to implement that this week.