griffithlab / regtools

Integrate DNA-seq and RNA-seq data to identify mutations that are associated with regulatory effects on gene expression.
https://regtools.readthedocs.org
MIT License

cse identify not giving correct indel ranges #83

Open yang-yangfeng opened 7 years ago

yang-yangfeng commented 7 years ago

It seems like cse identify is having trouble inferring the actual length of indels from the vcf, and is just assuming all variants are snvs.

For example:

clinseq_7$ pwd
/gscmnt/gc2602/griffithlab/regtools/yafeng/hcc1395/clinseq_7
clinseq_7$ grep -R -i -n "13408142" hcc1395_filtered.vcf
25161:6 13408142 . CCAA . . PASS .
clinseq_7$ grep -R -i -n '1340814' ../output/cse_identify_filtered.tsv
184:6 13365894 13408142 JUNC00000183 1 - GT-AG 1 0 0 DA 1 1 1 GFOD1 ENST00000379284,ENST00000379287 6:13408141-13408142

Another example:

clinseq_7$ grep -R -i -n '45438295' hcc1395_filtered.vcf
494:1 45438295 . . ACAC . PASS .
clinseq_7$ grep -R -i -n '45438295' ../output/cse_identify_filtered_i50e5.tsv
85:1 45435716 45443987 JUNC00000084 36 - GT-AG 2 1 2 DA 1 1 1 EIF2B3 ENST00000360403,ENST00000372182,ENST00000372183,ENST00000477953,ENST00000480675,ENST00000487532,ENST00000497010 1:45438294-45438295
86:1 45438246 45443987 JUNC00000085 2 - GT-AG 1 0 1 DA 1 1 1 EIF2B3 ENST00000360403,ENST00000372182,ENST00000372183,ENST00000477953,ENST00000480675,ENST00000487532,ENST00000497010 1:45438294-45438295

It seems like this is happening consistently: every variant listed in the last column of the cse identify output tsvs is reported as an snv (i.e. stop - start = 1), and the same holds for the variant bed files.

clinseq_7$ awk '{ $4 = $3 - $2 } 1' ../output/variants_filtered_E.bed | awk '{print $4}' | sort | uniq
1
clinseq_7$ awk '{ $4 = $3 - $2 } 1' ../output/variants_filtered_i50e5.bed | awk '{print $4}' | sort | uniq
1

Could definitely be related to the vcf since we were having troubles with it before. Investigating.

malachig commented 7 years ago

So there are two related problems here:

Currently regtools only takes into account the first two columns (chr, start) when reading VCF files. Because the length of the variant is ignored, large deletions might be discounted when only their start position is checked.

Since only the start position is being considered, insertions and deletions are misrepresented as SNVs in the regtools output. This creates confusion when comparing back to the input VCF.
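
To make the two problems concrete, here is a rough Python sketch of how a variant's reference span could be derived from POS, REF and ALT rather than from POS alone. This is not how regtools implements it; the function names (`allele_length`, `variant_span`, `classify`, `overlaps`) are hypothetical, and treating a "." allele as empty is an assumption based on how the VCF records quoted above appear.

```python
from typing import Tuple


def allele_length(allele: str) -> int:
    """Length of an allele; '.' or an empty string is assumed to mean 'no bases'."""
    return 0 if allele in (".", "") else len(allele)


def variant_span(pos: int, ref: str, alt: str) -> Tuple[int, int]:
    """Return a 0-based, half-open (start, end) interval on the reference.

    pos is the 1-based VCF POS. The span covers every reference base in REF,
    and at least one base so that insertions keep their anchor position.
    """
    start = pos - 1
    end = start + max(allele_length(ref), 1)
    return start, end


def classify(ref: str, alt: str) -> str:
    """Label a variant by comparing REF and ALT lengths."""
    ref_len, alt_len = allele_length(ref), allele_length(alt)
    if ref_len == 1 and alt_len == 1:
        return "snv"
    if ref_len > alt_len:
        return "deletion"
    if alt_len > ref_len:
        return "insertion"
    return "other"


def overlaps(span: Tuple[int, int], region_start: int, region_end: int) -> bool:
    """True if the variant span intersects the half-open region [region_start, region_end)."""
    return span[0] < region_end and region_start < span[1]


# Records taken from the grep output above, with '.' treated as an empty allele.
print(variant_span(13408142, "CCAA", "."), classify("CCAA", "."))
print(variant_span(45438295, ".", "ACAC"), classify(".", "ACAC"))

# Hypothetical 50 bp deletion starting upstream of a region of interest:
# a start-only check misses it, a span-based check does not.
deletion = variant_span(100, "A" * 50, "A")
print(overlaps(deletion, 120, 130))
```

With a span like this, the deletion at 6:13408142 would be reported as four bases wide instead of one, and a deletion that starts outside a window but extends into it would still be counted. An insertion still occupies a single reference position, so its inserted length would need to be reported separately.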