tonydisera / gene.iobio

An iobio app for examining gene variants

Big differences in called vs loaded variants #537

Open AlistairNWard opened 7 years ago

AlistairNWard commented 7 years ago

Using the files from Brandi, there are significant differences in BRAF between the variants that are loaded and the variants that are called. In particular, the loaded variants include a high impact, ClinVar pathogenic stop gained SNP at chr7:140794409 with 114 alternate observations at a depth of 599, which is missed when only the BAM is supplied.

https://s3.amazonaws.com/iobio/samples/test_files/Horizon-HD728_Panel1385_Hyb11_HFW2KBBXX.bam

https://s3.amazonaws.com/iobio/samples/test_files/Horizon-HD728_Panel1385_Hyb11_HFW2KBBXX.union.vcf.gz

tonydisera commented 7 years ago

This is concerning. Al, do you think that freebayes isn't calling these variants due to a missing parameter or do you suspect that something at the app level is somehow omitting these variants?

Also, I guess with the way we show 'called variants' now, it isn't obvious which variants are missed by Freebayes. The 'delta' of loaded vs called variants probably needs a better visualization.

AlistairNWard commented 7 years ago

I'm not totally sure, but I was going to try calling these on the command line and just confirm that Freebayes misses them. If that is the case, I can try to drill down into why it misses them and whether I can recover them with a change of parameters. At first glance, the amount of evidence seems to be sufficient to call the variant, so it would seem strange to me that we would need to alter parameters to find them.

I think we have discussed the idea of showing which of the loaded variants are not recalled by Freebayes as a way of calling variants into question, but I'm not sure it ever made it as an issue. I'll take a look, and keep you posted on the calling.
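To make the "delta" idea concrete, here is a minimal sketch of how the loaded variants that Freebayes does not recall could be listed. It assumes variants are compared on chromosome, position, ref and alt; the `VariantKey` type and `missed_by_caller` function are hypothetical names for illustration, not part of gene.iobio, and the positions in the example are toy values.

```python
from typing import Iterable, List, NamedTuple


class VariantKey(NamedTuple):
    """Hypothetical identity of a variant for comparison purposes."""
    chrom: str
    pos: int
    ref: str
    alt: str


def missed_by_caller(loaded: Iterable[VariantKey],
                     called: Iterable[VariantKey]) -> List[VariantKey]:
    """Return loaded variants that are absent from the called set, sorted."""
    return sorted(set(loaded) - set(called))


# Toy example: two loaded variants, only one of which is recalled.
loaded = [VariantKey("chr1", 100, "G", "A"), VariantKey("chr1", 250, "C", "T")]
called = [VariantKey("chr1", 250, "C", "T")]
missed = missed_by_caller(loaded, called)
```

A comparison like this only works if both sets represent alleles the same way, so multiallelic sites and differently normalised indels would need to be handled before the set difference is taken.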

AlistairNWard commented 7 years ago

The reason the ClinVar variant is missed is that, by default, Freebayes uses -F 0.2, which means that the alternate fraction must be at least 0.2 (i.e. at least 20% of the reads show the alternate allele). We may need to distinguish between cancer/non-cancer samples, as this value will be unreasonable when calling in a tumour sample with normal cell contamination. In this case, the mutation has ~60 alternate observations in ~600 reads. But, even for non-cancer samples, we are looking for variants that may be in the weeds, so we should put in -F 0.02 or something, but this may need to be depth dependent. Alternatively, we could set -F 0 and up the minimum number of observed alleles to 2 or 3. Obviously, as we go down this path, we are going to pick up a lot of errors, but maybe we start using the filters proactively: even though there will be a lot of calls, we work through the filter sets looking for rare, high impact variants and then dig into them.
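As a rough illustration of the arithmetic, the sketch below applies the two thresholds discussed here to the read counts quoted in this thread. It is not Freebayes code: `passes_alt_thresholds` is a hypothetical helper, the "at least" comparison is an assumption about how the cutoffs behave, and the count threshold is assumed to correspond to the -C (minimum alternate count) option.

```python
def passes_alt_thresholds(alt_obs: int, depth: int,
                          min_alt_fraction: float,
                          min_alt_count: int) -> bool:
    """Check an alternate allele against a fraction cutoff and a count cutoff.

    Assumes both cutoffs are inclusive ("at least"); this is a simplification
    of what a real caller evaluates before considering a site.
    """
    if depth <= 0:
        return False
    return alt_obs >= min_alt_count and alt_obs / depth >= min_alt_fraction


# Counts quoted in the thread: 114 of 599, and roughly 60 of 600.
at_default_114 = passes_alt_thresholds(114, 599, 0.2, 2)   # fraction is about 0.19
at_default_60 = passes_alt_thresholds(60, 600, 0.2, 2)     # fraction is 0.10
relaxed_60 = passes_alt_thresholds(60, 600, 0.02, 2)
count_only_60 = passes_alt_thresholds(60, 600, 0.0, 3)
single_read = passes_alt_thresholds(1, 600, 0.0, 3)        # likely sequencing error
```

With the 0.2 cutoff both sets of counts fail, while either relaxed setting keeps the variant. The last case shows why a count floor still matters when the fraction cutoff is dropped to 0.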

Problem 2. This variant is multiallelic, so even if we drop -F down, we still won't annotate it correctly. We probably need to address multiallelics, and I will look into this a bit more.
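One common way to deal with multiallelic sites is to split them into one record per alternate allele before annotation. The sketch below shows the idea on a sites-only VCF line. It is a simplification under stated assumptions: `split_multiallelic` is a hypothetical helper, the set of per-allele INFO keys is passed in by the caller rather than read from the header, genotype columns are not rewritten, and the example line uses toy values. Established tools such as `bcftools norm -m` handle the full problem.

```python
from typing import List, Sequence


def split_multiallelic(vcf_line: str,
                       per_allele_keys: Sequence[str] = ("AO", "AF")) -> List[str]:
    """Split one VCF data line into one line per ALT allele.

    INFO keys listed in per_allele_keys are assumed to hold one value per
    ALT allele and are split accordingly; all other INFO items are copied.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    if len(alts) == 1:
        return ["\t".join(fields)]  # already biallelic, nothing to do

    info_items = fields[7].split(";")
    records = []
    for i, alt in enumerate(alts):
        new_info = []
        for item in info_items:
            key, sep, value = item.partition("=")
            values = value.split(",")
            if sep and key in per_allele_keys and len(values) == len(alts):
                new_info.append(f"{key}={values[i]}")  # keep this allele's value
            else:
                new_info.append(item)
        out = fields[:4] + [alt] + fields[5:7] + [";".join(new_info)] + fields[8:]
        records.append("\t".join(out))
    return records


# Toy multiallelic record (values are illustrative only).
line = "chr1\t100\t.\tG\tA,T\t50\t.\tDP=600;AO=60,54"
split = split_multiallelic(line)
```

After splitting, each record carries a single alt allele, so an annotator that expects biallelic input can score each allele separately.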