hammerlab / cohorts

Utilities for analyzing mutations and neoepitopes in patient cohorts
Apache License 2.0

Error using median_vaf_purity on a combined-polyphen-snpeff.vcf #228

Open jburos opened 7 years ago

jburos commented 7 years ago

Getting the following error when trying to estimate median_vaf_purity on a combined/merged VCF file.

/home/jacquelineburos/projects/cohorts/cohorts/functions.py in grab_vaf(variant)
    271     def grab_vaf(variant):
    272         filterable_variant = FilterableVariant(variant, variants, patient)
--> 273         return variant_stats_from_variant(variant, filterable_variant.variant_metadata).tumor_stats.variant_allele_frequency
    274     vafs = [grab_vaf(variant) for variant in variants]
    275     return 2 * pd.Series(vafs).median()

/home/jacquelineburos/projects/cohorts/cohorts/variant_stats.py in variant_stats_from_variant(variant, metadata, merge_fn)
    189             stats = mutect_somatic_variant_stats(variant, variant_metadata)
    190         else:
--> 191             raise ValueError("Cannot parse sample fields, variant file {} is from an unsupported caller.".format(variant_file))
    192         all_stats.append(stats)
    193     return merge_fn(all_stats)

ValueError: ('Cannot parse sample fields, variant file /mnt/disks/data/xxxx-1normals-1tumors--b37decoy/passed_somatic_combined_polyphen-snpeff.vcf is from an unsupported caller.', 'occurred at index 0')
tavinathanson commented 7 years ago

@jburos it currently looks for "mutect" or "strelka" in the filename to figure out how to parse the file. Silly, I know. We need to do better.
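The filename sniffing described here might look something like this (a sketch, not the actual cohorts code; `infer_caller_from_filename` is a hypothetical helper name):

```python
import os

SUPPORTED_CALLERS = ("mutect", "strelka")

def infer_caller_from_filename(path):
    """Guess the variant caller from the VCF filename alone.

    Mirrors the filename-based detection described above; returns None
    for files like 'passed_somatic_combined_polyphen-snpeff.vcf' whose
    names mention no supported caller, which is exactly the case that
    raises the ValueError in this issue.
    """
    name = os.path.basename(path).lower()
    for caller in SUPPORTED_CALLERS:
        if caller in name:
            return caller
    return None
```

This makes the failure mode obvious: any merged or renamed VCF falls through to `None`.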

jburos commented 7 years ago

@tavinathanson yep, i figured it was something like this. just noting for now, it's not urgent atm. seems like having an "other" category & parsing like a generic vcf would be useful -- i assume there is a generic vcf format?

tavinathanson commented 7 years ago

@jburos my understanding is that we can't rely on a standard format for things like depth and VAF, hence needing to understand the caller in order to grab them. But we should be able to infer the format.
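Inferring the format from the file contents rather than the filename could be sketched as below, assuming the caller leaves a recognizable signature somewhere in the `##` meta-information lines (e.g. a `##source=strelka` line, or a MuTect command line echoed into the header); the exact signatures would need to be confirmed against real files:

```python
def infer_caller_from_header(vcf_lines):
    """Sketch of header-based caller detection (not the cohorts implementation).

    Scans the '##' meta-information lines for a caller signature and
    stops at the first non-meta line, so only the header is read.
    """
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # meta-information block is over; no signature found
        meta = line.lower()
        if "strelka" in meta:
            return "strelka"
        if "mutect" in meta:
            return "mutect"
    return None
```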

armish commented 7 years ago

Here is the backstory on this one: Strelka's VCF implementation is not complete, and VCF validators fail its files because of the weird choice of fields/annotations it uses in the VCF. The funny thing is that its developers are aware of this, and all of Biostars has been complaining about it, but there is still no change/fix.

The main problem with Strelka in terms of the VAFs is that Strelka doesn't output a VAF field, since the number of reads supporting each allele comes from simulations. This is especially true for the indels, and although there is a way to get a VAF-like number out for these variants, it is often not in line with the other tools' estimates. The only exception is the variants called by both callers, where we do see almost exactly the same VAF across tools (search the bladder repo for the keyword Strelka; there is a notebook investigating this issue). So that is why the VAF extraction fails with Strelka variants, since there is no VAF in those VCFs :(
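The "VAF-like number" mentioned above can be illustrated with a small sketch. It assumes Strelka's somatic SNV FORMAT fields `AU`/`CU`/`GU`/`TU` (each a `tier1,tier2` pair of read counts per base) and computes `alt / (ref + alt)` over the tier-1 counts; this is an illustration, not necessarily how any given pipeline does it:

```python
def strelka_snv_vaf(ref, alt, sample_fields):
    """Approximate a VAF for a Strelka somatic SNV from tier-1 counts.

    sample_fields maps FORMAT keys like 'AU' to [tier1, tier2] counts.
    As noted above, the result can disagree with other callers'
    estimates, so treat it as a rough number rather than a true VAF.
    """
    ref_tier1 = sample_fields[ref + "U"][0]
    alt_tier1 = sample_fields[alt + "U"][0]
    depth = ref_tier1 + alt_tier1
    return alt_tier1 / depth if depth > 0 else None
```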

Furthermore, MuTect does a terrible job of naming the variant columns, and I found that the only way to tell the tumor sample from the normal one is to look within the header. This is very MuTect-specific and therefore fails on non-MuTect/GATK files.
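For illustration, "look within the header" could mean something like the following. The `##SAMPLE=<ID=TUMOR,SampleName=...>` form used here is hypothetical; real MuTect headers would need to be inspected to see which meta line actually carries the tumor/normal mapping:

```python
import re

def tumor_sample_name(header_lines):
    """Illustrative header scan to find the tumor sample's column name.

    Matches a hypothetical '##SAMPLE=<ID=TUMOR,SampleName=...>' meta
    line and returns the sample name, or None if no such line exists.
    """
    pattern = re.compile(r"##SAMPLE=<ID=TUMOR,SampleName=([^,>]+)")
    for line in header_lines:
        match = pattern.match(line)
        if match:
            return match.group(1)
    return None
```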

Jacki, if you are doing lots of VCF reads, you might want to drop all "REJECT"ed variants from the MuTect file to speed up the process. PyVCF is just terrible with huge VCFs, and I assume it is taking quite a lot of time to process the MuTect ones. Just FYI.
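The pre-filter suggested above can be done with plain line processing before pyvcf ever sees the file; a minimal sketch (column 7, index 6, of a VCF body line is FILTER):

```python
def drop_rejected_records(vcf_lines):
    """Keep header lines and any record whose FILTER column is not REJECT.

    A cheap pre-filter so the VCF parser never has to parse the
    rejected records; header lines (starting with '#') pass through
    untouched.
    """
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.split("\t")
        if fields[6] != "REJECT":
            kept.append(line)
    return kept
```

The same logic works streaming line-by-line over a large file instead of a list.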

jburos commented 7 years ago

Yikes - @armish thanks for this backstory. It all sounds pretty horrifying. Seems like everyone would benefit from a "format normalization" tool that sanitizes/fixes VCFs into a sane format, and perhaps addresses some of these issues along the way.

On a more practical note, what is the best way for me to confirm that I am processing the VCF files correctly? It makes me a little nervous to think we might be accidentally misreading them.

armish commented 7 years ago

@jburos: unfortunately there is no silver bullet here: pyvcf is pretty robust and is the de facto library for the field, so I trust it; but unless you make sure you have the cython libraries installed up-front, its default Python-based implementation can become too slow for large VCFs (one solution is to pre-filter VCFs down to non-REJECTed records). There is a relatively new variation of pyvcf called cyvcf which is blazing fast and builds on top of pyvcf methods. It is great for reading a large VCF file if performance is an issue, but extracting variant annotations is a bit harder/less intuitive compared to pyvcf.

If you are interested in ETLing the annotations, then either pyvcf or cyvcf is the way to go. If you are only interested in the variants themselves, then varcode.load_vcf does a decent job by utilizing pandas' table reader (which speeds things up).

But as a rule of thumb: I suggest always checking for None values when getting a variant annotation and also never assuming that the VCF is uniform.
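That rule of thumb can be captured in a tiny helper (an illustrative sketch; `first_present` is a made-up name):

```python
def first_present(info, keys, default=None):
    """Return the first non-None annotation among several candidate keys.

    Merged VCFs are rarely uniform, so e.g. read depth might live under
    'DP' in one record and be absent in the next; callers should always
    be prepared to handle the default value.
    """
    for key in keys:
        value = info.get(key)
        if value is not None:
            return value
    return default
```

Wrapping every annotation lookup this way makes the None-checking habit the default rather than something each call site has to remember.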