sigven / pcgr

Personal Cancer Genome Reporter (PCGR)
https://sigven.github.io/pcgr
MIT License

ALT allele "-" in civic.vcf.gz #179

Status: Closed (ddrichel closed this issue 2 years ago)

ddrichel commented 2 years ago

In the data bundle pcgr.databundle.grch38.20210627.tgz, in data/grch38/civic/civic.vcf.gz there is an entry with ALT allele "-":

4 54728011 455594177-_- C - . PASS CIVIC_ID=EID2475

I understand that full compliance with the VCF specification might be outside the scope of the project, but this looks like it might point to a bigger underlying issue. This is the only non-alphanumeric allele I found in the VCFs in the whole data bundle.
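For readers who want to repeat this kind of scan, here is a minimal Python sketch. It is not part of PCGR, and the names `PLAIN_ALLELE`, `find_suspicious_alt` and `scan_vcf` are made up for illustration. It assumes tab-separated records and treats an allele as "plain" only if it consists of the letters A, C, G, T or N, so it is stricter than the VCF specification and will also report valid entries such as `*`, `.` or symbolic alleles like `<DEL>`.

```python
import gzip
import re

# Assumption: a "plain" allele is made of base letters only. This is stricter
# than the VCF specification, which also permits '*', '.', symbolic alleles
# and breakend notation, so hits need manual review.
PLAIN_ALLELE = re.compile(r"[ACGTNacgtn]+")


def find_suspicious_alt(lines):
    """Yield (chrom, pos, allele) for every ALT allele that is not plain bases."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # too short to contain an ALT column
        chrom, pos, alt = fields[0], fields[1], fields[4]
        for allele in alt.split(","):
            if not PLAIN_ALLELE.fullmatch(allele):
                yield chrom, pos, allele


def scan_vcf(path):
    """Scan a plain or gzip/bgzip-compressed VCF file and return all hits."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return list(find_suspicious_alt(handle))
```

Calling `scan_vcf("data/grch38/civic/civic.vcf.gz")` on the bundle described above should then list the record with the "-" allele, provided the file is tab-separated as assumed.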

Unrelated note: there is a FORMAT field in the header of data/grch38/tcga/tcga.vcf.gz, but no corresponding column, unlike in the other VCFs, which have correct column names. This prevents bcftools from reading the file.
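A header/record mismatch of this kind can be detected without bcftools. The sketch below is only an illustration under one assumption: that "no corresponding column" means the `#CHROM` line names more columns (here FORMAT) than the data records actually contain. The function name `column_mismatches` is hypothetical.

```python
def column_mismatches(lines):
    """Compare the #CHROM column names with the field count of each record.

    Returns (header_columns, mismatches), where mismatches is a list of
    (line_number, n_fields) for records whose field count differs from
    the number of columns named in the header.
    """
    header = None
    mismatches = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text or text.startswith("##"):
            continue  # blank lines and meta-information lines
        if text.startswith("#CHROM"):
            header = text.lstrip("#").split("\t")
            continue
        if header is None:
            raise ValueError("record found before the #CHROM header line")
        n_fields = len(text.split("\t"))
        if n_fields != len(header):
            mismatches.append((number, n_fields))
    return header, mismatches
```

For a compressed file, the lines can be supplied with `gzip.open(path, "rt")`. If the returned header ends in `FORMAT` and every record is reported with eight fields, that matches the situation assumed above.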

Thanks

Dmitriy

sigven commented 2 years ago

Hi Dmitriy,

Thanks a lot for reaching out. This is very useful feedback and warrants further QC steps for the data bundled with PCGR. I will set up a routine for this, e.g. using bcftools, to ensure that the bundled VCFs are not wrongly encoded.

With respect to full compliance with the VCF specification, that is indeed an interesting topic. I previously had vcf-validator as part of a validation step in PCGR (ensuring that input files followed the VCF requirements), but our experience was that very few users had their VCFs in full compliance with the specification, and we got many questions as to why their VCFs were not accepted by PCGR. We have thus removed this validation step altogether in the latest dev version. In the latest version, I believe the error you encountered in the CIViC VCF is no longer present (new data bundle), while the FORMAT error in the TCGA VCF is still there, so I will fix that one.

Installation instructions: https://sigven.github.io/pcgr/articles/installation.html
Code: https://github.com/sigven/pcgr/tree/dev

Thanks again for filing an issue for this, highly appreciated.

best, Sigve

ddrichel commented 2 years ago

Hi Sigve,

Thanks, I was not aware of the 20220203 version of the data bundle. I absolutely see how VCF validation is a complex and often fuzzy issue. For my purposes, a VCF is compliant enough if it can be processed by bcftools.

Appreciate the helpful feedback,

Dmitriy

sigven commented 2 years ago

Yeah, there is really no reason why you should have been aware of it. We have tried to keep it somewhat unofficial during the last few weeks of testing :-). Thanks for reaching out.

Cheers, Sigve