Closed: ddrichel closed this issue 2 years ago
Hi Dmitriy,
Thanks a lot for reaching out. This is very useful feedback, and it warrants further QC of the data bundled with PCGR. I will set up a routine for this, e.g. using bcftools, to ensure that the bundled VCFs are not wrongly encoded.
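A minimal sketch of what such a routine could look like, assuming bcftools is on the PATH and the bundle is unpacked under a single directory. The function names `bcftools_command` and `check_bundle` are hypothetical and not part of PCGR. It only catches files that bcftools refuses to read (non-zero exit status); records that bcftools tolerates but that are still oddly encoded would need a separate check.

```python
import shutil
import subprocess
from pathlib import Path


def bcftools_command(vcf_path):
    """Build a command that makes bcftools parse the whole file (hypothetical helper)."""
    return ["bcftools", "view", str(vcf_path)]


def check_bundle(bundle_dir):
    """Return {path: stderr} for every *.vcf.gz under bundle_dir that bcftools rejects."""
    if shutil.which("bcftools") is None:
        raise RuntimeError("bcftools not found on PATH")
    failures = {}
    for vcf in sorted(Path(bundle_dir).rglob("*.vcf.gz")):
        # The parsed records are discarded; only the exit status and stderr matter here.
        result = subprocess.run(
            bcftools_command(vcf),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            text=True,
        )
        if result.returncode != 0:
            failures[str(vcf)] = result.stderr.strip()
    return failures
```

Calling `check_bundle("data/grch38")` on an unpacked bundle would then return an empty dict when every VCF is readable.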
With respect to full compliance with the VCF specification, that is indeed an interesting topic. I previously had vcf-validator as part of a validation step in PCGR (ensuring that input files followed the VCF requirements), but our experience was that very few users had their VCFs in full compliance with the specification, and we got many questions as to why their VCFs were not accepted by PCGR. We have thus removed this validation step altogether in the latest dev version. In the latest version, the error you encountered in the civic VCF is no longer present, I believe (new data bundle), while the FORMAT error in the TCGA VCF is still there, so I will fix that one.
Installation instructions: https://sigven.github.io/pcgr/articles/installation.html
Code: https://github.com/sigven/pcgr/tree/dev
Thanks again for filing an issue for this, highly appreciated.
best, Sigve
Hi Sigve,
Thanks, I was not aware of the 20220203 version of the data bundle. I absolutely see how VCF validation is a complex and often fuzzy issue. For my purposes, a VCF is compliant enough if it can be processed by bcftools.
Appreciate the helpful feedback,
Dmitriy
Yeah, there is really no reason why you should have been aware of it. We have tried to keep it somewhat unofficial during the last few weeks of testing :-). Thanks for reaching out.
Cheers, Sigve
In the data bundle pcgr.databundle.grch38.20210627.tgz, in data/grch38/civic/civic.vcf.gz there is an entry with ALT allele "-":
4 54728011 455594177-_- C - . PASS CIVIC_ID=EID2475
I understand that full compliance with the VCF specification might be outside the scope of the project, but this looks like there might be a bigger underlying issue. This is the only non-alphanumeric allele I found in the VCFs in the whole data bundle.
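For reference, a check along these lines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pattern accepts plain base strings, `*`, `.` and symbolic alleles such as `<DEL>`, and deliberately ignores breakend notation; the names `VALID_ALT`, `suspicious_alts` and `scan_vcf` are made up for this example.

```python
import gzip
import re

# Accepted here: base strings, '*' (spanning deletion), '.' (missing), symbolic <ID> alleles.
# Breakend notation is not covered in this sketch and would be reported as suspicious.
VALID_ALT = re.compile(r"[ACGTNacgtn]+|\*|\.|<[^<>\s]+>")


def suspicious_alts(lines):
    """Yield (chrom, pos, alt) for every ALT allele that does not match VALID_ALT."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # too short to contain an ALT column
        chrom, pos, alt_field = fields[0], fields[1], fields[4]
        for alt in alt_field.split(","):
            if not VALID_ALT.fullmatch(alt):
                yield chrom, pos, alt


def scan_vcf(path):
    """Collect suspicious ALT alleles from a gzip- or bgzip-compressed VCF."""
    with gzip.open(path, "rt") as handle:
        return list(suspicious_alts(handle))
```

Run on the civic VCF, this would flag the record above because its ALT field is `-`.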
Unrelated note: there is a FORMAT field in the header of data/grch38/tcga/tcga.vcf.gz, but no corresponding column, unlike in the other VCFs, which have correct column names. This prevents bcftools from reading the file.
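A rough way to detect this kind of mismatch, assuming the problem is that the `#CHROM` line declares a FORMAT column while the records carry no matching field. The function name `column_problems` is hypothetical, and the sketch works on already-decompressed text lines.

```python
def column_problems(lines):
    """Return human-readable problems with the column layout of a VCF."""
    problems = []
    header_cols = None
    for number, line in enumerate(lines, start=1):
        if line.startswith("##") or not line.strip():
            continue  # meta-information and blank lines carry no columns
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            header_cols = fields
            # FORMAT is only meaningful when at least one sample column follows it.
            if fields[-1] == "FORMAT":
                problems.append("FORMAT is declared but no sample column follows it")
        elif header_cols is not None and len(fields) != len(header_cols):
            problems.append(
                f"line {number}: {len(fields)} fields, header declares {len(header_cols)}"
            )
            break  # one offending record is enough for a QC report
    return problems
```

An empty list means the column line and the records agree on the number of fields.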
Thanks
Dmitriy