jacmarjorie closed 9 years ago
Operator error. I see these have been moved to the config tool.
I've reproduced this error and confirmed that it was not occurring at this commit: https://github.com/ucscCancer/tcgavcf-tool/commit/7cbb6f7074e681911f47ff3f8f3c4a2e1628ff81 (that is, before the edits this tool has undergone over the last couple of days).
My immediate thought is that the recent changes to reheader-config and reheader are out of sync with each other.
The error specifically is that there is a None value in the SDRF output:
SDRF output contains illegal None value:
['f23b3d0d-26a5-4adf-8aec-4994d094465b', 'TCGA-W5-AA33-01A-11D-A417-09', None, 'DNA', '->', 'hg19', '->', '->', '->', '->', '->', '9a6ebf433eb4bcb93be593f74ffa1d3b.bam', 'dbGAP', 'yes', 'genome.wustl.edu:variant_calling:varscan:01', 'varscan.snp.genome.wustl.edu.TCGA-W5-AA33.vcf', '1.1', 'yes', 'Mutations', 'Level 2', 'ucsc.edu_BRCA.Multicenter_mutation_calling_MC3.Level_2.1.0.0']
This is likely caused by a missing header item that I haven't yet been able to identify.
Also, the center field is duplicated in the config and the reheader tool.
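To narrow down which SDRF column is empty, something like the following could help. It is only a minimal sketch: `none_positions` is a hypothetical helper, not part of the validator or the reheader tool, and the row below is a shortened copy of the one in the error message.

```python
def none_positions(row):
    """Return the zero-based indexes of every None entry in an SDRF row."""
    return [i for i, value in enumerate(row) if value is None]


# First few fields of the row reported by the validator (shortened).
row = [
    "f23b3d0d-26a5-4adf-8aec-4994d094465b",
    "TCGA-W5-AA33-01A-11D-A417-09",
    None,
    "DNA",
    "->",
    "hg19",
]

print(none_positions(row))  # [2]
```

The None sits in the third field, directly after the sample UUID and the TCGA barcode.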
I've identified the source of the issue:
ID=TUMOR throws the SDRF error; ID=PRIMARY passes inspection.
FAILS:
##SAMPLE=<ID=TUMOR,Description="Primary Tumor",SampleUUID=f23b3d0d-26a5-4adf-8aec-4994d094465b,SampleTCGABarcode=TCGA-W5-AA33-01A-11D-A417-09,AnalysisUUID=cd5d8895-6b13-450f-993b-bff9943dc0d9,File="9a6ebf433eb4bcb93be593f74ffa1d3b.bam",Platform="illumina",Source="dbGAP",Accession="dbGaP",softwareName=<varscan>,softwareVer=<2.4.0>,softwareParam=<--min-coverage 8 --min-coverage-normal 8 --min-coverage-tumor 6 --min-var-freq 0.1 --min-freq-for-hom 0.75 --normal-purity 1.0 --tumor-purity 1.0 --p-value 0.99 --somatic-p-value 0.05>>
PASSES:
##SAMPLE=<ID=PRIMARY,Description="Primary Tumor",SampleUUID=f23b3d0d-26a5-4adf-8aec-4994d094465b,SampleTCGABarcode=TCGA-W5-AA33-01A-11D-A417-09,AnalysisUUID=cd5d8895-6b13-450f-993b-bff9943dc0d9,File="9a6ebf433eb4bcb93be593f74ffa1d3b.bam",Platform="illumina",Source="dbGAP",Accession="dbGaP",softwareName=<varscan>,softwareVer=<2.4.0>,softwareParam=<--min-coverage 8 --min-coverage-normal 8 --min-coverage-tumor 6 --min-var-freq 0.1 --min-freq-for-hom 0.75 --normal-purity 1.0 --tumor-purity 1.0 --p-value 0.99 --somatic-p-value 0.05>>
How fantastic is that?
I didn't fix that because the reheadering tool is supposed to take care of it. The validator checks the sample TCGA ID, and radia has no knowledge of that ID when it runs. The SAMPLE ID also has to match whatever is in the last field(s) of the VCF body; for instance, radia uses DNA_TUMOR and RNA_TUMOR, so there should be sample headers that use these IDs. I've suggested before that the reheadering tool turn these column headers into matching IDs (PRIMARY, in this case).
Wait, I'm talking about the reheader tool. I'm testing on output from varscan currently.
If we have SAMPLE=<ID=PRIMARY... and the column header says PRIMARY, then the dcc validator does not throw the SDRF error, and it also does not print this error:
Column header contains sample column name 'TUMOR' that does not have a corresponding SAMPLE header
However, if we have SAMPLE=<ID=TUMOR..., the SDRF error is thrown regardless of the column name (TUMOR or PRIMARY).
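The column-name check described above can be approximated locally before running the validator. This is a rough sketch and not the dcc validator's own code: `unmatched_sample_columns` is a hypothetical helper, and it assumes a tab-separated #CHROM line with the nine fixed VCF columns before the sample columns. It does not reproduce the SDRF check itself.

```python
import re

# Captures the ID value of a ##SAMPLE=<ID=...> header line.
SAMPLE_ID = re.compile(r"^##SAMPLE=<ID=([^,>]+)")


def unmatched_sample_columns(header_lines):
    """Return sample column names that have no ##SAMPLE header with the same ID."""
    sample_ids = set()
    columns = []
    for line in header_lines:
        match = SAMPLE_ID.match(line)
        if match:
            sample_ids.add(match.group(1))
        elif line.startswith("#CHROM"):
            # CHROM..FORMAT are the 9 fixed columns; samples follow.
            columns = line.rstrip("\n").split("\t")[9:]
    return [name for name in columns if name not in sample_ids]


header = [
    "##fileformat=VCFv4.1",
    '##SAMPLE=<ID=NORMAL,Description="Normal">',
    '##SAMPLE=<ID=PRIMARY,Description="Primary Tumor">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR",
]

print(unmatched_sample_columns(header))  # ['TUMOR']
```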
Agreed that the VCF reheader should replace TUMOR (or whatever else this field gets called) with PRIMARY.
I see the problem! PRIMARY is indeed hardcoded, because the program needs to distinguish normal from tumor. I can add TUMOR as an alternative. Would that work?
Yes, that should work... If the tools (radia, varscan, etc) are hard coding the column header as TUMOR, we should indeed change the hard coding to say PRIMARY instead. Then the vcf-reheader can take care of the SAMPLE=<ID=PRIMARY portion.
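For illustration, the rename discussed here could look roughly like the sketch below. This is an assumption about how it might be done, not the actual vcf-reheader or vcfToArchive code; `rename_sample` is a hypothetical name, and the header lines are shortened examples.

```python
import re


def rename_sample(header_lines, old="TUMOR", new="PRIMARY"):
    """Rename a sample ID in the ##SAMPLE header and in the #CHROM column line."""
    # Match only an exact ID (followed by ',' or '>'), so DNA_TUMOR is left alone.
    id_pattern = re.compile(r"^(##SAMPLE=<ID=)" + re.escape(old) + r"(?=[,>])")
    renamed = []
    for line in header_lines:
        if line.startswith("##SAMPLE=<ID="):
            line = id_pattern.sub(lambda m: m.group(1) + new, line)
        elif line.startswith("#CHROM"):
            newline = "\n" if line.endswith("\n") else ""
            fields = line.rstrip("\n").split("\t")
            # Only the sample columns (after the 9 fixed ones) are renamed.
            fields[9:] = [new if f == old else f for f in fields[9:]]
            line = "\t".join(fields) + newline
        renamed.append(line)
    return renamed


header = [
    '##SAMPLE=<ID=NORMAL,Description="Normal">',
    '##SAMPLE=<ID=TUMOR,Description="Primary Tumor">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR",
]

for line in rename_sample(header):
    print(line)
```

Renaming the SAMPLE ID and the column header in one pass keeps the two in agreement, which is what the validator appears to require.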
Done!
Wait, Jeltje, did you just fix this in Radia? This is still an open issue for the vcf-reheader tool.
I don't work on the reheadering tool. I fixed it in the mc3/scripts/vcfToArchive script, which is what generated the 'SDRF output contains illegal None value' error.
Somewhere along the line, the fileDate and reference headers were removed from your galaxy tool. With these fields missing, the dcc_validator throws an error.
I have a local branch for the vcfProcessLog additions, and will fix this in that branch.
PR coming soon.
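In the meantime, a quick pre-check for the two missing headers could be sketched as follows. `missing_required_headers` is a hypothetical helper, and the required list contains only the two fields named above; the validator may well require others.

```python
def missing_required_headers(header_lines, required=("fileDate", "reference")):
    """Return the required ##key=value meta headers that are absent."""
    present = set()
    for line in header_lines:
        if line.startswith("##") and "=" in line:
            # Key is everything between '##' and the first '='.
            present.add(line[2:].split("=", 1)[0])
    return [key for key in required if key not in present]


header = [
    "##fileformat=VCFv4.1",
    "##reference=hg19",
    '##SAMPLE=<ID=PRIMARY,Description="Primary Tumor">',
]

print(missing_required_headers(header))  # ['fileDate']
```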