ucscCancer / tcgavcf-tool


SDRF error related to PRIMARY vs TUMOR #1

Closed jacmarjorie closed 9 years ago

jacmarjorie commented 9 years ago

Somewhere along the line, the fileDate and reference headers were removed from your galaxy tool. With these fields missing, the dcc_validator throws an error.

I have a local branch for the vcfProcessLog additions, and will fix this in that branch.

PR coming soon.

jacmarjorie commented 9 years ago

Operator error. I see these have been moved to the config tool.

jacmarjorie commented 9 years ago

I've reproduced this error and confirmed that it was not occurring at this commit: https://github.com/ucscCancer/tcgavcf-tool/commit/7cbb6f7074e681911f47ff3f8f3c4a2e1628ff81 (that is, before the edits this tool has undergone over the last couple of days).

My immediate thought is that the changes to reheader-config and reheader are out of sync with each other.

The error specifically is that there is a None value in the SDRF output:

SDRF output contains illegal None value:
['f23b3d0d-26a5-4adf-8aec-4994d094465b', 'TCGA-W5-AA33-01A-11D-A417-09', None, 'DNA', '->', 'hg19', '->', '->', '->', '->', '->', '9a6ebf433eb4bcb93be593f74ffa1d3b.bam', 'dbGAP', 'yes', 'genome.wustl.edu:variant_calling:varscan:01', 'varscan.snp.genome.wustl.edu.TCGA-W5-AA33.vcf', '1.1', 'yes', 'Mutations', 'Level 2', 'ucsc.edu_BRCA.Multicenter_mutation_calling_MC3.Level_2.1.0.0']

This is likely caused by a missing header item that I haven't been able to identify yet.
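A quick way to see which field is missing is to list the positions of the None entries in the row. The snippet below is only a sketch: `none_positions` is a hypothetical helper, not part of the validator or the reheader tool, and the row is shortened to the first six fields of the output above.

```python
def none_positions(row):
    """Return the indices of all None entries in an SDRF row."""
    return [index for index, value in enumerate(row) if value is None]


# First six fields of the SDRF row reported above (shortened for readability).
row = [
    'f23b3d0d-26a5-4adf-8aec-4994d094465b',
    'TCGA-W5-AA33-01A-11D-A417-09',
    None,
    'DNA',
    '->',
    'hg19',
]

print(none_positions(row))  # [2]
```

The None sits in the third column, directly after the sample barcode.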

jacmarjorie commented 9 years ago

Also, the center field is duplicated in the config and the reheader tool.

jacmarjorie commented 9 years ago

I've identified the source of the issue:

ID=TUMOR throws the SDRF error.
ID=PRIMARY passes inspection.

FAILS:

##SAMPLE=<ID=TUMOR,Description="Primary Tumor",SampleUUID=f23b3d0d-26a5-4adf-8aec-4994d094465b,SampleTCGABarcode=TCGA-W5-AA33-01A-11D-A417-09,AnalysisUUID=cd5d8895-6b13-450f-993b-bff9943dc0d9,File="9a6ebf433eb4bcb93be593f74ffa1d3b.bam",Platform="illumina",Source="dbGAP",Accession="dbGaP",softwareName=<varscan>,softwareVer=<2.4.0>,softwareParam=<--min-coverage 8 --min-coverage-normal 8 --min-coverage-tumor 6 --min-var-freq 0.1 --min-freq-for-hom 0.75 --normal-purity 1.0 --tumor-purity 1.0 --p-value 0.99 --somatic-p-value 0.05>>

PASSES:

##SAMPLE=<ID=PRIMARY,Description="Primary Tumor",SampleUUID=f23b3d0d-26a5-4adf-8aec-4994d094465b,SampleTCGABarcode=TCGA-W5-AA33-01A-11D-A417-09,AnalysisUUID=cd5d8895-6b13-450f-993b-bff9943dc0d9,File="9a6ebf433eb4bcb93be593f74ffa1d3b.bam",Platform="illumina",Source="dbGAP",Accession="dbGaP",softwareName=<varscan>,softwareVer=<2.4.0>,softwareParam=<--min-coverage 8 --min-coverage-normal 8 --min-coverage-tumor 6 --min-var-freq 0.1 --min-freq-for-hom 0.75 --normal-purity 1.0 --tumor-purity 1.0 --p-value 0.99 --somatic-p-value 0.05>>

How fantastic is that?
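The only difference between the two headers is the ID value. A minimal sketch for pulling that ID out of a ##SAMPLE line follows; `sample_id` is a hypothetical helper (not the reheader tool's code), and the example headers are cut down to their first few fields.

```python
import re

# The ID is the first key inside the angle brackets of a ##SAMPLE header.
SAMPLE_ID_RE = re.compile(r'^##SAMPLE=<ID=([^,>]+)')


def sample_id(header_line):
    """Return the ID of a ##SAMPLE header line, or None if it is not one."""
    match = SAMPLE_ID_RE.match(header_line)
    return match.group(1) if match else None


# Shortened versions of the two headers above.
failing = '##SAMPLE=<ID=TUMOR,Description="Primary Tumor",Source="dbGAP">'
passing = '##SAMPLE=<ID=PRIMARY,Description="Primary Tumor",Source="dbGAP">'

print(sample_id(failing))  # TUMOR
print(sample_id(passing))  # PRIMARY
```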

Jeltje commented 9 years ago

I didn't fix that because the reheadering tool is supposed to take care of it. The validator checks the sample TCGA ID, and radia has no knowledge of that ID when it runs. The SAMPLE ID also has to match whatever is in the last field(s) of the VCF body; for instance, radia uses DNA_TUMOR and RNA_TUMOR, so there should be sample headers that use these IDs. I've suggested before that the reheadering tool turn these column headers into matching IDs (PRIMARY, in this case).
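The matching requirement can be checked with a few lines of Python. This is a sketch under assumptions, not the validator's actual logic: `unmatched_sample_columns` is a hypothetical helper, the header lines are made up for illustration (including the NORMAL sample), and it relies only on the standard VCF layout in which the sample columns follow the nine fixed columns CHROM through FORMAT.

```python
import re

FIXED_COLUMNS = 9  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT


def unmatched_sample_columns(header_lines):
    """Return sample column names that have no ##SAMPLE=<ID=...> header."""
    declared = set()
    columns = []
    for line in header_lines:
        match = re.match(r'##SAMPLE=<ID=([^,>]+)', line)
        if match:
            declared.add(match.group(1))
        elif line.startswith('#CHROM'):
            # Everything after the fixed columns is a sample column.
            columns = line.rstrip('\n').split('\t')[FIXED_COLUMNS:]
    return [name for name in columns if name not in declared]


# Illustrative header: the column says TUMOR but the SAMPLE header says PRIMARY.
header = [
    '##SAMPLE=<ID=NORMAL,Description="Normal sample">',
    '##SAMPLE=<ID=PRIMARY,Description="Primary Tumor">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR',
]

print(unmatched_sample_columns(header))  # ['TUMOR']
```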

jacmarjorie commented 9 years ago

Wait, I'm talking about the reheader tool. I'm testing on output from varscan currently.

If we have SAMPLE=<ID=PRIMARY... and the column header says PRIMARY, then the dcc validator does not throw the SDRF error, and it also does not print this error:

 Column header contains sample column name 'TUMOR' that does not have a corresponding SAMPLE header

However, if we have SAMPLE=<ID=TUMOR..., the SDRF error is thrown regardless of the column name (TUMOR or PRIMARY).

Agreed that the VCF reheader should replace TUMOR (or whatever else this field gets called) with PRIMARY.
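Something along these lines could do the replacement. It is a rough sketch, not the reheader's current code: `rename_sample` is a hypothetical function, the header lines are illustrative, and it assumes lines without trailing newlines and the usual nine fixed VCF columns before the sample columns.

```python
def rename_sample(header_lines, old='TUMOR', new='PRIMARY'):
    """Rename a sample ID in the ##SAMPLE header and the #CHROM column line."""
    renamed = []
    for line in header_lines:
        if line.startswith('##SAMPLE=<ID=' + old + ','):
            # Only the leading ID key is rewritten, never later fields.
            line = line.replace('ID=' + old + ',', 'ID=' + new + ',', 1)
        elif line.startswith('#CHROM'):
            fields = line.split('\t')
            # Leave the nine fixed columns alone; rename sample columns only.
            fields[9:] = [new if name == old else name for name in fields[9:]]
            line = '\t'.join(fields)
        renamed.append(line)
    return renamed


# Illustrative input; NORMAL is an assumed name for the normal sample.
header = [
    '##SAMPLE=<ID=TUMOR,Description="Primary Tumor">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR',
]

for line in rename_sample(header):
    print(line)
```

Renaming both places in one pass keeps the SAMPLE ID and the column header in agreement, which is what the validator checks for.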

Jeltje commented 9 years ago

I see the problem! PRIMARY is indeed hardcoded, because the program needs to distinguish normal from tumor. I can add TUMOR as an alternative. Would that work?
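For illustration only, accepting TUMOR as an alternative could look like the sketch below. It is not the actual script: `SAMPLE_ROLES` and `sample_role` are hypothetical names, and NORMAL as the ID of the normal sample is an assumption. Raising on an unknown ID, instead of silently returning None, is one way to keep a None from ending up in the SDRF row.

```python
# Hypothetical mapping from SAMPLE ID to role; NORMAL is an assumed ID.
SAMPLE_ROLES = {
    'NORMAL': 'normal',
    'PRIMARY': 'tumor',
    'TUMOR': 'tumor',  # accepted as an alternative to PRIMARY
}


def sample_role(sample_id):
    """Return 'normal' or 'tumor' for a SAMPLE ID, or fail loudly."""
    try:
        return SAMPLE_ROLES[sample_id]
    except KeyError:
        raise ValueError('Unrecognized SAMPLE ID: %r' % sample_id)


print(sample_role('PRIMARY'))  # tumor
print(sample_role('TUMOR'))    # tumor
```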

jacmarjorie commented 9 years ago

Yes, that should work... If the tools (radia, varscan, etc.) are hard-coding the column header as TUMOR, we should indeed change the hard-coding to say PRIMARY instead. Then the vcf-reheader can take care of the SAMPLE=<ID=PRIMARY portion.

Jeltje commented 9 years ago

Done!

jacmarjorie commented 9 years ago

Wait, Jeltje - did you just fix this in Radia? This is an open issue still for the vcf-reheader tool.

Jeltje commented 9 years ago

I don't work on the reheadering tool. I fixed it in the mc3/scripts/vcfToArchive script, which is what generated the 'SDRF output contains illegal None value' error.