Open: gibberwocky opened this issue 7 months ago
OK, I've found the cause of this after running the code interactively, and I see that it is also mentioned in the Technical Notes (https://github.com/nrlab-CRUK/INVAR2/blob/master/docs/TechnicalNotes_FAQ.md), which state:
You have a loci that has been called as a tumour mutation in two or more samples (so a duplicate line in the tumour mutations list csv) which leads to having two lines in the mpileup, and crashes sizeAnnotation when comparing pysam and mpileup results. To resolve: remove the duplicated loci in your tumour mutations file.
And the tumour mutations file section in 'SettingUp' states:
Note: please ensure each loci is called only once per patient! (ie no duplicates in the list) else this would cause issues in the sizeAnnotation.R step.
My take from this, based on the error, is that the workflow does not handle multiallelic sites within a patient. If a patient has a germline variant of T/A and a somatic variant of T/C at the same position, then only the T/C variant should go in the tumour mutations file. So the patient's germline variants (i.e. those called from normal sequence data) need to be removed from the variants included in the tumour mutations file. This probably should have been obvious, assuming it is the case, but I come from an animal genetics background rather than a cancer genetics one, so I took the following text:
This is a CSV file listing patient mutations.
to indicate that any variant called relative to the reference should be included. I'll exclude germline variants and see if the workflow progresses.
Addressing the above solved that problem, so it was a misunderstanding on my part.
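For anyone hitting the same sizeAnnotation error, the "each loci is called only once per patient" rule can be checked before launching the pipeline. The following is a minimal sketch, not part of INVAR2: the function name `duplicated_loci_per_patient` is hypothetical, the column order (CHROM, POS, REF, ALT, TUMOUR_AF, PATIENT) is an assumption inferred from rows of a tumour mutations file, and the example rows are made up to mirror the germline T/A plus somatic T/C case described above.

```python
import csv
import io
from collections import Counter

# Assumed column order: CHROM, POS, REF, ALT, TUMOUR_AF, PATIENT.
# This is inferred from example rows, not taken from the INVAR2 documentation.
def duplicated_loci_per_patient(handle):
    """Return {(patient, chrom, pos): count} for loci listed more than once for a patient."""
    counts = Counter()
    for row in csv.reader(handle):
        # Skip blank lines and a header row (if present): POS must be an integer.
        if len(row) < 6 or not row[1].strip().isdigit():
            continue
        chrom, pos, _ref, _alt, _af, patient = (field.strip() for field in row[:6])
        counts[(patient, chrom, int(pos))] += 1
    return {key: n for key, n in counts.items() if n > 1}

# Made-up rows: patient P1 has a germline T/A and a somatic T/C at the same position.
example = io.StringIO(
    "1,1000,T,A,0.50,P1\n"
    "1,1000,T,C,0.12,P1\n"
    "1,2000,G,C,0.08,P1\n"
    "1,1000,T,C,0.10,P2\n"
)
duplicates = duplicated_loci_per_patient(example)
print(duplicates)
```

To run it on a real file, pass an open file handle (for example `open("tumour_mutations.csv", newline="")`) instead of the `io.StringIO` example. An empty result means no locus is listed twice for the same patient.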
I've since run into essentially the same issue with a different dataset. In this instance, the error occurs with rows such as the following:
> data.frame(mutationsTable[1:2,])
  CHROM      POS REF ALT   DP           DP4 REF_F ALT_F REF_R ALT_R MQSB
1    38 11112512   G   A 2561 295,2265,24,1   295     0  2265     1    1
2    38 11112512   G   C 2584 295,2265,24,1   295    24  2265     0    1
               SAMPLE_ID COSMIC_MUTATIONS COSMIC_SNP X1KG_AF TRINUCLEOTIDE
1 cfDNA_Captured_325_S13                0      FALSE       0           GCG
2 cfDNA_Captured_325_S13                0      FALSE       0           GCG
            AF COSMIC   SNP ON_TARGET PATIENT CASE_OR_CONTROL  TUMOUR_AF
1 0.0002577984  FALSE FALSE      TRUE    S325            case 0.04716981
2 0.0061871616  FALSE FALSE      TRUE    S325            case 0.04716981
  MUTATION_CLASS PATIENT_MUTATION_BELONGS_TO BACKGROUND_MUTATION_SUM
1            C/G                        S265                       0
2            C/G                        S265                       0
  BACKGROUND_DP BACKGROUND_AF LOCUS_NOISE.PASS BOTH_STRANDS.PASS
1         71068             0            FALSE             FALSE
2         71068             0            FALSE             FALSE
  CONTAMINATION_RISK.PASS MUTATED_READS_PER_LOCI  UNIQUE_POS      UNIQUE_ALT
1                    TRUE                      1 38:11112512 38:11112512_G/A
2                    TRUE                     24 38:11112512 38:11112512_G/C
  PATIENT_SPECIFIC
1            FALSE
2            FALSE
This particular variant site is present in the tumour_mutations.csv file for five samples:
38,11112512,G,C,0.0471698113207547,S265
38,11112512,G,C,0.0838323353293413,S305
38,11112512,G,C,0.152046783625731,S306
38,11112512,G,C,0.108695652173913,S331
38,11112512,G,A,0.130434782608696,S349
Instructions in SettingUp indicate:
Note: please ensure each loci is called only once per patient! (ie no duplicates in the list) else this would cause issues in the sizeAnnotation.R step.
As the locus is present only once in each sample, this should not be a problem. What have I missed?
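One thing that can be checked from the rows above: each patient lists the locus once, but across patients it carries two different ALT alleles (G>C in four, G>A in one), and the mutations table likewise has two rows at 38:11112512 for the same sample. Whether this is what the workflow trips over is an assumption, not something the INVAR2 documentation quoted here confirms. The sketch below only finds such loci; the function name `multiallelic_loci` is hypothetical, and the column order (CHROM, POS, REF, ALT, TUMOUR_AF, PATIENT) is inferred from the rows listed above.

```python
import csv
import io
from collections import defaultdict

# Assumed column order: CHROM, POS, REF, ALT, TUMOUR_AF, PATIENT (inferred, not documented here).
def multiallelic_loci(handle):
    """Map (chrom, pos) -> {alt: [patients]} for loci with more than one distinct ALT."""
    alts = defaultdict(lambda: defaultdict(list))
    for row in csv.reader(handle):
        # Skip blank lines and a header row (if present): POS must be an integer.
        if len(row) < 6 or not row[1].strip().isdigit():
            continue
        chrom, pos, _ref, alt, _af, patient = (field.strip() for field in row[:6])
        alts[(chrom, int(pos))][alt].append(patient)
    return {locus: dict(by_alt) for locus, by_alt in alts.items() if len(by_alt) > 1}

# The five rows quoted from tumour_mutations.csv in this thread.
rows = io.StringIO(
    "38,11112512,G,C,0.0471698113207547,S265\n"
    "38,11112512,G,C,0.0838323353293413,S305\n"
    "38,11112512,G,C,0.152046783625731,S306\n"
    "38,11112512,G,C,0.108695652173913,S331\n"
    "38,11112512,G,A,0.130434782608696,S349\n"
)
shared = multiallelic_loci(rows)
print(shared)
```

Running this over the full tumour mutations file would show how many loci are affected, which may help narrow down whether the error is specific to positions like this one.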
I've run into an error when attempting to run the Nextflow pipeline, so thought I'd reach out in case it's something you've encountered and can quickly identify what might be causing it. Some of the output is outlined below.
Thanks in advance.
D