nrlab-CRUK / INVAR2

Restructured version of INVAR

Error executing process > sizeAnnotation:annotateMutationsWithFragmentSize #6

Open gibberwocky opened 5 months ago

gibberwocky commented 5 months ago

I've run into an error when attempting to run the Nextflow pipeline, so thought I'd reach out in case it's something you've encountered and can quickly identify what might be causing it. Some of the output is outlined below.

Thanks in advance.

D

ERROR ~ Error executing process > 'sizeAnnotation:annotateMutationsWithFragmentSize (BMD_LSA1_DNA_S20)'

Caused by:                                                              
  Process `sizeAnnotation:annotateMutationsWithFragmentSize (BMD_LSA1_DNA_S20)` terminated with an error exit status (1)

Command executed:                                                               

  Rscript --vanilla "/mnt/scratch/David/.nextflow/assets/nrlab-CRUK/INVAR2/R/2_size_annotation/sizeAnnotation.R" --mutations="mutation_table.on_target.rds" --fragment-sizes="BMD_LSA1_DNA_S20.inserts.tsv" --sample="BMD_LSA1_DNA_S20" --threads=4

Command exit status:                                                            
  1                                                                             

Command output:                                                                 
  (empty)

Command error:
  Have 0 mutants from mpileup at 13:56055323 but have 1 rows from pysam.
  Have 88 wild types from mpileup at 15:32647984 but have 1776 rows from pysam.
  Have 0 mutants from mpileup at 14:13726714 but have 2 rows from pysam.
  Have 0 mutants from mpileup at 14:13726716 but have 4 rows from pysam.
  Have 0 mutants from mpileup at 14:13726715 but have 18 rows from pysam.
  Have 1 mutants from mpileup at 14:18181914 but have 3 rows from pysam.
  Have 0 mutants from mpileup at 15:32647984 but have 132 rows from pysam.
  Have 0 mutants from mpileup at 15:32647983 but have 220 rows from pysam.
  Have 0 mutants from mpileup at 16:44225978 but have 1 rows from pysam.
  Have 0 mutants from mpileup at 17:26845263 but have 10 rows from pysam.
  Have 0 mutants from mpileup at 16:44232819 but have 1 rows from pysam.
  Have 2 mutants from mpileup at 17:19489523 but have 4 rows from pysam.
  Have 0 mutants from mpileup at 17:19504695 but have 2 rows from pysam.
  Have 12 wild types from mpileup at 17:22293144 but have 40 rows from pysam.
  Have 4 mutants from mpileup at 1:112082460 but have 6 rows from pysam.
  Have 0 mutants from mpileup at 19:15195346 but have 1 rows from pysam.
  Have 0 mutants from mpileup at 21:38811639 but have 5 rows from pysam.
  Have 0 mutants from mpileup at 21:38811643 but have 2 rows from pysam.
  Have 0 mutants from mpileup at 22:37411036 but have 1 rows from pysam.
  Have 408 wild types from mpileup at 21:38811643 but have 495 rows from pysam.
  Have 0 mutants from mpileup at 21:45907515 but have 1 rows from pysam.
  Have 21 mutants from mpileup at 22:5682631 but have 57 rows from pysam.
  Have 1 mutants from mpileup at 22:5682635 but have 4 rows from pysam.
  Have 0 mutants from mpileup at 25:44728842 but have 2 rows from pysam.
  Have 0 mutants from mpileup at 26:37909891 but have 2 rows from pysam.
  Have 0 mutants from mpileup at 26:7736759 but have 4 rows from pysam.
  Have 1482 wild types from mpileup at 27:5537489 but have 1836 rows from pysam. 
  Have 319 wild types from mpileup at 2:41317493 but have 372 rows from pysam.
  Have 1 mutants from mpileup at 34:14853909 but have 2 rows from pysam.
  Have 1 mutants from mpileup at 36:22414921 but have 2 rows from pysam.
  Have 0 mutants from mpileup at 37:21995539 but have 1 rows from pysam.
  Have 5 mutants from mpileup at 38:20027972 but have 6 rows from pysam.
  Have 794 wild types from mpileup at 4:34664066 but have 1312 rows from pysam.
  Have 3 mutants from mpileup at 7:26626340 but have 4 rows from pysam.
  Have 0 mutants from mpileup at 9:34730757 but have 4 rows from pysam.
  Have 2 mutants from mpileup at 9:41468848 but have 3 rows from pysam.
  Have 3 mutants from mpileup at X:33958979 but have 7 rows from pysam.
  Have 0 mutants from mpileup at X:59989822 but have 1 rows from pysam.
  Error: Argument 1 must have names.
  Backtrace:
      x
   1. +-global::main(parseOptions())
   2. | \-global::equaliseSizeCounts(...)
   3. |   \-`%>%`(...)
   4. +-dplyr::select(., UNIQUE_POS, SIZE, MUTANT)
   5. \-dplyr::bind_rows(.)
  Warning message:
  In mclapply(., sampleFragments, mutationsTableDepths, mc.cores = threads) :
    scheduled cores 2, 1 encountered errors in user code, all values of the jobs will be affected
  Execution halted

gibberwocky commented 5 months ago

OK, I've found the cause of this after running the code interactively, and I also note that it is mentioned in the Technical Notes section (https://github.com/nrlab-CRUK/INVAR2/blob/master/docs/TechnicalNotes_FAQ.md). The technical notes state:

You have a loci that has been called as a tumour mutation in two or more samples (so a duplicate line in the tumour mutations list csv) which leads to having two lines in the mpileup, and crashes sizeAnnotation when comparing pysam and mpileup results. To resolve: remove the duplicated loci in your tumour mutations file.

And the tumour mutations file section in 'SettingUp' states:

Note: please ensure each loci is called only once per patient! (ie no duplicates in the list) else this would cause issues in the sizeAnnotation.R step.
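
The condition described in these two notes can be checked before running the pipeline. The following is a minimal sketch and not part of INVAR2: the column order (chromosome, position, reference, alternate, tumour allele fraction, patient) is an assumption inferred from the rows quoted later in this thread, `find_duplicate_loci` is a hypothetical helper name, and the example data is made up.

```python
import csv
import io
from collections import Counter


def find_duplicate_loci(csv_text):
    """Return {(patient, chrom, pos): count} for loci listed more than once per patient.

    Assumed column order: chrom, pos, ref, alt, tumour AF, patient.
    """
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and a header row, if one is present.
        if len(row) < 6 or not row[1].strip().isdigit():
            continue
        chrom, pos, patient = row[0].strip(), int(row[1]), row[5].strip()
        counts[(patient, chrom, pos)] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Made-up example: patient P1 has the same locus listed with two ALT alleles.
example = """1,1000,T,A,0.50,P1
1,1000,T,C,0.12,P1
1,1000,T,C,0.20,P2
2,2000,G,A,0.08,P1
"""

duplicates = find_duplicate_loci(example)
print(duplicates)
```

The key deliberately ignores REF and ALT, since the notes talk about a locus appearing twice for a patient, not an identical variant appearing twice.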

My take from this, based on the error, is that the workflow does not handle multiallelic sites within a patient. If a patient has a germline variant of T/A and a somatic variant of T/C at the same position, then only the T/C variant should go in the tumour mutations file. So the patient's germline variants (i.e. those called from normal sequence data) need to be removed from the variants included in the tumour mutations file. This seems like it should be obvious, assuming it is the case, but I'm coming from an animal genetics background rather than a cancer genetics background, so I took the following text:

This is a CSV file listing patient mutations.

to indicate that any variant called relative to the reference should be included. I'll exclude germline variants and see if the workflow progresses.
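
The germline exclusion described above amounts to a set difference on exact variants. A minimal sketch under stated assumptions: the `Variant` type and `exclude_germline` function are hypothetical names, the data is made up, and how the germline calls are produced from normal sequence data is outside the scope of this example.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    patient: str
    chrom: str
    pos: int
    ref: str
    alt: str


def exclude_germline(all_calls, germline_calls):
    """Drop every call that exactly matches a germline call for the same patient."""
    germline = set(germline_calls)
    return [v for v in all_calls if v not in germline]


# Made-up example mirroring the T/A (germline) and T/C (somatic) case.
calls = [
    Variant("P1", "1", 1000, "T", "A"),
    Variant("P1", "1", 1000, "T", "C"),
    Variant("P1", "2", 2000, "G", "A"),
]
germline = [Variant("P1", "1", 1000, "T", "A")]

somatic = exclude_germline(calls, germline)
for v in somatic:
    print(f"{v.chrom},{v.pos},{v.ref},{v.alt},{v.patient}")
```

Matching on the full variant (not just the position) keeps the somatic T/C call at a position where the germline call is T/A.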

gibberwocky commented 5 months ago

Addressing the above solved that problem. So a misunderstanding on my part.

gibberwocky commented 5 months ago

I've since run into essentially the same issue with a different dataset. In this instance, an example of the error occurs with the following rows:

> data.frame(mutationsTable[1:2,])                                                                                                                         
  CHROM      POS REF ALT   DP           DP4 REF_F ALT_F REF_R ALT_R MQSB                                                                                   
1    38 11112512   G   A 2561 295,2265,24,1   295     0  2265     1    1                                                                                   
2    38 11112512   G   C 2584 295,2265,24,1   295    24  2265     0    1                                                                                   
               SAMPLE_ID COSMIC_MUTATIONS COSMIC_SNP X1KG_AF TRINUCLEOTIDE                                                                                 
1 cfDNA_Captured_325_S13                0      FALSE       0           GCG                                                                                 
2 cfDNA_Captured_325_S13                0      FALSE       0           GCG                                                                                 
            AF COSMIC   SNP ON_TARGET PATIENT CASE_OR_CONTROL  TUMOUR_AF                                                                                   
1 0.0002577984  FALSE FALSE      TRUE    S325            case 0.04716981                                                                                   
2 0.0061871616  FALSE FALSE      TRUE    S325            case 0.04716981                                                                                   
  MUTATION_CLASS PATIENT_MUTATION_BELONGS_TO BACKGROUND_MUTATION_SUM                                                                                       
1            C/G                        S265                       0                                                                                       
2            C/G                        S265                       0                                                                                       
  BACKGROUND_DP BACKGROUND_AF LOCUS_NOISE.PASS BOTH_STRANDS.PASS                                                                                           
1         71068             0            FALSE             FALSE                                                                                           
2         71068             0            FALSE             FALSE                                                                                           
  CONTAMINATION_RISK.PASS MUTATED_READS_PER_LOCI  UNIQUE_POS      UNIQUE_ALT                                                                               
1                    TRUE                      1 38:11112512 38:11112512_G/A                                                                               
2                    TRUE                     24 38:11112512 38:11112512_G/C                                                                               
  PATIENT_SPECIFIC                                                                                                                                         
1            FALSE                                                                                                                                         
2            FALSE                     

This particular variant site is present in the tumour_mutations.csv file for five samples:

38,11112512,G,C,0.0471698113207547,S265
38,11112512,G,C,0.0838323353293413,S305
38,11112512,G,C,0.152046783625731,S306
38,11112512,G,C,0.108695652173913,S331
38,11112512,G,A,0.130434782608696,S349

Instructions in SettingUp indicate:

Note: please ensure each loci is called only once per patient! (ie no duplicates in the list) else this would cause issues in the sizeAnnotation.R step.

As the locus is present only once for each patient, this should not be a problem. What have I missed?
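
One thing visible in the data above is that the locus carries two different ALT alleles across patients (G/C and G/A), and the `mutationsTable` rows show both alleles at the same `UNIQUE_POS` for a single sample. Whether that is what breaks sizeAnnotation is not confirmed in this thread. The sketch below only makes such loci easy to spot; it assumes the same column order as the five rows quoted above, and `loci_with_multiple_alts` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# The five rows quoted above from tumour_mutations.csv.
rows_from_issue = """38,11112512,G,C,0.0471698113207547,S265
38,11112512,G,C,0.0838323353293413,S305
38,11112512,G,C,0.152046783625731,S306
38,11112512,G,C,0.108695652173913,S331
38,11112512,G,A,0.130434782608696,S349
"""


def loci_with_multiple_alts(csv_text):
    """Return {(chrom, pos): {"REF/ALT": [patients]}} for loci with more than one allele change."""
    by_locus = defaultdict(lambda: defaultdict(list))
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and a header row, if one is present.
        if len(row) < 6 or not row[1].strip().isdigit():
            continue
        chrom, pos, ref, alt, _af, patient = [field.strip() for field in row[:6]]
        by_locus[(chrom, int(pos))][f"{ref}/{alt}"].append(patient)
    return {
        locus: dict(by_alt)
        for locus, by_alt in by_locus.items()
        if len(by_alt) > 1
    }


multi_alt = loci_with_multiple_alts(rows_from_issue)
for (chrom, pos), by_alt in multi_alt.items():
    for change, patients in by_alt.items():
        print(f"{chrom}:{pos} {change} -> {', '.join(patients)}")
```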