mskcc / tempo

CCS research pipeline to process WES and WGS TN pairs
https://cmotempo.netlify.com/
12 stars 5 forks source link

Create single somatic MAF #368

Closed evanbiederstedt closed 5 years ago

evanbiederstedt commented 5 years ago

""" 2. Single somatic MAF file • Includes all somatic mutations (point/indels) along with annotation, filtering, FACETS CCFs, neoantigen predictions (neo_ columns from somatic_variants/neoantigen/neoantigen/.neoantigens.maf), etc. • No other MAF to be retained at this time """

MAFs are the worst file format known to man. They are massive tab-delimited text files with dozens and dozens of columns. The first line is something like version #2.4 https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/

RE: MAF for each TN pair Basically, we now have 3 MAFs, there big ridiculous data.tables. For each TN pair, we should make this one MAF---add columns to the main MAF.

This is R data.table with merge: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html

RE: MAF combined across TN pairs These MAFs can be big, so creating a massive MAF over 1000s of samples may require lots of RAM----let's think about those, or informalities

evanbiederstedt commented 5 years ago

RE: MAF for each TN pair

For a single TN pair, I currently see we output 4 MAFs:

$  find . -name *maf
./germline_variants/mutations/BRCA_00067-T_vs_BRCA_00067-N.germline.maf
./somatic_variants/neoantigen/neoantigen/BRCA_00067-T.neoantigens.maf
./somatic_variants/mutations/BRCA_00067-T_vs_BRCA_00067-N.maf
./somatic_variants/facets_maf
./somatic_variants/facets_maf/BRCA_00067-T_vs_BRCA_00067-N.facets.maf

(1) facets_maf is the same as BRCA_00067-T_vs_BRCA_00067-N.facets.maf, so we can delete one

(2) The FACETS MAF has the 35 extra columns of an "FACETS-annotated MAF" against the original MAF. Therefore, we can delete BRCA_00067-T_vs_BRCA_00067-N.maf and work with BRCA_00067-T_vs_BRCA_00067-N.facets.maf

(3) The neoantigens MAF is currently the using the original MAF.

Here's one approach in R:

library(data.table)
maf_file = fread("BRCA_00067-T_vs_BRCA_00067-N.facets.maf") ## 210 columns
neo_maf = fread("BRCA_00067-T.neoantigens.maf")  ## 186 columns, 11 of which we need

neoantigen_cols = c("neo_maf_identifier_key", "neo_best_icore_peptide", "neo_best_rank", "neo_best_binding_affinity", "neo_best_binder_class", "neo_best_is_in_wt_peptidome", "neo_best_algorithm", "neo_best_hla_allele","neo_n_peptides_evaluated", "neo_n_strong_binders", "neo_n_weak_binders")       

final_maf = cbind(maf_file, neo_maf[, ..neoantigen_cols])
fwrite(final_maf, "final_analysis.maf", sep="/t")

And now there is a final MAF final_analysis.maf for the TN pair.

EDIT: We could also do this with pandas from python DataFrame with merge and to_csv(), but I actually think R's data.table is more RAM efficient, and more computationally performant.

evanbiederstedt commented 5 years ago

RE: MAF combined across TN pairs

Let's assume we have multiple final_analysis.maf for each TN pair. e.g. BRCA_00067-T_vs_BRCA_00067-N.final_analysis.maf,BRCA_00117-T_vs_BRCA_00117-N.final_analysis.maf`

Here is how we could combine them in R (which excels for these sorts of massive data.table operations):

library(data.table)

## as an example, using data_dir; maybe it's getwd(), etc. 

list_of_maf_files = list.files(data_dir, "\\.maf$", full.names = TRUE)

## read in the data.table's into a list, possible to use mclapply, but let's avoid this for now
## list_of_dts = lapply(list_of_maf_files, fread)

huge_maf = rbindlist(lapply(list_of_maf_files, fread))
### EDIT this is wrong ---> fwrite(huge_maf, "massive_final_analysis.maf", sep="/t")
fwrite(huge_maf, "massive_final_analysis.maf", sep="\t")

My only reservation here is that I have no idea how memory intensive this is. I believe at a certain point, data.table will start using cached memory.

allanbolipata commented 5 years ago

@Evan did you already do this? It looks like you did it... is this tested/ran anywhere?

I'm really confused.

evanbiederstedt commented 5 years ago

I'm really confused.

We haven't agreed upon how we are implementing this, so I went through each of these issues, and tried to define what precisely Barry/Mark want.

allanbolipata commented 5 years ago

@evanbiederstedt I tried your code and got this error using BRCA_00067-T_vs_BRCA_00067-N.facets.maf and BRCA_00067-T_vs_BRCA_00067-N.neoantigens.maf as inputs. You can see them at /juno/work/pi/cmopipeline/merge_mafs_data/ to test yourself.

This was where the error was.

> fwrite(final_maf, "final_analysis.maf", sep="/t")
Error in fwrite(final_maf, "final_analysis.maf", sep = "/t") :
  is.character(sep) && length(sep) == 1L && nchar(sep) == 1L is not TRUE

I am using an image from the Dockerfile in

containers/merge-facets-neoantigen/Dockerfile

on branch

feature/merge_mafs

I don't have a singularity image for it since I was testing it on my local machine.

Any ideas?

evanbiederstedt commented 5 years ago

@allanbolipata

Slight error above---apologies. I typed /t not \t. I'll edit. Please try fwrite(huge_maf, "massive_final_analysis.maf", sep="\t").

I've tried it out within R at /juno/work/pi/cmopipeline/merge_mafs_data

The output I have is hey_final_analysis.maf, which looks like a valid MAF to me.

allanbolipata commented 5 years ago

huge_maf = rbindlist(lapply(list_of_maf_files, fread))

The above didn't work for some reason, but below did. The "\t" fix did it.

@evanbiederstedt Can you confirm that that is actually how you wanna go about merging the maf?

library(data.table)
maf_file = fread("BRCA_00067-T_vs_BRCA_00067-N.facets.maf") ## 210 columns
neo_maf = fread("BRCA_00067-T_vs_BRCA_00067-N.neoantigens.maf")  ## 186 columns, 11 of which we need

neoantigen_cols = c("neo_maf_identifier_key", "neo_best_icore_peptide", "neo_best_rank", "neo_best_binding_affinity", "neo_best_binder_class", "neo_best_is_in_wt_peptidome", "neo_best_algorithm", "neo_best_hla_allele","neo_n_peptides_evaluated", "neo_n_strong_binders", "neo_n_weak_binders")       

final_maf = cbind(maf_file, neo_maf[, ..neoantigen_cols])
fwrite(final_maf, "final_analysis.maf", sep="\t")
allanbolipata commented 5 years ago

See https://github.com/mskcc/vaporware/blob/feature/merge_mafs/containers/merge-facets-neoantigen/merge_mafs.R

kpjonsson commented 5 years ago

This presumes the row order is retained in the neoantigen annotated MAF, do we know that to be true?

If so, would cut and paste be faster?

evanbiederstedt commented 5 years ago

This presumes the row order is retained in the neoantigen annotated MAF, do we know that to be true?

I just used an example MAF. Otherwise, I'm not really sure.

If so, would cut and paste be faster?

Possibly

evanbiederstedt commented 5 years ago

This presumes the row order is retained in the neoantigen annotated MAF, do we know that to be true?

Actually, I'm a bit confused by this @kpjonsson. What do you mean?

kpjonsson commented 5 years ago

The question was referring to https://github.com/mskcc/vaporware/blob/feature/merge_mafs/containers/merge-facets-neoantigen/merge_mafs.R

What am I asking is whether the order of mutations in MAF1 is

MutationA
MutationB
MutationC
[...]

and exactly the same in MAF2, or whether anything in the intervening steps changes it to for example

MutationB
MutationQ
MutationH
[...]
evanbiederstedt commented 5 years ago

Oh, I was thinking you were referring to the order of:


neoantigen_cols = c("neo_maf_identifier_key", "neo_best_icore_peptide", "neo_best_rank", "neo_best_binding_affinity", "neo_best_binder_class", "neo_best_is_in_wt_peptidome", "neo_best_algorithm", "neo_best_hla_allele","neo_n_peptides_evaluated", "neo_n_strong_binders", "neo_n_weak_binders")       

The R script just reads in 2 MAFs, and adds the above columns neoantigen_cols to the original MAF. The order doesn't matter, if the columns exist (based on my fiddling/"tests")

kpjonsson commented 5 years ago

Right, but since you're essentially pasteing, you are assuming that the rowwise order is retained, and that's what I'm asking whether we know to be true or not.

Most likely it is, I just think we should make sure.

evanbiederstedt commented 5 years ago

Philip says close this issue, cuz' we're iksom såhär, helt kanon. Boom.

Right, but since you're essentially pasteing, you are assuming that the rowwise order is retained, and that's what I'm asking whether we know to be true or not.

Jävlar, maybe you're right....I finally see what you mean.

kpjonsson commented 5 years ago

Seems like it won't: https://github.com/taylor-lab/neoantigen-dev/blob/a873fc9cd42f4777200ee3d34000fb1b4c1539b7/neoantigen.py#L249

I still would run a small test to make sure.

evanbiederstedt commented 5 years ago

It's very tricky.

There are other tricks we could use to add columns to the other data.table. However, in order to make absolutely sure that the row order is conserved, we should merge the two data.tables.

It looks like there are 234 columns we could use :) That's a rather large setkey() operation...

kpjonsson commented 5 years ago

Looks fine to me:

test1 = fread(
    '/ifs/res/taylorlab/jonssonp/vaporware_test/test_dev_2019-07-03/s_C_000008_T001_d-T_vs_s_C_000008_N001_d-N/somatic_variants/mutations/s_C_000008_T001_d-T_vs_s_C_000008_N001_d-N.somatic.maf'
)

test2 = fread(
    '/ifs/res/taylorlab/jonssonp/vaporware_test/test_dev_2019-07-03/s_C_000008_T001_d-T_vs_s_C_000008_N001_d-N/somatic_variants/neoantigen/neoantigen/s_C_000008_T001_d-T.neoantigens.maf'
)

dim(test1)
[1] 108 197

dim(test2)
[1] 108 208

test1 = mutate(test1, tag = paste(Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2, sep = ':'))
test2 = mutate(test2, tag = paste(Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2, sep = ':'))

table(test1$tag == test2$tag)
TRUE 
108 
kpjonsson commented 5 years ago

huge_maf = rbindlist(lapply(list_of_maf_files, fread))

The above didn't work for some reason, but below did. The "\t" fix did it.

@evanbiederstedt Can you confirm that that is actually how you wanna go about merging the maf?

library(data.table)
maf_file = fread("BRCA_00067-T_vs_BRCA_00067-N.facets.maf") ## 210 columns
neo_maf = fread("BRCA_00067-T_vs_BRCA_00067-N.neoantigens.maf")  ## 186 columns, 11 of which we need

neoantigen_cols = c("neo_maf_identifier_key", "neo_best_icore_peptide", "neo_best_rank", "neo_best_binding_affinity", "neo_best_binder_class", "neo_best_is_in_wt_peptidome", "neo_best_algorithm", "neo_best_hla_allele","neo_n_peptides_evaluated", "neo_n_strong_binders", "neo_n_weak_binders")       

final_maf = cbind(maf_file, neo_maf[, ..neoantigen_cols])
fwrite(final_maf, "final_analysis.maf", sep="\t")

In develop we are passing the already FACETS-annotated MAF into the neoantigen process: https://github.com/mskcc/vaporware/blob/def19c68d5c3a978506e1db5cbd1e8879c92e270/pipeline.nf#L1462

Thus, this isn't necessary.

allanbolipata commented 5 years ago

Ah. I probably should've realized this as I was making it. WHOOPS.

evanbiederstedt commented 5 years ago

As discussed on Slack, this makes the most sense. The *BRCA_00067-T_vs_BRCA_00067-N.neoantigens.maf is the final MAF---it should have 243 columns

> dim(foo)
[1] 136 243
> colnames(foo)
  [1] "Hugo_Symbol"                   "Entrez_Gene_Id"               
  [3] "Center"                        "NCBI_Build"                   
  [5] "Chromosome"                    "Start_Position"               
  [7] "End_Position"                  "Strand"                       
  [9] "Variant_Classification"        "Variant_Type"                 
 [11] "Reference_Allele"              "Tumor_Seq_Allele1"            
 [13] "Tumor_Seq_Allele2"             "dbSNP_RS"                     
 [15] "dbSNP_Val_Status"              "Tumor_Sample_Barcode"         
 [17] "Matched_Norm_Sample_Barcode"   "Match_Norm_Seq_Allele1"       
 [19] "Match_Norm_Seq_Allele2"        "Tumor_Validation_Allele1"     
 [21] "Tumor_Validation_Allele2"      "Match_Norm_Validation_Allele1"
 [23] "Match_Norm_Validation_Allele2" "Verification_Status"          
 [25] "Validation_Status"             "Mutation_Status"              
 [27] "Sequencing_Phase"              "Sequence_Source"              
 [29] "Validation_Method"             "Score"                        
 [31] "BAM_File"                      "Sequencer"                    
 [33] "Tumor_Sample_UUID"             "Matched_Norm_Sample_UUID"     
 [35] "HGVSc"                         "HGVSp"                        
 [37] "HGVSp_Short"                   "Transcript_ID"                
 [39] "Exon_Number"                   "t_depth"                      
 [41] "t_ref_count"                   "t_alt_count"                  
 [43] "n_depth"                       "n_ref_count"                  
 [45] "n_alt_count"                   "all_effects"                  
 [47] "Allele"                        "Gene"                         
 [49] "Feature"                       "Feature_type"                 
 [51] "Consequence"                   "cDNA_position"                
 [53] "CDS_position"                  "Protein_position"             
 [55] "Amino_acids"                   "Codons"                       
 [57] "Existing_variation"            "ALLELE_NUM"                   
 [59] "DISTANCE"                      "STRAND_VEP"                   
 [61] "SYMBOL"                        "SYMBOL_SOURCE"                
 [63] "HGNC_ID"                       "BIOTYPE"                      
 [65] "CANONICAL"                     "CCDS"                         
 [67] "ENSP"                          "SWISSPROT"                    
 [69] "TREMBL"                        "UNIPARC"                      
 [71] "RefSeq"                        "SIFT"                         
 [73] "PolyPhen"                      "EXON"                         
 [75] "INTRON"                        "DOMAINS"                      
 [77] "AF"                            "AFR_AF"                       
 [79] "AMR_AF"                        "ASN_AF"                       
 [81] "EAS_AF"                        "EUR_AF"                       
 [83] "SAS_AF"                        "AA_AF"                        
 [85] "EA_AF"                         "CLIN_SIG"                     
 [87] "SOMATIC"                       "PUBMED"                       
 [89] "MOTIF_NAME"                    "MOTIF_POS"                    
 [91] "HIGH_INF_POS"                  "MOTIF_SCORE_CHANGE"           
 [93] "IMPACT"                        "PICK"                         
 [95] "VARIANT_CLASS"                 "TSL"                          
 [97] "HGVS_OFFSET"                   "PHENO"                        
 [99] "MINIMISED"                     "ExAC_AF"                      
[101] "ExAC_AF_AFR"                   "ExAC_AF_AMR"                  
[103] "ExAC_AF_EAS"                   "ExAC_AF_FIN"                  
[105] "ExAC_AF_NFE"                   "ExAC_AF_OTH"                  
[107] "ExAC_AF_SAS"                   "GENE_PHENO"                   
[109] "FILTER"                        "flanking_bps"                 
[111] "vcf_id"                        "vcf_qual"                     
[113] "ExAC_AF_Adj"                   "ExAC_AC_AN_Adj"               
[115] "ExAC_AC_AN"                    "ExAC_AC_AN_AFR"               
[117] "ExAC_AC_AN_AMR"                "ExAC_AC_AN_EAS"               
[119] "ExAC_AC_AN_FIN"                "ExAC_AC_AN_NFE"               
[121] "ExAC_AC_AN_OTH"                "ExAC_AC_AN_SAS"               
[123] "ExAC_FILTER"                   "gnomAD_AF"                    
[125] "gnomAD_AFR_AF"                 "gnomAD_AMR_AF"                
[127] "gnomAD_ASJ_AF"                 "gnomAD_EAS_AF"                
[129] "gnomAD_FIN_AF"                 "gnomAD_NFE_AF"                
[131] "gnomAD_OTH_AF"                 "gnomAD_SAS_AF"                
[133] "vcf_pos"                       "MuTect2"                      
[135] "Strelka2"                      "Strelka2FILTER"               
[137] "RepeatMasker"                  "EncodeDacMapability"          
[139] "PoN"                           "gnomAD_FILTER"                
[141] "non_cancer_AC_nfe_onf"         "non_cancer_AF_nfe_onf"        
[143] "non_cancer_AC_nfe_seu"         "non_cancer_AF_nfe_seu"        
[145] "non_cancer_AC_eas"             "non_cancer_AF_eas"            
[147] "non_cancer_AC_asj"             "non_cancer_AF_asj"            
[149] "non_cancer_AC_afr"             "non_cancer_AF_afr"            
[151] "non_cancer_AC_amr"             "non_cancer_AF_amr"            
[153] "non_cancer_AC_nfe_nwe"         "non_cancer_AF_nfe_nwe"        
[155] "non_cancer_AC_nfe"             "non_cancer_AF_nfe"            
[157] "non_cancer_AC_nfe_swe"         "non_cancer_AF_nfe_swe"        
[159] "non_cancer_AC"                 "non_cancer_AF"                
[161] "non_cancer_AC_fin"             "non_cancer_AF_fin"            
[163] "non_cancer_AC_eas_oea"         "non_cancer_AF_eas_oea"        
[165] "non_cancer_AC_raw"             "non_cancer_AF_raw"            
[167] "non_cancer_AC_sas"             "non_cancer_AF_sas"            
[169] "non_cancer_AC_eas_kor"         "non_cancer_AF_eas_kor"        
[171] "non_cancer_AC_popmax"          "non_cancer_AF_popmax"         
[173] "Ref_Tri"                       "mutation_effect"              
[175] "oncogenic"                     "LEVEL_1"                      
[177] "LEVEL_2A"                      "LEVEL_2B"                     
[179] "LEVEL_3A"                      "LEVEL_3B"                     
[181] "LEVEL_4"                       "LEVEL_R1"                     
[183] "LEVEL_R2"                      "LEVEL_R3"                     
[185] "Highest_level"                 "citations"                    
[187] "t_var_freq"                    "n_var_freq"                   
[189] "residue"                       "start_residue"                
[191] "end_residue"                   "variant_length"               
[193] "snv_hotspot"                   "threeD_hotspot"               
[195] "indel_hotspot_type"            "indel_hotspot"                
[197] "Hotspot"                       "dipLogR"                      
[199] "cf"                            "tcn"                          
[201] "lcn"                           "cf.em"                        
[203] "tcn.em"                        "lcn.em"                       
[205] "purity"                        "ploidy"                       
[207] "expected_alt_copies"           "ccf_Mcopies"                  
[209] "ccf_Mcopies_lower"             "ccf_Mcopies_upper"            
[211] "ccf_Mcopies_prob95"            "ccf_Mcopies_prob90"           
[213] "ccf_1copy"                     "ccf_1copy_lower"              
[215] "ccf_1copy_upper"               "ccf_1copy_prob95"             
[217] "ccf_1copy_prob90"              "ccf_Mcopies_em"               
[219] "ccf_Mcopies_lower_em"          "ccf_Mcopies_upper_em"         
[221] "ccf_Mcopies_prob95_em"         "ccf_Mcopies_prob90_em"        
[223] "ccf_1copy_em"                  "ccf_1copy_lower_em"           
[225] "ccf_1copy_upper_em"            "ccf_1copy_prob95_em"          
[227] "ccf_1copy_prob90_em"           "ccf_expected_copies_em"       
[229] "ccf_expected_lower_em"         "ccf_expected_upper_em"        
[231] "ccf_expected_prob95_em"        "ccf_expected_prob90_em"       
[233] "neo_maf_identifier_key"        "neo_best_icore_peptide"       
[235] "neo_best_rank"                 "neo_best_binding_affinity"    
[237] "neo_best_binder_class"         "neo_best_is_in_wt_peptidome"  
[239] "neo_best_algorithm"            "neo_best_hla_allele"          
[241] "neo_n_peptides_evaluated"      "neo_n_strong_binders"         
[243] "neo_n_weak_binders"           
evanbiederstedt commented 5 years ago

I actually deserve the most blame, as I wasn't sure 18 days ago whether it made the most sense to pass through the MAF, or quickly do a data.table merge on the current outputs.