Closed evanbiederstedt closed 5 years ago
RE: MAF for each TN pair
For a single TN pair, I currently see we output 4 MAFs:
$ find . -name *maf
./germline_variants/mutations/BRCA_00067-T_vs_BRCA_00067-N.germline.maf
./somatic_variants/neoantigen/neoantigen/BRCA_00067-T.neoantigens.maf
./somatic_variants/mutations/BRCA_00067-T_vs_BRCA_00067-N.maf
./somatic_variants/facets_maf
./somatic_variants/facets_maf/BRCA_00067-T_vs_BRCA_00067-N.facets.maf
(1) facets_maf
is the same as BRCA_00067-T_vs_BRCA_00067-N.facets.maf
, so we can delete one
(2) The FACETS MAF has the 35 extra columns of an "FACETS-annotated MAF" against the original MAF. Therefore, we can delete BRCA_00067-T_vs_BRCA_00067-N.maf
and work with BRCA_00067-T_vs_BRCA_00067-N.facets.maf
(3) The neoantigens MAF is currently the using the original MAF.
Here's one approach in R:
library(data.table)
maf_file = fread("BRCA_00067-T_vs_BRCA_00067-N.facets.maf") ## 210 columns
neo_maf = fread("BRCA_00067-T.neoantigens.maf") ## 186 columns, 11 of which we need
neoantigen_cols = c("neo_maf_identifier_key", "neo_best_icore_peptide", "neo_best_rank", "neo_best_binding_affinity", "neo_best_binder_class", "neo_best_is_in_wt_peptidome", "neo_best_algorithm", "neo_best_hla_allele","neo_n_peptides_evaluated", "neo_n_strong_binders", "neo_n_weak_binders")
final_maf = cbind(maf_file, neo_maf[, ..neoantigen_cols])
fwrite(final_maf, "final_analysis.maf", sep="/t")
And now there is a final MAF final_analysis.maf
for the TN pair.
EDIT: We could also do this with pandas from python DataFrame
with merge
and to_csv()
, but I actually think R's data.table
is more RAM efficient, and more computationally performant.
RE: MAF combined across TN pairs
Let's assume we have multiple final_analysis.maf
for each TN pair. e.g. BRCA_00067-T_vs_BRCA_00067-N.final_analysis.maf,
BRCA_00117-T_vs_BRCA_00117-N.final_analysis.maf`
Here is how we could combine them in R (which excels for these sorts of massive data.table operations):
library(data.table)
## as an example, using data_dir; maybe it's getwd(), etc.
list_of_maf_files = list.files(data_dir, "\\.maf$", full.names = TRUE)
## read in the data.table's into a list, possible to use mclapply, but let's avoid this for now
## list_of_dts = lapply(list_of_maf_files, fread)
huge_maf = rbindlist(lapply(list_of_maf_files, fread))
### EDIT this is wrong ---> fwrite(huge_maf, "massive_final_analysis.maf", sep="/t")
fwrite(huge_maf, "massive_final_analysis.maf", sep="\t")
My only reservation here is that I have no idea how memory intensive this is. I believe at a certain point, data.table
will start using cached memory.
@Evan did you already do this? It looks like you did it... is this tested/ran anywhere?
I'm really confused.
I'm really confused.
We haven't agreed upon how we are implementing this, so I went through each of these issues, and tried to define what precisely Barry/Mark want.
@evanbiederstedt I tried your code and got this error using BRCA_00067-T_vs_BRCA_00067-N.facets.maf
and BRCA_00067-T_vs_BRCA_00067-N.neoantigens.maf
as inputs. You can see them at /juno/work/pi/cmopipeline/merge_mafs_data/
to test yourself.
This was where the error was.
> fwrite(final_maf, "final_analysis.maf", sep="/t")
Error in fwrite(final_maf, "final_analysis.maf", sep = "/t") :
is.character(sep) && length(sep) == 1L && nchar(sep) == 1L is not TRUE
I am using an image from the Dockerfile in
containers/merge-facets-neoantigen/Dockerfile
on branch
feature/merge_mafs
I don't have a singularity image for it since I was testing it on my local machine.
Any ideas?
@allanbolipata
Slight error above---apologies. I typed /t
not \t
. I'll edit.
Please try fwrite(huge_maf, "massive_final_analysis.maf", sep="\t")
.
I've tried it out within R at /juno/work/pi/cmopipeline/merge_mafs_data
The output I have is hey_final_analysis.maf
, which looks like a valid MAF to me.
huge_maf = rbindlist(lapply(list_of_maf_files, fread))
The above didn't work for some reason, but below did. The "\t" fix did it.
@evanbiederstedt Can you confirm that that is actually how you wanna go about merging the maf?
library(data.table)
maf_file = fread("BRCA_00067-T_vs_BRCA_00067-N.facets.maf") ## 210 columns
neo_maf = fread("BRCA_00067-T_vs_BRCA_00067-N.neoantigens.maf") ## 186 columns, 11 of which we need
neoantigen_cols = c("neo_maf_identifier_key", "neo_best_icore_peptide", "neo_best_rank", "neo_best_binding_affinity", "neo_best_binder_class", "neo_best_is_in_wt_peptidome", "neo_best_algorithm", "neo_best_hla_allele","neo_n_peptides_evaluated", "neo_n_strong_binders", "neo_n_weak_binders")
final_maf = cbind(maf_file, neo_maf[, ..neoantigen_cols])
fwrite(final_maf, "final_analysis.maf", sep="\t")
This presumes the row order is retained in the neoantigen annotated MAF, do we know that to be true?
If so, would cut
and paste
be faster?
This presumes the row order is retained in the neoantigen annotated MAF, do we know that to be true?
I just used an example MAF. Otherwise, I'm not really sure.
If so, would cut and paste be faster?
Possibly
This presumes the row order is retained in the neoantigen annotated MAF, do we know that to be true?
Actually, I'm a bit confused by this @kpjonsson. What do you mean?
The question was referring to https://github.com/mskcc/vaporware/blob/feature/merge_mafs/containers/merge-facets-neoantigen/merge_mafs.R
What am I asking is whether the order of mutations in MAF1 is
MutationA
MutationB
MutationC
[...]
and exactly the same in MAF2, or whether anything in the intervening steps changes it to for example
MutationB
MutationQ
MutationH
[...]
Oh, I was thinking you were referring to the order of:
neoantigen_cols = c("neo_maf_identifier_key", "neo_best_icore_peptide", "neo_best_rank", "neo_best_binding_affinity", "neo_best_binder_class", "neo_best_is_in_wt_peptidome", "neo_best_algorithm", "neo_best_hla_allele","neo_n_peptides_evaluated", "neo_n_strong_binders", "neo_n_weak_binders")
The R script just reads in 2 MAFs, and adds the above columns neoantigen_cols
to the original MAF. The order doesn't matter, if the columns exist (based on my fiddling/"tests")
Right, but since you're essentially paste
ing, you are assuming that the rowwise order is retained, and that's what I'm asking whether we know to be true or not.
Most likely it is, I just think we should make sure.
Philip says close this issue, cuz' we're iksom såhär, helt kanon. Boom.
Right, but since you're essentially pasteing, you are assuming that the rowwise order is retained, and that's what I'm asking whether we know to be true or not.
Jävlar, maybe you're right....I finally see what you mean.
Seems like it won't: https://github.com/taylor-lab/neoantigen-dev/blob/a873fc9cd42f4777200ee3d34000fb1b4c1539b7/neoantigen.py#L249
I still would run a small test to make sure.
It's very tricky.
There are other tricks we could use to add columns to the other data.table. However, in order to make absolutely sure that the row order is conserved, we should merge the two data.tables.
It looks like there are 234 columns we could use :) That's a rather large setkey()
operation...
Looks fine to me:
test1 = fread(
'/ifs/res/taylorlab/jonssonp/vaporware_test/test_dev_2019-07-03/s_C_000008_T001_d-T_vs_s_C_000008_N001_d-N/somatic_variants/mutations/s_C_000008_T001_d-T_vs_s_C_000008_N001_d-N.somatic.maf'
)
test2 = fread(
'/ifs/res/taylorlab/jonssonp/vaporware_test/test_dev_2019-07-03/s_C_000008_T001_d-T_vs_s_C_000008_N001_d-N/somatic_variants/neoantigen/neoantigen/s_C_000008_T001_d-T.neoantigens.maf'
)
dim(test1)
[1] 108 197
dim(test2)
[1] 108 208
test1 = mutate(test1, tag = paste(Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2, sep = ':'))
test2 = mutate(test2, tag = paste(Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2, sep = ':'))
table(test1$tag == test2$tag)
TRUE
108
huge_maf = rbindlist(lapply(list_of_maf_files, fread))
The above didn't work for some reason, but below did. The "\t" fix did it.
@evanbiederstedt Can you confirm that that is actually how you wanna go about merging the maf?
library(data.table) maf_file = fread("BRCA_00067-T_vs_BRCA_00067-N.facets.maf") ## 210 columns neo_maf = fread("BRCA_00067-T_vs_BRCA_00067-N.neoantigens.maf") ## 186 columns, 11 of which we need neoantigen_cols = c("neo_maf_identifier_key", "neo_best_icore_peptide", "neo_best_rank", "neo_best_binding_affinity", "neo_best_binder_class", "neo_best_is_in_wt_peptidome", "neo_best_algorithm", "neo_best_hla_allele","neo_n_peptides_evaluated", "neo_n_strong_binders", "neo_n_weak_binders") final_maf = cbind(maf_file, neo_maf[, ..neoantigen_cols]) fwrite(final_maf, "final_analysis.maf", sep="\t")
In develop
we are passing the already FACETS-annotated MAF into the neoantigen process:
https://github.com/mskcc/vaporware/blob/def19c68d5c3a978506e1db5cbd1e8879c92e270/pipeline.nf#L1462
Thus, this isn't necessary.
Ah. I probably should've realized this as I was making it. WHOOPS.
As discussed on Slack, this makes the most sense. The *BRCA_00067-T_vs_BRCA_00067-N.neoantigens.maf
is the final MAF---it should have 243 columns
> dim(foo)
[1] 136 243
> colnames(foo)
[1] "Hugo_Symbol" "Entrez_Gene_Id"
[3] "Center" "NCBI_Build"
[5] "Chromosome" "Start_Position"
[7] "End_Position" "Strand"
[9] "Variant_Classification" "Variant_Type"
[11] "Reference_Allele" "Tumor_Seq_Allele1"
[13] "Tumor_Seq_Allele2" "dbSNP_RS"
[15] "dbSNP_Val_Status" "Tumor_Sample_Barcode"
[17] "Matched_Norm_Sample_Barcode" "Match_Norm_Seq_Allele1"
[19] "Match_Norm_Seq_Allele2" "Tumor_Validation_Allele1"
[21] "Tumor_Validation_Allele2" "Match_Norm_Validation_Allele1"
[23] "Match_Norm_Validation_Allele2" "Verification_Status"
[25] "Validation_Status" "Mutation_Status"
[27] "Sequencing_Phase" "Sequence_Source"
[29] "Validation_Method" "Score"
[31] "BAM_File" "Sequencer"
[33] "Tumor_Sample_UUID" "Matched_Norm_Sample_UUID"
[35] "HGVSc" "HGVSp"
[37] "HGVSp_Short" "Transcript_ID"
[39] "Exon_Number" "t_depth"
[41] "t_ref_count" "t_alt_count"
[43] "n_depth" "n_ref_count"
[45] "n_alt_count" "all_effects"
[47] "Allele" "Gene"
[49] "Feature" "Feature_type"
[51] "Consequence" "cDNA_position"
[53] "CDS_position" "Protein_position"
[55] "Amino_acids" "Codons"
[57] "Existing_variation" "ALLELE_NUM"
[59] "DISTANCE" "STRAND_VEP"
[61] "SYMBOL" "SYMBOL_SOURCE"
[63] "HGNC_ID" "BIOTYPE"
[65] "CANONICAL" "CCDS"
[67] "ENSP" "SWISSPROT"
[69] "TREMBL" "UNIPARC"
[71] "RefSeq" "SIFT"
[73] "PolyPhen" "EXON"
[75] "INTRON" "DOMAINS"
[77] "AF" "AFR_AF"
[79] "AMR_AF" "ASN_AF"
[81] "EAS_AF" "EUR_AF"
[83] "SAS_AF" "AA_AF"
[85] "EA_AF" "CLIN_SIG"
[87] "SOMATIC" "PUBMED"
[89] "MOTIF_NAME" "MOTIF_POS"
[91] "HIGH_INF_POS" "MOTIF_SCORE_CHANGE"
[93] "IMPACT" "PICK"
[95] "VARIANT_CLASS" "TSL"
[97] "HGVS_OFFSET" "PHENO"
[99] "MINIMISED" "ExAC_AF"
[101] "ExAC_AF_AFR" "ExAC_AF_AMR"
[103] "ExAC_AF_EAS" "ExAC_AF_FIN"
[105] "ExAC_AF_NFE" "ExAC_AF_OTH"
[107] "ExAC_AF_SAS" "GENE_PHENO"
[109] "FILTER" "flanking_bps"
[111] "vcf_id" "vcf_qual"
[113] "ExAC_AF_Adj" "ExAC_AC_AN_Adj"
[115] "ExAC_AC_AN" "ExAC_AC_AN_AFR"
[117] "ExAC_AC_AN_AMR" "ExAC_AC_AN_EAS"
[119] "ExAC_AC_AN_FIN" "ExAC_AC_AN_NFE"
[121] "ExAC_AC_AN_OTH" "ExAC_AC_AN_SAS"
[123] "ExAC_FILTER" "gnomAD_AF"
[125] "gnomAD_AFR_AF" "gnomAD_AMR_AF"
[127] "gnomAD_ASJ_AF" "gnomAD_EAS_AF"
[129] "gnomAD_FIN_AF" "gnomAD_NFE_AF"
[131] "gnomAD_OTH_AF" "gnomAD_SAS_AF"
[133] "vcf_pos" "MuTect2"
[135] "Strelka2" "Strelka2FILTER"
[137] "RepeatMasker" "EncodeDacMapability"
[139] "PoN" "gnomAD_FILTER"
[141] "non_cancer_AC_nfe_onf" "non_cancer_AF_nfe_onf"
[143] "non_cancer_AC_nfe_seu" "non_cancer_AF_nfe_seu"
[145] "non_cancer_AC_eas" "non_cancer_AF_eas"
[147] "non_cancer_AC_asj" "non_cancer_AF_asj"
[149] "non_cancer_AC_afr" "non_cancer_AF_afr"
[151] "non_cancer_AC_amr" "non_cancer_AF_amr"
[153] "non_cancer_AC_nfe_nwe" "non_cancer_AF_nfe_nwe"
[155] "non_cancer_AC_nfe" "non_cancer_AF_nfe"
[157] "non_cancer_AC_nfe_swe" "non_cancer_AF_nfe_swe"
[159] "non_cancer_AC" "non_cancer_AF"
[161] "non_cancer_AC_fin" "non_cancer_AF_fin"
[163] "non_cancer_AC_eas_oea" "non_cancer_AF_eas_oea"
[165] "non_cancer_AC_raw" "non_cancer_AF_raw"
[167] "non_cancer_AC_sas" "non_cancer_AF_sas"
[169] "non_cancer_AC_eas_kor" "non_cancer_AF_eas_kor"
[171] "non_cancer_AC_popmax" "non_cancer_AF_popmax"
[173] "Ref_Tri" "mutation_effect"
[175] "oncogenic" "LEVEL_1"
[177] "LEVEL_2A" "LEVEL_2B"
[179] "LEVEL_3A" "LEVEL_3B"
[181] "LEVEL_4" "LEVEL_R1"
[183] "LEVEL_R2" "LEVEL_R3"
[185] "Highest_level" "citations"
[187] "t_var_freq" "n_var_freq"
[189] "residue" "start_residue"
[191] "end_residue" "variant_length"
[193] "snv_hotspot" "threeD_hotspot"
[195] "indel_hotspot_type" "indel_hotspot"
[197] "Hotspot" "dipLogR"
[199] "cf" "tcn"
[201] "lcn" "cf.em"
[203] "tcn.em" "lcn.em"
[205] "purity" "ploidy"
[207] "expected_alt_copies" "ccf_Mcopies"
[209] "ccf_Mcopies_lower" "ccf_Mcopies_upper"
[211] "ccf_Mcopies_prob95" "ccf_Mcopies_prob90"
[213] "ccf_1copy" "ccf_1copy_lower"
[215] "ccf_1copy_upper" "ccf_1copy_prob95"
[217] "ccf_1copy_prob90" "ccf_Mcopies_em"
[219] "ccf_Mcopies_lower_em" "ccf_Mcopies_upper_em"
[221] "ccf_Mcopies_prob95_em" "ccf_Mcopies_prob90_em"
[223] "ccf_1copy_em" "ccf_1copy_lower_em"
[225] "ccf_1copy_upper_em" "ccf_1copy_prob95_em"
[227] "ccf_1copy_prob90_em" "ccf_expected_copies_em"
[229] "ccf_expected_lower_em" "ccf_expected_upper_em"
[231] "ccf_expected_prob95_em" "ccf_expected_prob90_em"
[233] "neo_maf_identifier_key" "neo_best_icore_peptide"
[235] "neo_best_rank" "neo_best_binding_affinity"
[237] "neo_best_binder_class" "neo_best_is_in_wt_peptidome"
[239] "neo_best_algorithm" "neo_best_hla_allele"
[241] "neo_n_peptides_evaluated" "neo_n_strong_binders"
[243] "neo_n_weak_binders"
I actually deserve the most blame, as I wasn't sure 18 days ago whether it made the most sense to pass through the MAF, or quickly do a data.table merge on the current outputs.
""" 2. Single somatic MAF file • Includes all somatic mutations (point/indels) along with annotation, filtering, FACETS CCFs, neoantigen predictions (neo_ columns from somatic_variants/neoantigen/neoantigen/.neoantigens.maf), etc. • No other MAF to be retained at this time """
MAFs are the worst file format known to man. They are massive tab-delimited text files with dozens and dozens of columns. The first line is something like
version #2.4
https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/RE: MAF for each TN pair Basically, we now have 3 MAFs, there big ridiculous data.tables. For each TN pair, we should make this one MAF---add columns to the main MAF.
This is R data.table with merge: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
RE: MAF combined across TN pairs These MAFs can be big, so creating a massive MAF over 1000s of samples may require lots of RAM----let's think about those, or informalities