waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/
69 stars 32 forks source link

chr/start/stop nomenclature disagreements between maftools and GRanges #308

Closed grindhej closed 2 years ago

grindhej commented 2 years ago

maftools likes chrom/start/stop variables to be specifically called Chromosome/Start_Position/End_Position

However, in maftools maf2mae() function, they are conversted to seqnames/start/end for GRanges. See maf2mae function below, for reference

maf2mae function (m = NULL) { if (any(!requireNamespace("RaggedExperiment") | !requireNamespace("MultiAssayExperiment"))) { stop("Converting to MultiAssayExperiment requires RaggedExperiment and MultiAssayExperiment packages!") } if (is.null(m)) { stop("Missing input MAF") } maf_coldata = S4Vectors::DataFrame(getClinicalData(m)) rownames(maf_coldata) = maf_coldata$Tumor_Sample_Barcode maf_nsyn_gr = GenomicRanges::GRanges(seqnames = m@data$Chromosome, ranges = IRanges::IRanges(start = m@data$Start_Position, end = m@data$End_Position), strand = "", S4Vectors::DataFrame(m@data[, setdiff(x = colnames(m@data), y = c("Chromosome", "Start_Position", "End_Position")), with = FALSE])) maf_nsyn_gr = split(maf_nsyn_gr, as.factor(maf_nsyn_gr$Tumor_Sample_Barcode)) maf_nsyn_re = RaggedExperiment::RaggedExperiment(maf_nsyn_gr, colData = maf_coldata) maf_syn_gr = GenomicRanges::GRanges(seqnames = m@maf.silent$Chromosome, ranges = IRanges::IRanges(start = m@maf.silent$Start_Position, end = m@maf.silent$End_Position), strand = "", S4Vectors::DataFrame(m@maf.silent[, setdiff(x = colnames(m@maf.silent), y = c("Chromosome", "Start_Position", "End_Position")), with = FALSE])) maf_syn_gr = split(maf_syn_gr, as.factor(maf_syn_gr$Tumor_Sample_Barcode)) maf_syn_re = RaggedExperiment::RaggedExperiment(maf_syn_gr, colData = maf_coldata) MultiAssayExperiment::MultiAssayExperiment(experiments = list(maf_nonSyn = maf_nsyn_re, maf_syn = maf_syn_re), metadata = list(summary = m@summary, variant.classification.summary = m@variant.classification.summary, variant.type.summary = m@variant.type.summary, variants.per.sample = m@variants.per.sample, gene.summary = m@gene.summary), colData = maf_coldata) }

It would be nice if MultiAssayExperimentToMAF converted them back to what maftools likes. Here's some example code where I pull a maf out of a mae with MultiAssayExperimentToMAF:

names(maf@data)[names(maf@data) == "seqnames"] = "Chromosome" names(maf@data)[names(maf@data) == "start"] = "Start_Position" names(maf@data)[names(maf@data) == "end"] = "End_Position" names(maf@maf.silent)[names(maf@maf.silent) == "seqnames"] = "Chromosome" names(maf@maf.silent)[names(maf@maf.silent) == "start"] = "Start_Position" names(maf@maf.silent)[names(maf@maf.silent) == "end"] = "End_Position"

I understand if there's some reason that isn't feasible.

Thanks!

LiNk-NY commented 2 years ago

@grindhej

I think this is something for maftools to implement since it is outside of the Bioconductor conventions. It could be implemented as a helper function to make the names maftools friendly and possibly applied automatically.

The best case scenario is that maftools supports both types of names (i.e., GRanges and maftools friendly). The GRanges names for sure increase interoperability with other packages that work on ranges. For example, with the right names, one can easily do makeGRangesFromDataFrame:

library(maftools)
library(MultiAssayExperiment)
laml.maf <- system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
laml.clin <- system.file('extdata', 'tcga_laml_annot.tsv', package = 'maftools')
laml <- read.maf(maf = laml.maf, clinicalData = laml.clin)
lamf <- maf2mae(laml)
maef <- MultiAssayExperimentToMAF(lamf)
makeGRangesFromDataFrame(maef@data, ignore.strand=TRUE)
GRanges object with 1732 ranges and 0 metadata columns:
         seqnames              ranges strand
            <Rle>           <IRanges>  <Rle>
     [1]       10            37455582      *
     [2]       20            35240439      *
     [3]        2            25457243      *
     [4]        2           209113112      *
     [5]        5 170834736-170834737      *
     ...      ...                 ...    ...
  [1728]       22            29688596      *
  [1729]        4            48577210      *
  [1730]       15            42641596      *
  [1731]       16     2812699-2812700      *
  [1732]       22            22324724      *
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

Note that the use of @ is not standard practice (https://contributions.bioconductor.org/r-code.html?q=accessor#code-syntax-and-efficiency).

Best regards, Marcel