varemo / piano

piano - An R/Bioconductor package for gene set analysis
https://varemo.github.io/piano/

How to make Gene Matrix Transposed (*.gmt) file or export from list like object of class GSC? #7

Closed janstrauss1 closed 4 years ago

janstrauss1 commented 4 years ago

Dear @varemo,

I'm currently trying to make a custom GMT file from a two-column data.frame containing genes and gene sets (Gene Ontology terms) that represent all gene-to-gene set connections.

My input data looks somewhat like this:

> head(my_GO)
         SYMBOL                                                     TERM_GOID
1 PF3D7_1038400                                       reproduction GO:0000003
2 PF3D7_0816800                                       reproduction GO:0000003
3 PF3D7_0623400                                       reproduction GO:0000003
4 PF3D7_0725200                                       reproduction GO:0000003
5 PF3D7_0605800 single-stranded DNA endodeoxyribonuclease activity GO:0000014
6 PF3D7_1015900                  phosphopyruvate hydratase complex GO:0000015

and I can successfully create a gene set collection using

myGSC <- loadGSC(file = my_GO, type = "auto") 

that I can use for runGSA.

However, I'd like to export a *.gmt file for my custom gene set collection, and I'm struggling to convert the list-like myGSC object into a data frame that I can write out as a *.gmt file.

Could you please provide me with some directions on what would be the best way to accomplish this?

Many thanks in advance!

Jan

varemo commented 4 years ago

Something along the following lines might work?

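# Build a long-format two-column matrix: gene set name, gene (one row per pair)
gmt <- vector()
for (i in seq_along(myGSC$gsc)) {
    # Each new block is prepended to the rows collected so far
    gmt <- rbind(cbind(names(myGSC$gsc)[i], myGSC$gsc[[i]]), gmt)
}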

janstrauss1 commented 4 years ago

Thanks a lot for your suggestion! Unfortunately, it's not quite what I'm after yet, as it returns a matrix like the following:

> head(gmt)
     [,1]                                                            [,2]           
[1,] "reproduction GO:0000003"                                       "PF3D7_1038400"
[2,] "reproduction GO:0000003"                                       "PF3D7_0816800"
[3,] "reproduction GO:0000003"                                       "PF3D7_0623400"
[4,] "reproduction GO:0000003"                                       "PF3D7_0725200"
[5,] "single-stranded DNA endodeoxyribonuclease activity GO:0000014" "PF3D7_0605800"
[6,] "phosphopyruvate hydratase complex GO:0000015"                  "PF3D7_1015900"

But what I'm trying to obtain is something like the following (transposed) data structure according to gmt file format conventions:

     [,1]      [,2]  [,3]  [,4]  [,5]
[1,] gene_set1 gene1 gene2 gene3 gene4
[2,] gene_set2 gene5 NA    NA    NA
[3,] gene_set3 gene6 NA    NA    NA
...

Any ideas how to get there?

Thanks again for your help!

Jan

varemo commented 4 years ago

Sorry, sloppy reading, missed the format you wanted. Check if the code below does better?

# Pad every gene set with NA up to the size of the largest set
maxGeneSetSize <- max(lengths(myGSC$gsc))
gmt <- lapply(myGSC$gsc, function(x) {length(x) <- maxGeneSetSize; x})
# One row per gene set, one column per gene position
gmt <- matrix(unlist(gmt), nrow = length(myGSC$gsc), byrow = TRUE)
rownames(gmt) <- names(myGSC$gsc)
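
For the actual export step, a GMT file is tab-delimited with one gene set per line: the set name, a description field, then the genes. A minimal sketch of writing such a file straight from the `$gsc` list might look like the following. The toy `myGSC` only stands in for the real object from loadGSC, `gsc_to_gmt_lines` is a hypothetical helper (not a piano function), the description is filled with a placeholder `"NA"`, and the output path is a temporary file.

```r
# Toy stand-in for the object returned by loadGSC(); only the $gsc element
# (a named list of character vectors) is used here.
myGSC <- list(gsc = list(
  "reproduction GO:0000003" = c("PF3D7_1038400", "PF3D7_0816800"),
  "phosphopyruvate hydratase complex GO:0000015" = "PF3D7_1015900"
))

# Hypothetical helper (not part of piano): builds one tab-separated line per
# gene set in the order name, description, genes. The description column is
# filled with a placeholder string.
gsc_to_gmt_lines <- function(gsc, description = "NA") {
  vapply(seq_along(gsc), function(i) {
    paste(c(names(gsc)[i], description, gsc[[i]]), collapse = "\t")
  }, character(1))
}

gmt_lines <- gsc_to_gmt_lines(myGSC$gsc)

# Placeholder output path; replace with the desired *.gmt file name.
gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)
```

Because GMT rows may have different lengths, no NA padding is needed when writing lines this way; the padded matrix is mainly useful for inspecting the collection in R.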

janstrauss1 commented 4 years ago

Hi @varemo, it works beautifully 👍 Many thanks for your help!