Open LTLA opened 5 years ago
Thanks for the pointer! Development is paused on my side at the moment (busy with a million other things), but I'm happy to see PRs that take this proof-of-concept package further!
I'm hardly loaded with free time myself, so I guess this will just have to wait.
Having said that, I don't understand why you've set it up in "long" format, i.e., where each entry is a (gene, gene set) pair. It seems like it would be much more intuitive to store it as a list - in particular, a CompressedCharacterList that gives you the list intuition with vector-like convenience.
sets_list <- list(
set1 = c("A", "B"),
set2 = c("B", "C", "D")
)
library(IRanges)
sets_list <- CharacterList(sets_list)
alt <- paste0("GENE_", unlist(sets_list)) # say you want to modify the characters...
new_sets_list <- relist(alt, sets_list)
Especially since most gene set analysis methods would accept lists of character vectors anyway.
I never took the time to reply to that point:
Having said that, I don't understand why you've set it up in "long" format, i.e., where each entry is a (gene, gene set) pair. It seems like it would be much more intuitive to store it as a list - in particular, a CompressedCharacterList that gives you the list intuition with vector-like convenience.
I haven't made a list of all the benefits of storing gene sets in long format, but to me the most appealing example is the ability to store metadata on elements, sets, and relationships between elements and sets.
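As a rough illustration of that point, the sketch below uses plain base R (not the unisets API) with made-up gene and set names. It keeps one row per (element, set) relation, attaches a hypothetical evidence column to each relation, and converts to a named list with split() only when a method expects one.

```r
# Long format: one row per (element, set) relation.
# Gene names, set names, and evidence codes are made up for illustration.
relations <- data.frame(
  element  = c("A", "B", "B", "C", "D"),
  set      = c("set1", "set1", "set2", "set2", "set2"),
  evidence = c("TAS", "IEA", "TAS", "TAS", "IEA"),
  stringsAsFactors = FALSE
)

# Per-relation metadata can be filtered directly...
curated <- relations[relations$evidence == "TAS", ]

# ...and the result converted to a list of character vectors on demand.
sets_list <- split(curated$element, curated$set)

# Going back from a list to long format (the evidence column is lost here).
long_again <- stack(sets_list)
```

The round trip through a plain list drops the per-relation column, which is the kind of information the long format is meant to keep.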
I'm not saying that it's not possible to build list-like objects that contain the same metadata (GSEABase set a standard there), but it was brought up multiple times that this type of list implementation suffers from poor performance beyond a few hundred sets (https://docs.google.com/document/d/1A3bs1rtbTo42Sgm9hPbLoG1lTGbQ-ITENaLRVyK2Njo/edit).
For illustration, here is what unisets can do at the moment, including reformatting with as.list(...).
suppressPackageStartupMessages({library(unisets)})
# Fetch a sample of GO annotations
suppressPackageStartupMessages({library(org.Hs.eg.db)})
go_sets <- import(org.Hs.egGO)
#> Loading required namespace: GO.db
#>
#> 'select()' returned 1:1 mapping between keys and columns
#> Coercing evidence to factor
#> Coercing ontology to factor
subset(go_sets, evidence == "TAS")
#> GOSets with 43182 relations between 10676 elements and 3918 sets
#> element set evidence ontology
#> <character> <character> <factor> <factor>
#> 1 1 GO:0002576 TAS BP
#> 2 1 GO:0043312 TAS BP
#> 3 2 GO:0002576 TAS BP
#> 4 2 GO:0007597 TAS BP
#> 5 2 GO:0022617 TAS BP
#> ... ... ... ...
#> 1 100533105 GO:0017080 TAS MF
#> 2 100533105 GO:0017081 TAS MF
#> 3 102157402 GO:0050145 TAS MF
#> 4 109703458 GO:0018812 TAS MF
#> 5 110354863 GO:0003700 TAS MF
#>
#> @elementData
#> EntrezIdVector of length 10676 with 10676 unique identifiers
#> Ids: 1, 10, 100, 1000, ...
#> Metadata: (0 columns)
#>
#> @setData
#> GOIdVector of length 3918 with 3918 unique identifiers
#> Ids: GO:0002576, GO:0043312, GO:0007597, GO:0010951, ...
#> Metadata: GOID, DEFINITION, ONTOLOGY, TERM (4 columns)
as.list(subset(go_sets, evidence == "TAS"), 10)
#> List of length 3918
#> names(3918): GO:0000002 GO:0000018 GO:0000019 GO:0000022 GO:0000026 ... GO:2001244 GO:2001256 GO:2001257 GO:2001301
For the sake of being comprehensive, here is a summary report on the current efforts toward implementations as tables: https://docs.google.com/document/d/1Lk6TLUuevidbLJvq36MFVY04GkvdrbF0ctGuB_BENaM/edit
Well, I can't say I understand the use cases that led you to make that decision, but whatever.
Anyway, if you proceed with the current class, you would make your life a lot easier by deriving BaseSets as an IndexedRelations subclass. You would have two partners in each relation, with indices pointing to DataFrames of per-gene and per-set information - no need to write a separate IdVector class.
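The layout described above can be sketched in base R. This is not the IndexedRelations API; the table and column names (genes, sets, relations, symbol, ontology) are hypothetical and only illustrate relations stored as integer indices into per-gene and per-set tables.

```r
# Per-gene and per-set tables; all names and values are made up for illustration.
genes <- data.frame(id = c("A", "B", "C", "D"),
                    symbol = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
                    stringsAsFactors = FALSE)
sets <- data.frame(id = c("set1", "set2"),
                   ontology = c("BP", "MF"),
                   stringsAsFactors = FALSE)

# Each relation is a pair of integer indices: a row of 'genes' and a row of 'sets'.
relations <- data.frame(gene = c(1L, 2L, 2L, 3L, 4L),
                        set  = c(1L, 1L, 2L, 2L, 2L))

# Per-gene and per-set metadata are recovered by indexing, never duplicated.
expanded <- data.frame(gene_id  = genes$id[relations$gene],
                       symbol   = genes$symbol[relations$gene],
                       set_id   = sets$id[relations$set],
                       ontology = sets$ontology[relations$set],
                       stringsAsFactors = FALSE)

# Vector-like operations act on the index pairs, e.g. ordering by set, then gene.
ordered <- relations[order(relations$set, relations$gene), ]
```

Because each gene and each set is described once, metadata lives in a single place and the relations themselves stay as compact integer pairs.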
Once the S4Vectors PR is merged, you'll get sensible c, sort, match etc. behaviour for free.
Couldn't help but notice your BaseSets class looks very similar to my IndexedRelations class here. Probably with some work, we could massage BaseSets into an inheritance hierarchy on IndexedRelations to get the most value for money. The most obvious requirement is to get the DataFrame to exhibit vector-like behaviour along rows, as discussed in Bioconductor/S4Vectors#32.