kevinrue / unisets

Collection of classes to store gene sets.
http://kevinrue.github.io/unisets

IndexedRelations #47

Open LTLA opened 5 years ago

LTLA commented 5 years ago

Couldn't help but notice your BaseSets class looks very similar to my IndexedRelations class here. Probably with some work, we could massage BaseSets into an inheritance hierarchy on IndexedRelations to get the most value for money. The most obvious requirement is to get the DataFrame to exhibit vector-like behaviour along rows, as discussed in Bioconductor/S4Vectors#32.

kevinrue commented 5 years ago

Thanks for the pointer! Development is paused at the moment on my side (busy with a million other things), but I'm happy to see PRs that take this proof-of-concept package further!

LTLA commented 5 years ago

I'm hardly loaded with free time myself, so I guess this will just have to wait.

Having said that, I don't understand why you've set it up in "long" format, i.e., where each entry is a (gene, gene set) pair. It seems like it would be much more intuitive to store it as a list - in particular, a CompressedCharacterList that gives you the list intuition with vector-like convenience.

sets_list <- list(
    set1 = c("A", "B"),
    set2 = c("B", "C", "D")
)

library(IRanges)
sets_list <- CharacterList(sets_list)
alt <- paste0("GENE_", unlist(sets_list)) # say you want to modify the characters...
new_sets_list <- relist(alt, sets_list)

Especially since most gene set analysis methods would accept lists of character vectors anyway.
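For reference, the two layouts discussed here carry the same membership information and can be converted into each other. The following is a minimal base R sketch, not code from unisets or IRanges; the column names `element` and `set` are assumptions borrowed from the unisets output shown later in this thread.

```r
# Gene sets in list form: one character vector of genes per set
sets_list <- list(
    set1 = c("A", "B"),
    set2 = c("B", "C", "D")
)

# List -> long format: one row per (gene, gene set) pair
sets_long <- data.frame(
    element = unlist(sets_list, use.names = FALSE),
    set = rep(names(sets_list), lengths(sets_list)),
    stringsAsFactors = FALSE
)

# Long format -> list: group the elements by the set they belong to
sets_back <- split(sets_long$element, sets_long$set)
```

The long table has one row per membership, so a gene that belongs to several sets (here "B") appears on several rows.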

kevinrue commented 5 years ago

I never took the time to reply to that point:

Having said that, I don't understand why you've set it up in "long" format, i.e., where each entry is a (gene, gene set) pair. It seems like it would be much more intuitive to store it as a list - in particular, a CompressedCharacterList that gives you the list intuition with vector-like convenience.

I haven't made a list of all the benefits of storing gene sets in long format, but to me the most appealing example is the ability to store metadata on elements, sets, and relationships between elements and sets. I'm not saying that it's not possible to build list-like objects that contain the same metadata (GSEABase set a standard there), but it was brought up multiple times that this type of list implementation suffers from poor performance beyond a few hundred sets (https://docs.google.com/document/d/1A3bs1rtbTo42Sgm9hPbLoG1lTGbQ-ITENaLRVyK2Njo/edit).
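To make the three levels of metadata concrete, here is a small base R sketch using plain data frames rather than the unisets classes. The table names, the `biotype` and `description` columns, and all values are hypothetical and only serve the illustration.

```r
# Relation-level table: one row per (element, set) pair, with its own metadata
relations <- data.frame(
    element = c("A", "B", "B", "C", "D"),
    set = c("set1", "set1", "set2", "set2", "set2"),
    evidence = c("TAS", "IEA", "TAS", "TAS", "IEA"),  # hypothetical values
    stringsAsFactors = FALSE
)

# Element-level and set-level metadata live in separate tables (hypothetical columns)
element_data <- data.frame(
    element = c("A", "B", "C", "D"),
    biotype = c("protein_coding", "protein_coding", "lncRNA", "protein_coding"),
    stringsAsFactors = FALSE
)
set_data <- data.frame(
    set = c("set1", "set2"),
    description = c("first toy set", "second toy set"),
    stringsAsFactors = FALSE
)

# Filter on relation metadata, then convert to a list only when needed
tas <- relations[relations$evidence == "TAS", ]
tas_list <- split(tas$element, tas$set)
```

Because the evidence code is attached to the (element, set) pair itself, filtering on it is an ordinary row subset, and the list form is derived afterwards.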

For illustration, here is what unisets can do at the moment, including conversion to a list with as.list(...).

suppressPackageStartupMessages({library(unisets)})

# Fetch a sample of GO annotations
suppressPackageStartupMessages({library(org.Hs.eg.db)})
go_sets <- import(org.Hs.egGO)
#> Loading required namespace: GO.db
#> 
#> 'select()' returned 1:1 mapping between keys and columns
#> Coercing evidence to factor
#> Coercing ontology to factor

subset(go_sets, evidence == "TAS")
#> GOSets with 43182 relations between 10676 elements and 3918 sets
#>       element         set evidence ontology
#>   <character> <character> <factor> <factor>
#> 1           1  GO:0002576      TAS       BP
#> 2           1  GO:0043312      TAS       BP
#> 3           2  GO:0002576      TAS       BP
#> 4           2  GO:0007597      TAS       BP
#> 5           2  GO:0022617      TAS       BP
#>           ...         ...      ...      ...
#> 1   100533105  GO:0017080      TAS       MF
#> 2   100533105  GO:0017081      TAS       MF
#> 3   102157402  GO:0050145      TAS       MF
#> 4   109703458  GO:0018812      TAS       MF
#> 5   110354863  GO:0003700      TAS       MF
#> 
#> @elementData
#> EntrezIdVector of length 10676 with 10676 unique identifiers
#> Ids: 1, 10, 100, 1000, ...
#> Metadata:  (0 columns)
#> 
#> @setData
#> GOIdVector of length 3918 with 3918 unique identifiers
#> Ids: GO:0002576, GO:0043312, GO:0007597, GO:0010951, ...
#> Metadata: GOID, DEFINITION, ONTOLOGY, TERM (4 columns)

as.list(subset(go_sets, evidence == "TAS"), 10)
#> List of length 3918
#> names(3918): GO:0000002 GO:0000018 GO:0000019 GO:0000022 GO:0000026 ... GO:2001244 GO:2001256 GO:2001257 GO:2001301

For completeness, here is a summary report on the current efforts toward table-based implementations: https://docs.google.com/document/d/1Lk6TLUuevidbLJvq36MFVY04GkvdrbF0ctGuB_BENaM/edit

LTLA commented 5 years ago

Well, I can't say I understand the use cases that led you to make that decision, but whatever.

Anyway, if you proceed with the current class, you would make your life a lot easier by deriving BaseSets as an IndexedRelations subclass. You would have two partners in each relation, with indices pointing to DataFrames of per-gene and per-set information - no need to write a separate IdVector class.
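The idea of two partners with indices pointing to per-gene and per-set tables can be sketched in plain base R. This is only an illustration of the storage layout under that assumption; it does not use the IndexedRelations API, and all object and column names are hypothetical.

```r
# Per-gene and per-set information, each stored once
genes <- data.frame(id = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
sets <- data.frame(id = c("set1", "set2"), stringsAsFactors = FALSE)

# Each relation is a pair of integer indices:
# a row of 'genes' and a row of 'sets'
relations <- data.frame(
    gene = c(1L, 2L, 2L, 3L, 4L),
    set = c(1L, 1L, 2L, 2L, 2L)
)

# Resolve the indices back to identifiers when needed
resolved <- data.frame(
    gene = genes$id[relations$gene],
    set = sets$id[relations$set],
    stringsAsFactors = FALSE
)
```

In this layout, per-gene and per-set annotations are extra columns of `genes` and `sets`, and the relations table stays a compact pair of integer columns.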

Once the S4Vectors PR is merged, you'll get sensible c, sort, match etc. behaviour for free.