hms-dbmi / scde

R package for analyzing single-cell RNA-seq data
http://pklab.med.harvard.edu/scde

Correct construction of go.env #9

Closed jenzopr closed 9 years ago

jenzopr commented 9 years ago

Hi all, I'm trying to use the biomaRt package together with GO.db to construct a proper go.env environment for evaluating overdispersed gene sets (as in http://hms-dbmi.github.io/scde/pagoda.html). Can someone clarify what the structure of go.env is before it goes into the list2env function in the example code on that page? I could not reproduce the code given there; a head() of go.env would probably be enough for me. Thanks a lot, Jens

JEFworks commented 9 years ago

Hi Jens,

Before it goes into the list2env function, go.env is a list of lists: the outer list is named by GO term, and each element holds the gene names associated with that term. These gene names correspond to the row names of your counts matrix. list2env then just turns this list into an R environment, which was convenient for us for toggling between different gene sets during development.
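As a minimal sketch of the structure described above (not the package's own construction code), the following builds a small named list and converts it with list2env. The first set reuses a few genes from the real go.env; the second set's genes (GENE_A, GENE_B) are made up for illustration, and the variable name go.list is an assumption.

```r
# Named list: one element per gene set, each holding a character vector of genes.
# The names follow the "GO:XYZ description" pattern used by go.env.
go.list <- list(
  "GO:0000002 mitochondrial genome maintenance" = c("SLC25A4", "DNA2", "TYMP"),
  "GO:0000003 reproduction" = c("GENE_A", "GENE_B")  # made-up gene names
)

# Turn the list into an environment with one variable per gene set.
go.env <- list2env(go.list)

# Browse it the same way as the packaged go.env.
ls(go.env)
get("GO:0000003 reproduction", envir = go.env)
```

Because the set names contain spaces and colons, they have to be looked up with get() (or go.env[["..."]]) rather than with the $ shortcut on a bare name.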

If you want to browse the go.env object in its environment state, you will need to do the following:

> data(go.env)
> gos <- ls(go.env)  # list the gene sets
> head(gos)
[1] "GO:0000002 mitochondrial genome maintenance"                   "GO:0000003 reproduction"                                      
[3] "GO:0000012 single strand break repair"                         "GO:0000014 single-stranded DNA endodeoxyribonuclease activity"
[5] "GO:0000018 regulation of DNA recombination"                    "GO:0000028 ribosomal small subunit assembly"                  
> get(gos[1], go.env)  # get the genes in the first gene set
 [1] "SLC25A4"  "DNA2"     "TYMP"     "LIG3"     "MEF2A"    "MPV17"    "MDP1"     "DNAJA3"   "LONP1"    "LONP1"    "AKT3"     "PPARGC1A"
[13] "STOML2"   "RRM2B"    "PID1"     "C10orf2"  "C10orf2"  "PIF1"     "SESN2"    "MGME1"    "MGME1"    "CCDC111"  "RNASEH1" 

Here is also a tutorial on how to go from a gmt file to one of these environments, along with some other common gene sets in the same format as go.env: https://github.com/JEFworks/genesets
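For orientation, here is a rough sketch of a gmt-to-environment conversion in base R. It is not taken from the linked tutorial and may differ from it. It assumes the usual GMT layout (tab-separated lines: set name, description, then the genes); the helper name read_gmt and the example file contents are hypothetical.

```r
# Hypothetical helper: read a GMT file into a named list of character vectors.
read_gmt <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]                     # skip empty lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  # Fields 1 and 2 are the set name and description; the rest are genes.
  sets <- lapply(fields, function(f) unique(f[-(1:2)]))
  names(sets) <- vapply(fields, function(f) f[1], character(1))
  sets
}

# Write a tiny made-up GMT file so the sketch runs on its own.
gmt.path <- tempfile(fileext = ".gmt")
writeLines(c(
  "SET_ONE\tfirst example set\tGENE_A\tGENE_B\tGENE_C",
  "SET_TWO\tsecond example set\tGENE_B\tGENE_D"
), gmt.path)

gmt.list <- read_gmt(gmt.path)
gmt.env <- list2env(gmt.list)   # same kind of object as go.env
ls(gmt.env)
```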

Hope that helps! Let me know if you need anything else.

Best, Jean

pkharchenko commented 9 years ago

Hi Jens, Hopefully Jean’s reply clarifies how the environment is constructed. Can you please let us know what errors you encountered when trying to reproduce the current tutorial code, so that we can fix them? Thanks, -peter.

On Aug 6, 2015, at 4:43 AM, Jens Preußner notifications@github.com wrote:

Hi all, I'm trying to use the biomaRt-package together with GO.db to construct a proper go.env environment for evaluation of overdispersed gene sets (like in http://hms-dbmi.github.io/scde/pagoda.html). Can someone clarify what the structure of go.env is, before it goes into the list2env function in the example code on the web site mentioned above? I failed to reproduce the code given there and a head() on go.env would be enough for me I guess. Thanks a lot, Jens

jenzopr commented 9 years ago

Great! Your comments helped me a lot. Thanks, Jean, for pointing me to the gene sets you created from MSigDB. This is really great! So, if I got it right, two things hold: 1) If, for example, the row names are Ensembl IDs, go.env needs to contain those Ensembl IDs rather than gene symbols. 2) The names of the list items are just for identification; they could also be something else.

Peter, I was able to run the current tutorial code, but I wanted to use the biomaRt library so as not to depend on org.Hs.eg.db. If you're interested, I can create a pull request against the gh-pages branch with the alternative code. Just let me know :)

JEFworks commented 9 years ago

Hi Jens,

1) Yes, if the row names are Ensembl IDs, then go.env should contain lists of lists of Ensembl IDs. We use go.env to grab the relevant rows from your count matrix, so they need to match.
2) Yes, the names in the first list are just identifiers. They are later used in the PAGODA app, so it helps if they're descriptive for browsing purposes (instead of just 'GO:XYZ', I made them 'GO:XYZ description').
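The two points can be illustrated with a small base R sketch. It assumes you already have a long annotation table with one row per gene/GO term pair, for example from an annotation query. The table, its column names (gene_id, go_id, go_name), the identifiers, and the counts matrix are all made up here and are not the output format of any particular package.

```r
# Hypothetical counts matrix whose row names are the gene identifiers in use.
counts <- matrix(0L, nrow = 3, ncol = 2,
                 dimnames = list(c("ENSG_A", "ENSG_B", "ENSG_C"),
                                 c("cell1", "cell2")))

# Hypothetical annotation table: one row per (gene, GO term) pair.
annot <- data.frame(
  gene_id = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_X"),
  go_id   = c("GO:0000002", "GO:0000002", "GO:0000003", "GO:0000003"),
  go_name = c("mitochondrial genome maintenance",
              "mitochondrial genome maintenance",
              "reproduction", "reproduction"),
  stringsAsFactors = FALSE
)

# Point 1: keep only identifiers that actually occur as row names of the matrix.
annot <- annot[annot$gene_id %in% rownames(counts), ]

# Point 2: descriptive set names of the form "GO:XYZ description".
set.name <- paste(annot$go_id, annot$go_name)

# Group gene identifiers by set name, then convert to an environment.
go.list <- lapply(split(annot$gene_id, set.name), unique)
go.env <- list2env(go.list)
ls(go.env)
```

Here ENSG_X is dropped because it has no row in the matrix, so every remaining gene set can be used directly to index rows of counts.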

Yes, please do make a pull request! I'd be happy to integrate your code to improve the tutorials. Thanks!

Best, Jean