stemangiola / tidybulk

Brings bulk and pseudobulk transcriptomics to the tidyverse
163 stars 23 forks source link

test_gene_enrichment with tidySummarizedExperiment input #174

Closed mblue9 closed 3 years ago

mblue9 commented 3 years ago

Do we need to make some change to have test_gene_enrichment work with tidySummarizedExperiment input or should we just select the columns needed and input that way?

As was testing with the airway workshop input (using dev branch tidybulk version) and getting this error below.

options(connectionObserver = NULL) - # need to add this temporarily due to an [issue with rstudio](https://github.com/rstudio/rstudio/issues/9219) 
library(airway)
library(stringr)
library(EGSEA)
library(tidybulk)
library(tidySummarizedExperiment)

data(airway)

counts_scaled <- airway %>%
  keep_abundant(factor_of_interest = dex) %>%
  scale_abundance() 

# we need entrez ids for egsea
counts_annot <- counts_scaled %>% 
    mutate(entrez = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
    keys = feature,
    keytype = "ENSEMBL",
    column = "ENTREZID",
    multiVals = "first"))

test_gene_enrichment(counts_annot, .entrez=entrez, .abundance=counts_scaled, .formula=~ dex + cell, species="human", gene_collections = "h")
tidybulk says: The design column names are "dextrt, dexuntrt, cellN061011, cellN080611, cellN61311"
Error: object 'entrez' not found

Works if select the columns to input e.g.

# select columns and remove any duplicate entrez ids
counts_format <- counts_annot %>% 
    select(sample, entrez, counts_scaled, dex, cell) %>%
    aggregate_duplicates(.sample=sample, .transcript=entrez, .abundance=counts_scaled)

# run egsea
counts_format %>%
    test_gene_enrichment(.sample=sample, .entrez=entrez, .abundance=counts_scaled, .formula=~ 0 + dex + cell, .contrasts = c("dextrt - dexuntrt"),  species="human", gene_collections = "h")
stemangiola commented 3 years ago

I have updated dev branch. Please try again. I noticed the gene_collections is still set to "all". At least for SE I set to a character vector, so we can consolidate the SE and tibble streams.

mblue9 commented 3 years ago

I noticed the gene_collections is still set to "all". At least for SE I set to a character vector, so we can consolidate the SE and tibble streams.

I kept the default the same as what it was ie using all collections but yes you could change that if you like.

I have updated dev branch. Please try again.

Have just tested with current dev branch and I get the same error as above.

stemangiola commented 3 years ago

I kept the default the same as what it was ie using all collections but yes you could change that if you like.

Ah OK, I thought we both lined at the end the lower case gene collection list. I can't find the issue/pull from the history where we discussed this :P

Have just tested with current dev branch and I get the same error as above.

Oki sorry let me try today

mblue9 commented 3 years ago

Ah OK, I thought we both lined at the end the lower case gene collection list. I can't find the issue/pull from the history where we discussed this :P

We discussed it here https://github.com/stemangiola/tidybulk/pull/170#discussion_r612839867 You can specify a list of collections in lower case ("c2", "kegg_disease" etc) I just left "all" as a shortcut for all collections but I could remove if it's confusing?

stemangiola commented 3 years ago

Ah OK, I thought we both lined at the end the lower case gene collection list. I can't find the issue/pull from the history where we discussed this :P

We discussed it here #170 (comment) You can specify a list of collections in lower case ("c2", "kegg_disease" etc) I just left "all" as a shortcut for all collections but I could remove if it's confusing?

Ah ok. My point was that listing all gene collections was better than put "all". But also maybe I am not seeing the disadvantages.

mblue9 commented 3 years ago

I see I missed that, I thought you were just talking about adding the kegg prefix. I think you're right, better to let them specify the sets so they can see what they're using so I've made a PR to change it here https://github.com/stemangiola/tidybulk/pull/176

mblue9 commented 3 years ago

For the original issue here this assays() line seems to be where the issue occurs https://github.com/stemangiola/tidybulk/blob/dev/R/methods_SE.R#L1349 as it's extracting the counts with Ensembl ids instead of the Entrez ones so then entrez column is not found. Not sure how best to fix that.

stemangiola commented 3 years ago

For the original issue here this assays() line seems to be where the issue occurs https://github.com/stemangiola/tidybulk/blob/dev/R/methods_SE.R#L1349 as it's extracting the counts with Ensembl ids instead of the Entrez ones so then entrez column is not found. Not sure how best to fix that.

Looking at it.

stemangiola commented 3 years ago

Sorry I might not have update dev. I have a problem pushing from when I activated 2-factor authentication.

With dev branch https://github.com/stemangiola/tidybulk/commit/f08e6053ead6390fd39fd23690bff435ee216c0e I get

options(connectionObserver = NULL) - # need to add this temporarily due to an [issue with rstudio](https://github.com/rstudio/rstudio/issues/9219) 
library(airway)
library(stringr)
library(EGSEA)
library(tidybulk)
library(tidySummarizedExperiment)

data(airway)

airway %>%
    keep_abundant(factor_of_interest = dex) %>%
    scale_abundance() %>%

    # we need entrez ids for egsea
    mutate(entrez = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                                                                                keys = feature,
                                                                                keytype = "ENSEMBL",
                                                                                column = "ENTREZID",
                                                                                multiVals = "first")) %>%

    aggregate_duplicates(.transcript = entrez) %>%
    test_gene_enrichment(
        .entrez=entrez,
        .formula=~ dex + cell, 
        species="human", 
        gene_collections = "h"
    )
'select()' returned 1:many mapping between keys and columns
Converted to characters
factorfactorlogicalfactorfactorfactorfactorfactorfactorfactorfactorfactor
tidybulk says: The design column names are "(Intercept), dexuntrt, cellN061011, cellN080611, cellN61311"
[1] "Loading MSigDB Gene Sets ... "
[1] "Loaded gene sets for the collection h ..."
[1] "Indexed the collection h ..."
[1] "Created annotation for the collection h ..."
[1] "Building KEGG pathways annotation object ... "
tidybulk says: due to a bug in the call to KEGG database (http://supportupgrade.bioconductor.org/p/122172/#122218), the analysis for this database is run without report production.
EGSEA analysis has started
##------ Thu Apr 29 11:02:44 2021 ------##
The argument 'contrast' is recommended to be a matrix object.
See Vignette or Help.
Log fold changes are estimated using limma package ... 
limma DE analysis is carried out ... 
EGSEA is running on the provided data and h collection

##------ Thu Apr 29 11:02:59 2021 ------##
EGSEA analysis took 14.724 seconds.
EGSEA analysis has completed
EGSEA HTML report is being generated ...
##------ Thu Apr 29 11:02:59 2021 ------##
Report pages and figures are being generated for the h collection ...
   Heat maps are being generated for top-ranked gene sets 
based on logFC ... 
   Summary plots are being generated ... 
   Comparison summary plots are being generated  ...
##------ Thu Apr 29 11:06:59 2021 ------##
EGSEA report generation took 240.166 seconds.
EGSEA report has been generated.
# A tibble: 50 x 21
   data_base pathway web_page med.rank  p.value    p.adj vote.rank avg.rank min.pvalue min.rank avg.logfc
   <chr>     <chr>   <chr>       <dbl>    <dbl>    <dbl>     <dbl>    <dbl>      <dbl>    <dbl>     <dbl>
 1 h         HALLMA… https:/…        3 3.80e-14 1.90e-12         5     4.57   6.33e-15        1     0.958
 2 h         HALLMA… https:/…        6 2.63e- 7 4.38e- 6        10     8.86   4.38e- 8        1     0.947
 3 h         HALLMA… https:/…       10 2.36e- 3 1.24e- 2         5    15.6    3.94e- 4        2     0.766
 4 h         HALLMA… https:/…       10 3.46e- 3 1.57e- 2        10    15.1    5.77e- 4        2     0.924
 5 h         HALLMA… https:/…       11 2.49e- 2 5.67e- 2        15     9.86   4.20e- 3        2     0.887
 6 h         HALLMA… https:/…       12 7.86e- 3 2.81e- 2        15    18.7    1.31e- 3        4     0.788
 7 h         HALLMA… https:/…       15 1.82e- 2 4.78e- 2        15    18.6    3.05e- 3       11     0.865
 8 h         HALLMA… https:/…       16 2.51e- 1 3.48e- 1        15    20.1    4.70e- 2        4     0.618
 9 h         HALLMA… https:/…       16 6.33e- 3 2.43e- 2        35    20.3    1.06e- 3       11     0.890
10 h         HALLMA… https:/…       16 5.40e- 4 3.86e- 3        20    18.4    9.00e- 5        3     0.829
# … with 40 more rows, and 10 more variables: avg.logfc.dir <dbl>, direction <dbl>, significance <dbl>,
#   camera <dbl>, roast <dbl>, safe <dbl>, gage <dbl>, padog <dbl>, globaltest <dbl>, ora <dbl>
Warning messages:
1: In aggregate_duplicated_transcripts_bulk(.data, .sample = !!.sample,  :
  tidybulk says: for aggregation, factors and logical columns were converted to character
2: In eval(dots[[i]][[action]], env, env) :
  tidybulk says: There are NA entrez IDs. Those genes will be filtered
mblue9 commented 3 years ago

Great, can confirm with updated dev branch it now works for me 👍 Could close this.