waldronlab / bugsigdbr

R-side access to published microbial signatures from BugSigDB
https://bioconductor.org/packages/bugsigdbr
GNU General Public License v3.0

subsetByOntology: include query term itself #44

Closed cmirzayi closed 1 year ago

cmirzayi commented 1 year ago

It appears that subsetByOntology() does not return terms with the MONDO prefix, even though these terms are present in the EFO. The EFO includes many MONDO-prefixed terms, such as MONDO:0004985 (bipolar disorder), which is distinct from the EFO terms "bipolar I disorder" and "bipolar II disorder" because it encompasses both. BugSigDB curation accepts MONDO terms.

Reproducible example:

dat <- bugsigdbr::importBugSigDB()
efo <- bugsigdbr::getOntology("efo")

Subsetting by bipolar disorder returns an empty object, even though the term is present in both dat and efo:

> dat[dat$Condition=="bipolar disorder",] |> dim()
[1] 52 48
> efo$name[efo$id=="MONDO:0004985"]
     MONDO:0004985 
"bipolar disorder" 
> dat.bpd <- bugsigdbr::subsetByOntology(dat, column = "Condition", "bipolar disorder", efo)
> dim(dat.bpd)
[1]  0 48

This behavior is not observed when subsetting by a similar condition, unipolar depression, which is an EFO-prefixed term:

> dat[dat$Condition=="unipolar depression",] |> dim()
[1] 57 48
> efo$name[efo$id=="EFO:0003761"]
          EFO:0003761 
"unipolar depression" 
> dat.upd <- bugsigdbr::subsetByOntology(dat, column = "Condition", "unipolar depression", efo)
> dim(dat.upd)
[1]  23 48
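A side note on the row counts: the `[` subset above reports 52 rows for "bipolar disorder", while a `subset()` call on the same condition later in this thread reports 18. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that logical indexing with `[` keeps an all-NA row for every NA in the Condition column, whereas `subset()` drops them. A minimal base R sketch on a made-up data frame (the `toy` object is purely illustrative):

```r
# Hypothetical toy data; not taken from BugSigDB.
toy <- data.frame(
  Condition = c("bipolar disorder", NA, "obesity", NA, "bipolar disorder"),
  stringsAsFactors = FALSE
)

# `[` returns one all-NA row for every NA in the logical index.
n_bracket <- nrow(toy[toy$Condition == "bipolar disorder", , drop = FALSE])

# subset() treats NA in the condition as FALSE and drops those rows.
n_subset <- nrow(subset(toy, Condition == "bipolar disorder"))

# which() gives the same NA-safe behavior with `[`.
n_which <- nrow(toy[which(toy$Condition == "bipolar disorder"), , drop = FALSE])

print(c(bracket = n_bracket, subset = n_subset, which = n_which))
# bracket = 4, subset = 2, which = 2
```

If this is what happens here, counting with `sum(dat$Condition == "bipolar disorder", na.rm = TRUE)` would give the number of actual matches.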

Session Info

R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.5.0       bugsigdbr_1.7.2     dplyr_1.1.2         bugSigSimple_0.99.5

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0    xfun_0.39           purrr_1.0.1         colorspace_2.1-0    vctrs_0.6.2        
 [6] generics_0.1.3      htmltools_0.5.5     viridisLite_0.4.2   BiocFileCache_2.4.0 yaml_2.3.7         
[11] utf8_1.2.3          blob_1.2.4          rlang_1.1.1         pillar_1.9.0        withr_2.5.0        
[16] glue_1.6.2          DBI_1.1.3           rappdirs_0.3.3      bit64_4.0.5         dbplyr_2.3.2       
[21] lifecycle_1.0.3     munsell_0.5.0       gtable_0.3.3        rvest_1.0.3         kableExtra_1.3.4   
[26] evaluate_0.21       memoise_2.0.1       knitr_1.42          fastmap_1.1.1       curl_5.0.0         
[31] fansi_1.0.4         ontologyIndex_2.10  Rcpp_1.0.10         scales_1.2.1        filelock_1.0.2     
[36] cachem_1.0.8        webshot_0.5.4       systemfonts_1.0.4   bit_4.0.5           ggplot2_3.4.2      
[41] digest_0.6.31       stringi_1.7.12      grid_4.2.1          cli_3.4.1           tools_4.2.1        
[46] magrittr_2.0.3      tibble_3.2.1        RSQLite_2.3.1       tidyr_1.3.0         pkgconfig_2.0.3    
[51] xml2_1.3.4          rmarkdown_2.21      svglite_2.1.1       httr_1.4.6          rstudioapi_0.14    
 [56] R6_2.5.1            compiler_4.2.1     

lgeistlinger commented 1 year ago

Hi @cmirzayi - sorry for the delay.

This might be a misunderstanding of what the function is supposed to do. From the specification in the vignette:

"More specifically, subsetting BugSigDB signatures by an EFO term then involves subsetting the Condition column to all descendants of that term in the EFO ontology and that are present in the Condition column"

Now what are the descendants of bipolar disorder in the EFO ontology?

> efo$children[["MONDO:0004985"]]
[1] "EFO:0009963" "EFO:0009964"
> efo$name[c("EFO:0009963","EFO:0009964")]
          EFO:0009963           EFO:0009964 
 "bipolar I disorder" "bipolar II disorder"

Are any of the descendants present in the Condition column?

> c("bipolar I disorder", "bipolar II disorder") %in% dat$Condition
[1] FALSE FALSE

So the function behaves as specified. One could argue that subsetting should also include the signatures associated with the query term itself, but in this case it is more straightforward to subset directly:

> dat.bpd <- subset(dat, Condition == "bipolar disorder")
> dim(dat.bpd)
[1] 18 48
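To make the difference between "descendants only" and "term plus descendants" concrete, here is a minimal base R sketch on a toy ontology. It is not the bugsigdbr implementation: the `descendants()` and `subset_by_term()` helpers and the `include_self` argument are hypothetical names used only for illustration, and the `children` list merely mimics the shape of `efo$children` shown above.

```r
# Toy ontology: each term ID maps to its direct children (illustrative only).
children <- list(
  "MONDO:0004985" = c("EFO:0009963", "EFO:0009964"),
  "EFO:0009963"   = character(0),
  "EFO:0009964"   = character(0)
)
term_names <- c(
  "MONDO:0004985" = "bipolar disorder",
  "EFO:0009963"   = "bipolar I disorder",
  "EFO:0009964"   = "bipolar II disorder"
)

# Recursively collect all descendants of a term ID.
descendants <- function(id, children) {
  kids <- children[[id]]
  if (length(kids) == 0) return(character(0))
  unique(c(kids, unlist(lapply(kids, descendants, children = children))))
}

# Hypothetical helper: keep rows whose `column` value is a descendant of
# `term`, optionally including the query term itself.
subset_by_term <- function(df, column, term, include_self = TRUE) {
  id <- names(term_names)[term_names == term]
  ids <- descendants(id, children)
  if (include_self) ids <- c(id, ids)
  keep <- df[[column]] %in% term_names[ids]
  df[keep, , drop = FALSE]
}

# Made-up data: only the parent term occurs, as in the case discussed here.
toy <- data.frame(
  Condition = c("bipolar disorder", "bipolar disorder", "obesity", NA),
  stringsAsFactors = FALSE
)

n_desc_only <- nrow(subset_by_term(toy, "Condition", "bipolar disorder",
                                   include_self = FALSE))  # 0
n_with_self <- nrow(subset_by_term(toy, "Condition", "bipolar disorder",
                                   include_self = TRUE))   # 2
```

With descendants only, the result is empty because neither child term occurs in the toy Condition column; including the query term recovers the two matching rows.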

lwaldron commented 1 year ago

I would think the most common use case for subsetByOntology is to select an ontology term together with all of its descendants. How is that use case supported? I also find it unintuitive that subsetByOntology returns only descendants, without even an argument to include the term itself.

lgeistlinger commented 1 year ago

Yes, it would make sense to add that option.

lgeistlinger commented 1 year ago

As of bugsigdbr v1.7.3, the query now includes the term itself. For the moment, this is available from GitHub and Bioconductor devel only.

> library(bugsigdbr)
> dat <- bugsigdbr::importBugSigDB()
Using cached version from 2023-04-26 22:22:42
> efo <- bugsigdbr::getOntology("efo")
Loading required namespace: ontologyIndex
Using cached version from 2022-09-06 23:05:35
> dat.bpd <- bugsigdbr::subsetByOntology(dat, column = "Condition", "bipolar disorder", efo)
> dim(dat.bpd)
[1] 18 48
> table(dat.bpd$Condition)

bipolar disorder 
              18 

lwaldron commented 1 year ago

Nice!!

cmirzayi commented 1 year ago

Awesome thank you Ludwig!