waldronlab / bugsigdbr

R-side access to published microbial signatures from BugSigDB
https://bioconductor.org/packages/bugsigdbr
GNU General Public License v3.0

importBugSigDB(version = 'devel', cache = FALSE) not reading Study 727 #37

Closed sdgamboa closed 1 year ago

sdgamboa commented 1 year ago

I can't find Study 727 in the BugSigDB data imported with version = 'devel'. See the code below.

library(bugsigdbr)
# https://bugsigdb.org/Study_727
bsdb <- importBugSigDB(version = 'devel', cache = FALSE)
head(bsdb$Study)
#> [1] "Study 1" "Study 1" "Study 1" "Study 1" "Study 1" "Study 1"
which(bsdb$Study == 'Study 727')
#> integer(0)
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R Under development (unstable) (2022-12-25 r83502)
#>  os       Pop!_OS 22.04 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2023-04-18
#>  pandoc   2.19.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version date (UTC) lib source
#>  BiocFileCache   2.7.2   2023-02-17 [1] Bioconductor
#>  bit             4.0.5   2022-11-15 [2] CRAN (R 4.3.0)
#>  bit64           4.0.5   2020-08-30 [2] CRAN (R 4.3.0)
#>  blob            1.2.4   2023-03-17 [1] CRAN (R 4.3.0)
#>  bugsigdbr     * 1.5.8   2023-04-12 [1] Github (waldronlab/bugsigdbr@b956d97)
#>  cachem          1.0.7   2023-02-24 [1] CRAN (R 4.3.0)
#>  cli             3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
#>  crayon          1.5.2   2022-09-29 [2] CRAN (R 4.3.0)
#>  curl            5.0.0   2023-01-12 [2] CRAN (R 4.3.0)
#>  DBI             1.1.3   2022-06-18 [2] CRAN (R 4.3.0)
#>  dbplyr          2.3.2   2023-03-21 [1] CRAN (R 4.3.0)
#>  digest          0.6.31  2022-12-11 [2] CRAN (R 4.3.0)
#>  dplyr           1.1.1   2023-03-22 [1] CRAN (R 4.3.0)
#>  evaluate        0.20    2023-01-17 [2] CRAN (R 4.3.0)
#>  fansi           1.0.4   2023-01-22 [2] CRAN (R 4.3.0)
#>  fastmap         1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>  filelock        1.0.2   2018-10-05 [1] CRAN (R 4.3.0)
#>  fs              1.6.1   2023-02-06 [2] CRAN (R 4.3.0)
#>  generics        0.1.3   2022-07-05 [2] CRAN (R 4.3.0)
#>  glue            1.6.2   2022-02-24 [2] CRAN (R 4.3.0)
#>  htmltools       0.5.5   2023-03-23 [1] CRAN (R 4.3.0)
#>  httr            1.4.5   2023-02-24 [1] CRAN (R 4.3.0)
#>  knitr           1.42    2023-01-25 [2] CRAN (R 4.3.0)
#>  lifecycle       1.0.3   2022-10-07 [2] CRAN (R 4.3.0)
#>  magrittr        2.0.3   2022-03-30 [2] CRAN (R 4.3.0)
#>  memoise         2.0.1   2021-11-26 [2] CRAN (R 4.3.0)
#>  pillar          1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig       2.0.3   2019-09-22 [2] CRAN (R 4.3.0)
#>  purrr           1.0.1   2023-01-10 [1] CRAN (R 4.3.0)
#>  R.cache         0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3     1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo            1.25.0  2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils         2.12.2  2022-11-11 [1] CRAN (R 4.3.0)
#>  R6              2.5.1   2021-08-19 [2] CRAN (R 4.3.0)
#>  reprex          2.0.2   2022-08-17 [2] CRAN (R 4.3.0)
#>  rlang           1.1.0   2023-03-14 [1] CRAN (R 4.3.0)
#>  rmarkdown       2.21    2023-03-26 [1] CRAN (R 4.3.0)
#>  RSQLite         2.3.1   2023-04-03 [1] CRAN (R 4.3.0)
#>  rstudioapi      0.14    2022-08-22 [2] CRAN (R 4.3.0)
#>  sessioninfo     1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>  styler          1.9.1   2023-03-04 [1] CRAN (R 4.3.0)
#>  tibble          3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyselect      1.2.0   2022-10-10 [2] CRAN (R 4.3.0)
#>  tzdb            0.3.0   2022-03-28 [2] CRAN (R 4.3.0)
#>  utf8            1.2.3   2023-01-31 [2] CRAN (R 4.3.0)
#>  vctrs           0.6.1   2023-03-22 [1] CRAN (R 4.3.0)
#>  vroom           1.6.1   2023-01-22 [2] CRAN (R 4.3.0)
#>  withr           2.5.0   2022-03-03 [2] CRAN (R 4.3.0)
#>  xfun            0.38    2023-03-24 [1] CRAN (R 4.3.0)
#>  yaml            2.3.7   2023-01-23 [2] CRAN (R 4.3.0)
#> 
#>  [1] /home/samuel/R/x86_64-pc-linux-gnu-library/4.3
#>  [2] /home/samuel/Apps/R-devel/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Created on 2023-04-18 with reprex v2.0.2

lgeistlinger commented 1 year ago

This seems to be related to https://github.com/waldronlab/BugSigDB/issues/174. I don't see Study 727 in the latest export on BugSigDBExports, but I see that it is now included in the study export directly on BugSigDB. That means I'd expect it to be included in the export on BugSigDBExports on Sunday.

lwaldron commented 1 year ago

But bsdb <- importBugSigDB(version = 'devel', cache = FALSE) should be taking directly from bugsigdb.org, no? It looks like an error in the bugsigdbr parsing.

lwaldron commented 1 year ago

And I don't see the relationship to https://github.com/waldronlab/BugSigDB/issues/174. That issue is about Elastic Search indexing, whereas this does not seem to be related to any problem on bugsigdb.org. Also, this study was created more than two Sundays ago, so I think the error is propagating to the exports.

lgeistlinger commented 1 year ago

> But bsdb <- importBugSigDB(version = 'devel', cache = FALSE) should be taking directly from bugsigdb.org, no?

No, it imports from BugSigDBExports, which merges the study, experiment, and signature tables, filters incomplete records, adds signature IDs, etc.

> And I don't see the relationship to https://github.com/waldronlab/BugSigDB/issues/174

My impression is that Ike was running out of disk space and that these studies were thus not included in the export on BugSigDB.

lgeistlinger commented 1 year ago

You can actually run the dump_release.R script from BugSigDBExports manually, and you will see that "Study 727" is now included. It will thus also be included in the next export on Sunday, and will then be available to pull via bugsigdbr.

lwaldron commented 1 year ago

Ah, I didn't realize that devel pulled from BugSigDBExports. Then the reason it's still not appearing is that the GitHub Action has failed on its last three attempts: https://github.com/waldronlab/BugSigDBExports/actions

lwaldron commented 1 year ago

I assigned this to you @lgeistlinger because it looks like a file parsing error in dump_release.R

lgeistlinger commented 1 year ago

Here it would be good if @jwokaty monitored such repeated failures in the BugSigDBExports GHA (she receives an email about failed runs, I believe), forwarded the information about repeated failures, and took action where possible. For the current situation, I don't believe there is anything else to do than to wait for Sunday, as running the script manually works fine, and the hiccup seems to have been caused by a temporary ill-formatted / incomplete export on bugsigdb.org.

lwaldron commented 1 year ago

I just manually triggered a re-run of the latest GHA job (see here) and it is still failing. I also tried running the script locally (Rscript BugSigDBExports/inst/scripts/dump_release.R $(date +'%F') BugSigDBExports) and it errors for me there as well:

Error in strsplit(bsdb[["MetaPhlAn taxon names"]], ",") : 
  non-character argument
Execution halted

Then I tried stepping through dump_release.R and found the problem - all signatures are marked as Incomplete and thus being removed:

Browse[2]> table(sigs$State)

Incomplete 
      2848 

I'm going to temporarily get rid of the completeness requirement for signatures because we have a Master's student needing to access recent data for her analysis, then open an issue on the bugsigdb repo. It doesn't seem like any change is needed in this repo other than adding some messages / warnings / errors to make something like this easier to diagnose.
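
The kind of message / error mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: filterComplete() is a hypothetical helper that does not exist in dump_release.R or bugsigdbr, and the toy data frame stands in for the real signature table. The only value taken from the thread is the "Incomplete" entry in the State column.

```r
# Hypothetical helper (assumption): not part of dump_release.R or bugsigdbr.
# Drops signatures marked "Incomplete", but fails loudly when the filter
# would remove every row, which points to a problem in the upstream export.
filterComplete <- function(sigs, state.col = "State")
{
    if (!state.col %in% colnames(sigs))
        stop("Column '", state.col, "' is missing from the signature table")

    # %in% treats NA as "not incomplete", so rows with a missing state are kept
    is.incomplete <- sigs[[state.col]] %in% "Incomplete"

    if (nrow(sigs) > 0 && all(is.incomplete))
        stop("All ", nrow(sigs), " signatures are marked as Incomplete; ",
             "check the export on bugsigdb.org")

    if (any(is.incomplete))
        message("Removing ", sum(is.incomplete), " incomplete signature(s)")

    sigs[!is.incomplete, , drop = FALSE]
}

# Toy data standing in for the real signature table
sigs <- data.frame(Study = c("Study 1", "Study 727"),
                   State = c("Complete", "Incomplete"))
kept <- filterComplete(sigs)

# Same table with every record marked Incomplete, as in the failing export
broken <- transform(sigs, State = "Incomplete")
res <- tryCatch(filterComplete(broken),
                error = function(e) conditionMessage(e))
```

With a check like this, the run would stop at the filtering step with a descriptive message instead of failing later in strsplit() on an empty table.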

lwaldron commented 1 year ago

After ignoring Incomplete signatures, I still see another error, this time from bugsigdbr::getSignatures() - this now seems like your domain @lgeistlinger :

https://github.com/waldronlab/BugSigDBExports/blob/f9c4f2961cca0fcf0c8fb0d74875ad0ec026cb14/inst/scripts/dump_release.R#L206

  else if (!all(tax.level %in% TAX.LEVELS)) 
    stop("tax.level must be a subset of { ", paste(TAX.LEVELS, 
      collapse = ", "), " }")
Browse[2]>     stop("tax.level must be a subset of { ", paste(TAX.LEVELS, 
+       collapse = ", "), " }")
Error during wrapup: tax.level must be a subset of { kingdom, phylum, class, order, family, genus, species, strain }
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
Browse[2]> tax.level
[1] "mixed"
Browse[2]> TAX.LEVELS
[1] "kingdom" "phylum"  "class"   "order"   "family"  "genus"   "species" "strain"

lwaldron commented 1 year ago

Sorry, scratch that last post, it was a debugging error. The actual error in this loop is (still trying to find a fix):

Error in vapply(spl, function(s) s[length(s)], character(1)) : 
  values must be length 1,
 but FUN(X[[1]]) result is length 0
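
This is the message vapply() gives when the function returns a zero-length result for some element. A minimal reproduction is sketched below. It assumes that spl holds lineages split on "|" and that one input string is empty; the real contents of spl are not shown in this thread, and last.or.na() is a hypothetical helper.

```r
# Assumption: `spl` stands in for the list of split lineages in the loop.
# Splitting an empty string yields character(0) for that element.
spl <- strsplit(c("2|1239|91061", ""), "|", fixed = TRUE)

# Indexing an empty vector by its length (0) returns character(0), which
# does not match the character(1) template, so vapply() stops.
msg <- tryCatch(
    vapply(spl, function(s) s[length(s)], character(1)),
    error = function(e) conditionMessage(e)
)

# A defensive variant (hypothetical) returns NA for empty elements instead
last.or.na <- function(s) if (length(s)) s[length(s)] else NA_character_
tips <- vapply(spl, last.or.na, character(1))
```

Whether returning NA is appropriate here, or whether empty taxon strings should be filtered out before this step, depends on how the downstream code treats missing tips.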

lwaldron commented 1 year ago

The error seems to occur inside bugsigdbr::.extractTaxLevel() (values from debug within function):

function (bug, tax.level) 
{
  if (is.na(bug)) 
    return(bug)
  tip <- .getTip(bug)
  tl <- substring(tip, 1, 1)
  ind1 <- match(tl, MPA.TAX.LEVELS)
  ind2 <- match(tax.level, names(MPA.TAX.LEVELS))
  if (ind1 > ind2) {
    bug <- unlist(strsplit(bug, "\\|"))
    bug <- paste(bug[seq_len(ind2)], collapse = "|")
  }
  return(bug)
}

Browse[9]> bug
[1] "2|1239|91061"
Browse[9]> tax.level
[1] "mixed"
Browse[9]>   if (is.na(bug)) 
+     return(bug)
Browse[9]>   tip <- .getTip(bug)
Browse[9]> tip
[1] "91061"
Browse[9]>   tl <- substring(tip, 1, 1)
Browse[9]> tl
[1] "9"
Browse[9]>   ind1 <- match(tl, MPA.TAX.LEVELS)
Browse[9]>   ind2 <- match(tax.level, names(MPA.TAX.LEVELS))
Browse[9]> ind1
[1] NA
Browse[9]> ind2
[1] NA
Browse[9]>   if (ind1 > ind2) {
+     bug <- unlist(strsplit(bug, "\\|"))
+     bug <- paste(bug[seq_len(ind2)], collapse = "|")
+   }
Error during wrapup: missing value where TRUE/FALSE needed
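
In the trace above both match() calls return NA: "9" is not one of the MPA.TAX.LEVELS codes and "mixed" is not one of its names, so if (ind1 > ind2) is asked to evaluate NA. The sketch below reproduces that and shows one possible guard. The definition of MPA.TAX.LEVELS used here is a placeholder with an assumed shape (a named vector of one-letter codes); the real object in bugsigdbr is not shown in this thread.

```r
# Placeholder stand-in for bugsigdbr's MPA.TAX.LEVELS (assumed shape only)
MPA.TAX.LEVELS <- c(kingdom = "k", phylum = "p", class = "c", order = "o",
                    family = "f", genus = "g", species = "s", strain = "t")

tip <- "91061"        # tip of the lineage "2|1239|91061" from the debug session
tax.level <- "mixed"  # value seen in the debug session

ind1 <- match(substring(tip, 1, 1), MPA.TAX.LEVELS)   # "9" is not a level code
ind2 <- match(tax.level, names(MPA.TAX.LEVELS))       # "mixed" is not a level name

# NA > NA is NA, and if() refuses an NA condition
msg <- tryCatch(if (ind1 > ind2) "truncate" else "keep",
                error = function(e) conditionMessage(e))

# One possible guard: only compare when both lookups succeeded
needs.truncation <- !is.na(ind1) && !is.na(ind2) && ind1 > ind2
```

With the guard, the lineage would be returned unchanged in this case. Whether that is the right behaviour for ID-based lineages and tax.level = "mixed" is a separate design question.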

lgeistlinger commented 1 year ago

> Then I tried stepping through dump_release.R and found the problem - all signatures are marked as Incomplete and thus being removed

Thanks for tracing this and reporting this to Ike. It looks like I only ran the script up to where the full dump is written in line 179 and was happy with seeing Study 727 included and didn't notice things breaking a couple lines further down.

> After ignoring Incomplete signatures, I still see another error, this time from bugsigdbr::getSignatures() - this now seems like your domain @lgeistlinger

With incomplete records included, there is potential for all kinds of funny things to happen downstream. My preference here would be for Ike to restore the State column in the export, with incomplete records properly marked so that we can exclude them on our side. If the problem persists on complete records, I'd be happy to take a closer look. Otherwise we are cooking up a solution for dirty data, which should actually be checked for and filtered out upstream.

lgeistlinger commented 1 year ago

It looks like the issue persists now that Ike has restored the State column for signatures. I'll be looking into that.

lgeistlinger commented 1 year ago

@sdgamboa @lwaldron

> library(bugsigdbr)
> df <- importBugSigDB(version = "devel", cache = FALSE)
> "Study 727" %in% df$Study
[1] TRUE

lwaldron commented 1 year ago

Yeah!