waldronlab / bugsigdbr

R-side access to published microbial signatures from BugSigDB
https://bioconductor.org/packages/bugsigdbr
GNU General Public License v3.0

Release 3.18 #49

Closed lwaldron closed 7 months ago

lwaldron commented 11 months ago

Note there's currently an error in the export: ERROR: dependency ‘BiocFileCache’ is not available for package ‘bugsigdbr’. I assume this will resolve itself soon (https://github.com/waldronlab/BugSigDBExports/actions/runs/6520159139)

lgeistlinger commented 11 months ago

While most of the checks now take place either directly on the curation forms on bugsigdb.org or as part of the programmatic checks of the export script on BugSigDBExports, I would typically open the full_dump.csv in an Excel sheet and check that there are no obviously malformed entries of any kind.

In particular, I would check whether the key fields "Body site" and "Condition" are set appropriately and whether they have been mapped properly to their corresponding UBERON and EFO IDs. That means any NAs in those fields would need some attention and sanity checking.

For example, somebody has entered the condition of https://bugsigdb.org/Study_696 as "Nutraceuticals", and as a result we see NAs in the EFO column of the full dump. This needs to be cleaned up directly on bugsigdb.org and entered as "nutraceutical" in order to be mapped to the corresponding EFO term.
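As a rough sketch of the NA check described above, the following base R snippet flags rows where a key field or its ontology ID is missing. The toy data frame stands in for full_dump.csv; the column names `UBERON ID` and `EFO ID` and the helper `flag_missing` are assumptions made for illustration, not taken from the export script.

```r
# Toy stand-in for full_dump.csv (illustrative rows, not real BugSigDB records)
dump <- data.frame(
  Study = c("Study A", "Study B", "Study C"),
  `Body site` = c("Feces", NA, "Feces"),
  `UBERON ID` = c("UBERON:0001988", NA, "UBERON:0001988"),
  Condition = c("Obesity", "Obesity", "Nutraceuticals"),
  `EFO ID` = c("EFO:0001073", "EFO:0001073", NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: list each study together with the key columns that are NA
flag_missing <- function(df, cols) {
  missing <- is.na(df[, cols, drop = FALSE])
  hit <- unname(rowSums(missing) > 0)
  data.frame(
    Study = df$Study[hit],
    missing = vapply(which(hit),
                     function(i) paste(cols[missing[i, ]], collapse = "; "),
                     character(1)),
    stringsAsFactors = FALSE
  )
}

key_cols <- c("Body site", "UBERON ID", "Condition", "EFO ID")
needs_attention <- flag_missing(dump, key_cols)
print(needs_attention)
```

A row like "Study C" (condition entered, but no EFO ID) is the pattern the "Nutraceuticals" example describes: the term is present but was not mapped.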

lgeistlinger commented 11 months ago

Prior to releasing a new version to Zenodo, please run the release by me. If you give me a bit more time I'd be happy to help with identifying and cleaning up some more issues in the current data dump, and along the way spell out some more checks that I apply here. I am currently a bit swamped but things will clear up somewhat towards the end of the week.

jwokaty commented 11 months ago

@lgeistlinger Does this mean that Body site and Condition should not have NA values? I see rows that have Body site == NA or Condition == NA. If a curation (row) has State == Complete, has this already been sanity checked such that I only need to be concerned with curations/rows with State == NA?

@lwaldron What is the best way to resolve NA values? Is there someone whom I can work with to resolve the NAs or a way to facilitate the process with bugsigdb.org?

lgeistlinger commented 11 months ago

While there might be good reasons for having Body site == NA and/or Condition == NA in some instances, those should be rare and double-checked by reviewers such as Fatima, Rimsha, Chloe, or others who have recently taken on the job (I think Svetlana). Only if State == Reviewed would I say those can be taken as sanity checked. State == Complete just means that a curator considered the information complete, and some review / double-checking would still be warranted in such NA cases.
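One way to read this rule of thumb: rows with NA in Body site or Condition count as sanity checked only when State is Reviewed, and everything else goes back to a reviewer. A minimal sketch in base R on toy data; the helper name `needs_review` is hypothetical.

```r
# Toy curations (illustrative only)
curations <- data.frame(
  Study = c("Study A", "Study B", "Study C", "Study D"),
  State = c("Reviewed", "Complete", NA, "Complete"),
  `Body site` = c(NA, NA, "Feces", "Feces"),
  Condition = c("Obesity", "Obesity", NA, "Obesity"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: studies with an NA key field that were not marked Reviewed
needs_review <- function(df) {
  has_na <- is.na(df$`Body site`) | is.na(df$Condition)
  reviewed <- !is.na(df$State) & df$State == "Reviewed"
  df$Study[has_na & !reviewed]
}

to_check <- needs_review(curations)
print(to_check)
```

This only prioritizes the work. It does not certify the Reviewed rows, which may still deserve a second look.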

lgeistlinger commented 11 months ago

Cleaned up NA's in the Condition field of Study 538

jwokaty commented 11 months ago

@lgeistlinger Does that mean only that study needed to be cleaned up? Or you only cleaned up that study but other studies with NA still need to be cleaned up?

lgeistlinger commented 11 months ago

> Or you only cleaned up that study but other studies with NA still need to be cleaned up?

This is what I mean. We still need to go through the other NA instances.

jwokaty commented 11 months ago

@lgeistlinger Thanks.

Since I'm a little busy with the release @lwaldron I wanted to check if it is okay to complete all tasks after the release? The Outreachy contribution period will end on Monday I believe and I need to ask someone from the project to assist with the NAs.

lwaldron commented 11 months ago

For sure. But I'd still like to push the new Zenodo link to release 3.18 as well.

lgeistlinger commented 11 months ago

Cleaned up NA's in the Condition field of https://bugsigdb.org/Study_573. It is somewhat concerning that both studies I cleaned up so far were "reviewed" studies in which the condition field was not annotated properly. Curators and reviewers should devote extra care to annotating the condition and body site fields properly, as those fields are typically key to most downstream analyses.

lgeistlinger commented 11 months ago

Cleaned up NA's in the condition field of https://bugsigdb.org/Study_580. Also corrected several studies (eg Study 580, 590, 599) that annotated HIV infection by using the actual taxon as the condition (!) instead of the corresponding EFO term denoting the condition of HIV infection.

lgeistlinger commented 11 months ago

Cleaned up NA's in the Condition field of https://bugsigdb.org/Study_584

lgeistlinger commented 11 months ago

Cleaned up NA's in Body site and the Condition field of https://bugsigdb.org/Study_616

lgeistlinger commented 11 months ago

Cleaned up NA's in the Condition field of https://bugsigdb.org/Study_652

lgeistlinger commented 11 months ago

Cleaned up NA's in the Condition field of https://bugsigdb.org/Study_658

lgeistlinger commented 11 months ago

Cleaned up NA's in the Condition field of https://bugsigdb.org/Study_696

lgeistlinger commented 11 months ago

Cleaned up NA's in the Condition field of https://bugsigdb.org/Study_698

lgeistlinger commented 11 months ago

Cleaned up NA's in the Condition field of https://bugsigdb.org/Study_702

lgeistlinger commented 11 months ago

Cleaned up NA's in the Body site field of https://bugsigdb.org/Study_709

lgeistlinger commented 11 months ago

Cleaned up NA's in the Condition field of https://bugsigdb.org/Study_715

jwokaty commented 10 months ago

@lgeistlinger Thanks for cleaning up all of these. I think we only have NA in Condition for Study 823 and Study 850 left.

lgeistlinger commented 10 months ago

Yes I have also gone over to check and clean up NAs in some other key fields as well including:

@cmirzayi @lwaldron note that I noticed several instances of study duplication where the PMID was NA, and curators apparently didn't understand the "PMID needs to be unique" error message and instead forced their way through the system and entered everything but the PMID manually into the form.

See eg. https://bugsigdb.org/Study_837 which is a duplicate of https://bugsigdb.org/Study_836.

It might be a good idea to reiterate with curators that the first thing is to always check whether the paper already exists in BugSigDB and if they don't understand the error message then searching the PMID in the general search bar on BugSigDB is also an option. Or is it because two people started the curation of the same paper at about the same time? Maybe in this case, but I already deleted a handful of other duplicated studies that displayed that pattern.
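A possible way to surface this pattern automatically is to look for studies without a PMID whose title matches another study after light normalization. This is only a sketch on toy data: it assumes one row per study with `Study`, `PMID`, and `Title` columns, and `normalize_title` is a hypothetical helper.

```r
# Toy study table (illustrative titles and PMIDs, not real records)
studies <- data.frame(
  Study = c("Study A", "Study B", "Study C"),
  PMID = c("11111111", NA, "22222222"),
  Title = c("Gut microbiome in condition X",
            "Gut Microbiome in Condition X ",
            "Another paper"),
  stringsAsFactors = FALSE
)

# Lower-case, collapse punctuation/whitespace runs, trim the ends
normalize_title <- function(x) tolower(trimws(gsub("[^[:alnum:]]+", " ", x)))

key <- normalize_title(studies$Title)
dup_title <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Studies that share a title with another study and have no PMID
suspects <- studies[dup_title & is.na(studies$PMID), "Study"]
print(suspects)
```

Exact title matching will miss retitled papers, so a hit list like this is a starting point for manual review rather than a complete check.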

lgeistlinger commented 10 months ago

Note that there are some valid instances of NAs in the PMID column such as eg https://bugsigdb.org/Study_608 which are signatures compiled from cMD, and https://bugsigdb.org/Study_840 which is a preprint. This actually raises the question whether we include preprints in BugSigDB or require studies to be peer-reviewed @lwaldron ?

jwokaty commented 10 months ago

@lgeistlinger I just want to confirm that it's okay to proceed with doing the release? And just to clarify, do you want to look at the file again before I put the files in zenodo or are they already in a good state such that I can continue?

lwaldron commented 10 months ago

Yes we have allowed pre-prints to be added. PubMed now assigns PMIDs to biorxiv and medrxiv preprints, so we may be able to update these now, but that's not so important for the release I think.

lgeistlinger commented 10 months ago

> Yes we have allowed pre-prints to be added.

It appears this might introduce some potential for study duplication, eg if somebody curated the preprint and then a couple of months later another person curates the published paper. We might not even recognize that, as it is not uncommon for paper titles to change between preprint and publication stage.

lgeistlinger commented 10 months ago

> okay to proceed with doing the release?

I should have cleaned up all NAs in the above mentioned fields by tomorrow. There are still a lot of annotations that would benefit from more thorough review, eg I see lots of mistakes in the annotation of statistical test and Group 1 definition. Didn't even check the signatures yet. But will likely also not have the bandwidth to review much further and don't want to delay the release much further.

Would be good if I could check the version that you aim to release once more. There is a bit of a question how we go about releasing as it seems there is a lot of new data coming in from the outreachy students currently. Is there the option that I would commit a cleaned up version to BugSigDBExports and you go ahead releasing this one? Or are you basing the release necessarily on an hourly export?

jwokaty commented 10 months ago

@lgeistlinger We should select an hourly export for the release. You can also tell me which commit to select if you have a preference, such as the commit you last reviewed. Outreachy interns will be decided Monday, so we might see a reduction in the amount of new data after that. We can do it before or after the holiday, when you have time to review one final time. It's of course okay to do another release.

lgeistlinger commented 10 months ago

> We should select an hourly export for the release. You can also tell me which commit to select if you have a preference, such as the commit you last reviewed

Alright.

lgeistlinger commented 10 months ago

> Outreachy interns will be decided Monday

Ok then let's aim for doing the release Tue or Wed next week if that works for you.

lwaldron commented 10 months ago

We should also get all the duplicate signatures cleaned up as we've discussed this week in the #bugsigdb slack channel. @lgeistlinger feel free to post more assignments in that channel too. I think it is fine to base the release on a manual commit, just make sure to include the duplicates cleanup being done there too.

lwaldron commented 10 months ago

Oh I just saw above - yes @jwokaty is right it's better to use an hourly export for release.

lgeistlinger commented 10 months ago

> just make sure to include the duplicates cleanup being done there too.

Has this already been completed or is expected to be completed by Tue next week or should we hold on with the release until this has been completed?

lwaldron commented 10 months ago

I think it will be completed by then

jwokaty commented 10 months ago

@lwaldron Have duplicates been cleaned up?

jwokaty commented 10 months ago

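I think there are still a few duplicates: 16 are left when running the check below. To document the process for the next release, here's @lwaldron's code to check for duplicated signatures of length >= 5:

```r
# Biobase is added to the install list because Biobase::isUnique() is called below
BiocManager::install(c("bugsigdbr", "waldronlab/BugSigDBStats", "digest", "purrr", "dplyr", "Biobase"))
library(dplyr)  # provides the %>% pipe

# Import the devel export and extract signatures with at least 5 taxa
bsdb <- bugsigdbr::importBugSigDB(version = "devel")
sigs <- bugsigdbr::getSignatures(bsdb,
                                 tax.id.type = "taxname",
                                 min.size = 5)

# Hash each signature, keep the hashes that occur more than once,
# and sort so that duplicated signatures end up next to each other
purrr::map_chr(sigs, digest::digest) %>%
  .[!Biobase::isUnique(.)] %>%
  sort()
```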

See https://waldronlab.slack.com/archives/C04RATV9VCY/p1699929017492439.
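For readers without the packages installed, the same idea can be tried on toy data in base R. This sketch swaps the hash for a sorted, pasted key, so reordered taxa count as the same signature, whereas hashing the vector as-is treats a different order as a different signature. The signature names and the helper `sig_key` are made up for illustration.

```r
# Toy signatures (illustrative taxa; sig_3 is sig_1 in a different order)
sigs <- list(
  sig_1 = c("Bacteroides", "Prevotella", "Roseburia"),
  sig_2 = c("Escherichia", "Klebsiella"),
  sig_3 = c("Roseburia", "Bacteroides", "Prevotella")
)

# Hypothetical helper: order-independent key for one signature
sig_key <- function(taxa) paste(sort(unique(taxa)), collapse = "|")

keys <- vapply(sigs, sig_key, character(1))

# Keep every signature whose key occurs more than once
dups <- keys[duplicated(keys) | duplicated(keys, fromLast = TRUE)]

# Group the names of duplicated signatures by their shared key
groups <- split(names(dups), dups)
print(groups)
```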

jwokaty commented 10 months ago

Hi @lgeistlinger we're ready for you to review as duplicates have been removed. I think we want commit 8188136ea821e4c3a5daffdc255cabdd1edda2c6 to be the release since this is where the last significant changes were made. (I noticed the most recent commit is just changing the date; maybe we need to change that.)

Will you please perform a review of this commit? (Or if you'd like, I can create a release in github of the commit and you can review that.)

lgeistlinger commented 10 months ago

Thanks @jwokaty - I'll review the commit that you've pointed out over the weekend and we should aim for a release on Monday if that works on your end.

lgeistlinger commented 10 months ago

There is an issue with Study 562 that causes some malformatting in the export: https://github.com/waldronlab/BugSigDB/issues/209

lwaldron commented 10 months ago

@lgeistlinger could you provide a shell of a unit test that we could use in GHA to catch this type of malformed file automatically?
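A possible shape for such a check, sketched in base R so it runs without extra packages. What exactly was malformed in Study 562 is not stated here, so the two checks below (consistent field count per line, required columns present) are assumptions about what "malformed" could mean. The function name `check_dump` and the `required` column list are hypothetical, and the sketch assumes the header is on the first line of the file; if the export starts with a comment line, that line would need to be skipped first.

```r
# Hypothetical validator: returns a character vector of problems (empty if none)
check_dump <- function(path, required = c("Study", "Body site", "Condition")) {
  problems <- character(0)

  # Number of fields on each line; commas inside quotes do not split a field
  n <- count.fields(path, sep = ",", quote = "\"", comment.char = "")
  ragged <- which(n != n[1])
  if (length(ragged) > 0) {
    problems <- c(problems, paste("unexpected field count on line(s):",
                                  paste(ragged, collapse = ", ")))
  }

  # Read the header on its own so ragged rows cannot break parsing
  header <- scan(path, what = "", sep = ",", quote = "\"",
                 nlines = 1, quiet = TRUE)
  absent <- setdiff(required, header)
  if (length(absent) > 0) {
    problems <- c(problems, paste("missing column(s):",
                                  paste(absent, collapse = ", ")))
  }
  problems
}

# Toy files: one well-formed, one with a stray extra field
good <- tempfile(fileext = ".csv")
writeLines(c("Study,Body site,Condition",
             "Study A,Feces,\"Obesity, morbid\""), good)
bad <- tempfile(fileext = ".csv")
writeLines(c("Study,Body site,Condition",
             "Study A,Feces,Obesity,stray field"), bad)

good_result <- check_dump(good)
bad_result <- check_dump(bad)
print(bad_result)
```

Inside a testthat file, the result could be asserted with something like `expect_identical(check_dump(path), character(0))`, so that a CI run fails with the list of problems.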

lgeistlinger commented 10 months ago

Some more problems reported in slack: https://community-bioc.slack.com/archives/C04RATV9VCY/p1701627768758379

jwokaty commented 9 months ago

@lgeistlinger I've been trying to follow the issues on Slack, so I'm also asking here if everything has been resolved such that we can move forward with the release?

lgeistlinger commented 9 months ago

Thanks for following up @jwokaty - please go ahead with the release once you get a thumbs up from the reviewers that the above issue has been resolved.

jwokaty commented 9 months ago

R CMD check fails with the new release files.

R CMD check bugsigdbr_1.9.0.tar.gz                                                                                                                                       
* using log directory ‘/home/meatbag/Work/waldronlab/bugsigdbr/bugsigdbr.Rcheck’                                                                                                                                   
* using R version 4.3.2 (2023-10-31)                                                                                                                                                                               
* using platform: x86_64-pc-linux-gnu (64-bit)                                                                                                                                                                     
* R was compiled by                                                                                                                                                                                                
    gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0                                                                                                                                                                      
    GNU Fortran (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0                                                                                                                                                              
* running under: Ubuntu 22.04.3 LTS                                                                                                                                                                                
* using session charset: UTF-8                                                                                                                                                                                     
* checking for file ‘bugsigdbr/DESCRIPTION’ ... OK                                                                                                                                                                 
* this is package ‘bugsigdbr’ version ‘1.9.0’                                                                                                                                                                      
* package encoding: UTF-8                                                                                                                                                                                          
* checking package namespace information ... OK                                                                                                                                                                    
* checking package dependencies ... OK                                                                                                                                                                             
* checking if this is a source package ... OK                                                                                                                                                                      
* checking if there is a namespace ... OK                                                                                                                                                                          
* checking for executable files ... OK                                                                                                                                                                             
* checking for hidden files and directories ... OK                                                                                                                                                                 
* checking for portable file names ... OK                                                                                                                                                                          
* checking for sufficient/correct file permissions ... OK                                                                                                                                                          
* checking whether package ‘bugsigdbr’ can be installed ... OK                                                                                                                                                     
* checking installed package size ... OK                                                                                                                                                                           
* checking package directory ... OK                                                                                                                                                                                
* checking ‘build’ directory ... OK                                                                                                                                                                                
* checking DESCRIPTION meta-information ... OK                                                                                                                                                                     
* checking top-level files ... OK                                                                                                                                                                                  
* checking for left-over files ... OK                                                                                                                                                                              
* checking index information ... OK                                                                                                                                                                                
* checking package subdirectories ... OK                                                                                                                                                                           
* checking R files for non-ASCII characters ... OK                                                                                                                                                                 
* checking R files for syntax errors ... OK                                                                                                                                                                        
* checking whether the package can be loaded ... OK                                                                                                                                                                
* checking whether the package can be loaded with stated dependencies ... OK                                                                                                                                       
* checking whether the package can be unloaded cleanly ... OK                                                                                                                                                      
* checking whether the namespace can be loaded with stated dependencies ... OK                                                                                                                                     
* checking whether the namespace can be unloaded cleanly ... OK                                                                                                                                                    
* checking loading without being on the library search path ... OK                                                                                                                                                 
* checking dependencies in R code ... NOTE                                                                                                                                                                         
Unexported object imported by a ':::' call: ‘BiocFileCache:::.sql_set_expires’                                                                                                                                     
  See the note in ?`:::` about the use of this operator.                                                                                                                                                           
* checking S3 generic/method consistency ... OK                                                                                                                                                                    
* checking replacement functions ... OK                                                                                                                                                                            
* checking foreign function calls ... OK                                                                                                                                                                           
* checking R code for possible problems ... NOTE                                                                                                                                                                   
getMetaSignatures: no visible binding for global variable ‘Abundance in                                                                                                                                            
  Group 1’                                                                                                                                                                                                         
Undefined global functions or variables:                                                                                                                                                                           
  Abundance in Group 1                                                                                                                                                                                             
* checking Rd files ... OK                                                                                                                                                                                         
* checking Rd metadata ... OK                                                                                                                                                                                      
* checking Rd cross-references ... OK                                                                                                                                                                              
* checking for missing documentation entries ... OK                                                                                                                                                                
* checking for code/documentation mismatches ... OK                                                                                                                                                                
* checking Rd \usage sections ... OK                                                                                                                                                                               
* checking Rd contents ... OK                                                                                                                                                                                      
* checking for unstated dependencies in examples ... OK                                                                                                                                                            
* checking installed files from ‘inst/doc’ ... OK                                                                                                                                                                  
* checking files in ‘vignettes’ ... OK                                                                                                                                                                             
* checking examples ... ERROR                                                                                                                                                                                      
Running examples in ‘bugsigdbr-Ex.R’ failed
The error most likely occurred in:

> ### Name: getMetaSignatures
> ### Title: Obtain meta-signatures for a column of interest
> ### Aliases: getMetaSignatures
> 
> ### ** Examples
> 
>  df <- importBugSigDB()
Using cached version from 2023-12-19 19:45:51
> 
>  # Body-site specific meta-signatures composed from signatures reported as both 
>  # increased or decreased across all conditions of study:
>  bs.meta.sigs <- getMetaSignatures(df, column = "Body site")
> 
>  # Condition-specific meta-signatures from fecal samples, increased
>  # in conditions of study. Use taxonomic names instead of the default NCBI IDs:
>  df.feces <- df[df$`Body site` == "feces", ]
>  cond.meta.sigs <- getMetaSignatures(df.feces, column = "Condition", 
+                                      direction = "UP", tax.id.type = "taxname")
Error in names(sigs) <- paste(snames$id, snames$titles, sep = "_") : 
  'names' attribute [1] must be the same length as the vector [0]
Calls: getMetaSignatures -> getSignatures
Execution halted
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ...
  Running ‘testthat.R’
 ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
   1. └─bugsigdbr (local) checkSubset(sdf, "Body site", bpos[[b]], bneg[[b]]) at test-ontology.R:62:9
   2.   └─testthat::expect_true(all(pos %in% sdf[, col])) at test-ontology.R:27:5
  ── Failure ('test-ontology.R:71:9'): subsetByOntology ──────────────────────────
  all(pos %in% sdf[, col]) is not TRUE

  `actual`:   FALSE
  `expected`: TRUE 
  Backtrace:
      ▆
   1. └─bugsigdbr (local) checkSubset(sdf, "Condition", cpos[[ct]], cneg[[ct]]) at test-ontology.R:71:9
   2.   └─testthat::expect_true(all(pos %in% sdf[, col])) at test-ontology.R:27:5

  [ FAIL 6 | WARN 0 | SKIP 0 | PASS 169 ]
  Error: Test failures
  Execution halted
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in ‘inst/doc’ ... OK
* checking running R code from vignettes ...
  ‘bugsigdbr.Rmd’ using ‘UTF-8’... OK
 NONE
* checking re-building of vignette outputs ... OK
* checking PDF version of manual ... OK
* DONE

Status: 2 ERRORs, 2 NOTEs
See
  ‘/home/meatbag/Work/waldronlab/bugsigdbr/bugsigdbr.Rcheck/00check.log’
for details.

Note: It might be good to periodically run R CMD check on the new files.

jwokaty commented 9 months ago

@lgeistlinger @lwaldron I noticed that some of the capitalization has changed since the last BugSigDBExports release. I'm updating the tests that are affected by this, but I wonder if this might be a problem. For example, feces in v1.1.0 https://raw.githubusercontent.com/waldronlab/BugSigDBExports/v1.1.0/full_dump.csv is now Feces in devel https://raw.githubusercontent.com/waldronlab/BugSigDBExports/devel/full_dump.csv. You can see the related changes at 3f6ee5c.
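To illustrate why this matters: an exact comparison against "feces" selects nothing once the term is stored as "Feces", which would explain an empty subset in the failing example. A minimal sketch of a case-insensitive match on toy data; the helper `match_term` is hypothetical and not part of bugsigdbr.

```r
# Toy data after the capitalization change (illustrative only)
df <- data.frame(
  `Body site` = c("Feces", "Feces", "Skin", NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive, NA-safe term match
match_term <- function(x, term) !is.na(x) & tolower(x) == tolower(term)

# Exact match on the old lower-case spelling; which() drops the NA comparison
exact <- df[which(df$`Body site` == "feces"), , drop = FALSE]

# Case-insensitive match finds the rows under either spelling
loose <- df[match_term(df$`Body site`, "feces"), , drop = FALSE]

print(c(exact = nrow(exact), loose = nrow(loose)))
```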

lgeistlinger commented 9 months ago

Thanks @jwokaty for pointing this out. There is indeed a problem with capitalization of the condition field upstream: https://github.com/waldronlab/BugSigDB/issues/111

lgeistlinger commented 8 months ago

There is some progress on https://github.com/waldronlab/BugSigDB/issues/111 but I'd like to wait for all conditions being first-letter capitalized before moving ahead with the new release, as we otherwise have to change the code, examples / vignettes, and tests twice.
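Whether all terms are first-letter capitalized can be checked mechanically before picking a commit. The sketch below is base R on a toy vector; `cap_first` is a hypothetical helper that could also be used to update old lower-case terms in tests or examples.

```r
# Toy terms (illustrative only)
terms <- c("Feces", "obesity", "HIV infection", NA, "Skin of body")

# Terms whose first character differs from its upper-case form
first <- substr(terms, 1, 1)
not_capitalized <- terms[!is.na(terms) & first != toupper(first)]
print(not_capitalized)

# Hypothetical helper: capitalize the first letter, leave the rest and NAs alone
cap_first <- function(x) {
  ifelse(is.na(x), NA_character_,
         paste0(toupper(substr(x, 1, 1)), substring(x, 2)))
}

fixed <- cap_first(terms)
print(fixed)
```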

lgeistlinger commented 8 months ago

Hi @jwokaty the capitalization of condition terms has been resolved in https://github.com/waldronlab/BugSigDB/issues/111 and you can now go ahead and prepare a new zenodo release and update bugsigdbr::importBugSigDB to use the new release version.

I checked and it looks like this waldronlab/BugSigDBExports dump would be suitable for a full release:

acd70d7

Please just let me know if you have any questions or concerns.

jwokaty commented 8 months ago

@lgeistlinger Thanks for selecting the commit.

I was thinking that it might be nice to note the change in the capitalization in the news file. Would "Make Body site and Condition terms uniform with first-letter capitalization" be accurate to add in the news file?

lgeistlinger commented 8 months ago

Yes maybe slightly rephrased: "Uniform first-letter capitalization for Body site and Condition terms"