ropensci / software-review

rOpenSci Software Peer Review.

Submission: BaseSet #359

Closed llrs closed 3 years ago

llrs commented 4 years ago

Submitting Author: Lluís (@llrs)
Repository: llrs/BaseSet
Version submitted: 0.0.10
Editor: @annakrystalli
Reviewer 1: @arendsee
Reviewer 2: @j23414
Archive: TBD
Version accepted: TBD


Package: BaseSet
Title: Provides classes for working with sets
Version: 0.0.10
Authors@R: 
    person(given = "Lluís ",
           family = "Revilla Sancho",
           role = c("aut", "cre"),
           email = "lluis.revilla@gmail.com")
Description: A set collection, while not "tidy" in itself, can
    be thought of as three tidy data frames describing sets, elements and
    relations respectively. 'BaseSet' provides an approach to manipulate,
    load and use these virtual data frames.
License: MIT + file LICENSE
URL: https://github.com/llrs/BaseSet
BugReports: https://github.com/llrs/BaseSet/issues
Depends: 
    R (>= 3.6.0)
Imports: 
    dplyr (>= 0.7.8),
    magrittr,
    methods,
    rlang,
    utils,
    xml2
Suggests: 
    BiocStyle,
    covr,
    forcats,
    ggplot2,
    GO.db,
    GSEABase,
    knitr,
    org.Hs.eg.db,
    reactome.db,
    rmarkdown,
    spelling,
    testthat (>= 2.1.0),
    Biobase
VignetteBuilder: 
    knitr
Encoding: UTF-8
Language: en-US
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.0.2
Collate: 
    'AllClasses.R'
    'AllGenerics.R'
    'GMT.R'
    'GeneSetCollection.R'
    'activate.R'
    'add.R'
    'add_column.R'
    'add_relation.R'
    'adjacency.R'
    'arrange.R'
    'basesets-package.R'
    'cartesian.R'
    'complement.R'
    'data_frame.R'
    'deactivate.R'
    'droplevels.R'
    'elements.R'
    'filter.R'
    'group.R'
    'group_by.R'
    'head.R'
    'incidence.R'
    'independent.R'
    'operations.R'
    'intersection.R'
    'length.R'
    'list.R'
    'move_to.R'
    'mutate.R'
    'names.R'
    'naming.R'
    'nested.R'
    'obo.R'
    'power_set.R'
    'print.R'
    'pull.R'
    'relations.R'
    'remove.R'
    'remove_column.R'
    'rename.R'
    'select.R'
    'set.R'
    'size.R'
    'subtract.R'
    'tidy-set.R'
    'union.R'
    'utils-pipe.R'
    'xml.R'
    'zzz.R'

Scope

The package implements methods to work on sets, performing intersection, union, complement and other set operations in a "tidy" way. It also allows importing sets from several formats used in the life sciences, such as GMT and GAF, or the OBO format for ontologies.

The idea is to use the package for working with sets and gene signatures in scRNA-seq, or with pathways and ontologies, but it might be useful in other fields too.

There is the sets package, which implements a more generalized approach that can store functions or lists as elements of a set (while mine only allows storing a character or factor), but it is harder to operate on in a tidy/long way. Also, its intersection and union operations need to happen between two different objects, while TidySet objects (the class implemented in BaseSet) can store a single set or thousands of them. In BaseSet it is easier to operate and to implement new fuzzy logic operations. It is developed openly on GitHub, while for sets I couldn't track how it is being developed.

The GSEABase package partially implements this, but it doesn't allow storing fuzzy sets and it is also quite slow, as it creates several classes for annotating each set. Neither does the BiocSet package, which doesn't use fuzzy set logic.

There is also the hierarchicalSets package, which is focused on clustering of sets that are inside other sets and on visualizations. BaseSet, however, is focused on storing and manipulating sets, including hierarchical sets.

Most of the replies are copied from #339, handled by @melvidoni.

Technical checks

Confirm each of the following by checking the box. This package:

Publication options

JOSS Options
- [ ] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements).
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:
- (*Do not submit your package separately to JOSS*)
MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

annakrystalli commented 4 years ago

Hi @llrs! 👋

I'll be the editor handling your review. I'm starting the initial editors' checks shortly. Will check back in when they are done. 😊

annakrystalli commented 4 years ago

Hello again @llrs!

Thanks for your patience with the editors' checks. I've been traveling, but I've also been having some issues with the checks which I'm having to seek some help with. I will be with you shortly, if not with the full checks, at least with those that I am able to perform successfully.

annakrystalli commented 4 years ago

Editor checks:


Editor comments

Installation issues

When I try to do a full install and build vignettes, some suggested bioconductor dependencies are not installing and causing errors when attempting to build the vignettes.

devtools::install_github("llrs/BaseSet", dependencies = T, build_vignettes = T)
#> Installing 7 packages: BiocStyle, GO.db, GSEABase, org.Hs.eg.db, reactome.db, Biobase, BH
#> Error: (converted from warning) packages 'BiocStyle', 'GO.db', 'GSEABase', 'org.Hs.eg.db', 'reactome.db', 'Biobase' are not available (for R version 3.6.0)

***

   Error: Vignette re-building failed.
   Execution halted
Error in (function (command = NULL, args = character(), error_on_status = TRUE,  : 
  System command error

Reviewers, as well as potential future collaborators who will need to run devtools::check() and devtools::document(), need a section in the README or a separate CONTRIBUTING.md documenting the installation instructions required to install all dependencies (including suggested ones).

eg:

BiocManager::install("BiocStyle")
BiocManager::install("org.Hs.eg.db", type = "source")
BiocManager::install("GO.db", type = "source")
BiocManager::install("reactome.db", type = "source")

# test that package install and docs build successfully
devtools::install_github("llrs/BaseSet", dependencies = T, build_vignettes = T, force = T)

You can check out more of the adventures I had troubleshooting in this issue comment for more details. If you have some extra insight (as I imagine you are more familiar with bioconductor) to add to the discussion it would be more than welcome!

check() and test()

All good

goodpractice::gp() output

A few flags for minor formatting.

It is good practice to

  ✖ avoid long code lines, it is bad for
    readability. Also, many people prefer editor windows
    that are about 80 characters wide. Try make your
    lines shorter than 80 characters

    R/AllClasses.R:105:1
    R/AllGenerics.R:149:1
    R/AllGenerics.R:163:1
    R/length.R:171:1
    R/obo.R:122:1
    ... and 16 more lines

This is actually a bit more important. Is there a reason you are importing methods and utils packages as a whole?

  ✖ not import packages as a whole, as this
    can cause name clashes between the imported
    packages. Instead, import only the specific
    functions you need.
──────────────────────────────────────────────────────────────── 
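For reference, a minimal sketch of the roxygen2 approach to importing only specific functions instead of whole packages; the function names below are illustrative assumptions, not necessarily the ones BaseSet actually needs:

## Instead of `#' @import methods` / `#' @import utils`, list the functions used:
#' @importFrom methods is new setClass setGeneric setMethod validObject
#' @importFrom utils head tail
NULL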

Test coverage

Test coverage is overall good, but is there a reason some files are not tested at all or are below 75%?

See the files indicated below:

BaseSet Coverage: 89.13%
R/xml.R: 0.00%
R/elements.R: 62.50%
R/set.R: 62.50%
R/size.R: 71.43%
R/pull.R: 73.33%

@llrs, could you please add the rOpenSci under review badge to your README?

[![](https://badges.ropensci.org/359_status.svg)](https://github.com/ropensci/software-review/issues/359)

Other than these minor issues, I am happy to start seeking reviewers.


Reviewers: Due date:

llrs commented 4 years ago

Many thanks for your review @annakrystalli

I'll modify the package to address your points and should have them addressed by the beginning of next week. I'll let you know when I've updated the package.

annakrystalli commented 4 years ago

Hello again @llrs,

Aha, it seems I have been trying to install the package the wrong way all this time, by using devtools::install_github() as I normally do for editors' checks, when I should be using BiocManager::install().

However, when I try to install using BiocManager::install("llrs/BaseSet", dependencies = TRUE, build_vignettes = TRUE) I still get the same errors for a couple of suggested Bioconductor packages that I had removed to test it out.

I'm really intrigued by the fact that the package builds successfully for you on Travis. Looking at your travis.yml I see you are testing on r: bioc-devel only. I think this is where the testing computational environment and my own (and those of reviewers and future collaborators) differ. I feel that also testing on r: release, at the very least, would give a better idea of what challenges others would encounter. Would you mind trying that out for me?

llrs commented 4 years ago

Some of these errors might be due to a failed build on Bioconductor; they recently made some changes to the checks performed on packages. Could you please post the error you got?

On Travis I am only testing on bioc-devel, but on my computer I am using R 3.6.1 with Bioconductor 3.10 and it builds successfully. I'm not sure this difference is the source of the problems you found. However, I'll add the build and tests for r: release.

annakrystalli commented 4 years ago

So the errors I am getting are the same as before with devtools::install_github(): namely, Bioconductor packages that need installing from source are missing. I've deleted the GO.db package from my system, so now the vignette that loads it is failing. Here's a condensed error output:

BiocManager::install("llrs/BaseSet", dependencies = TRUE, build_vignettes = TRUE)
E  creating vignettes (8.6s)
   --- re-building ‘advanced.Rmd’ using rmarkdown

   Attaching package: 'dplyr'

.
.
.

   Quitting from lines 49-56 (advanced.Rmd) 
   Error: processing vignette 'advanced.Rmd' failed with diagnostics:
   there is no package called 'GO.db'
   --- failed re-building ‘advanced.Rmd’

   --- re-building ‘basic.Rmd’ using rmarkdown
   --- finished re-building ‘basic.Rmd’

   --- re-building ‘fuzzy.Rmd’ using rmarkdown
   --- finished re-building ‘fuzzy.Rmd’

   SUMMARY: processing the following file failed:
     ‘advanced.Rmd’

   Error: Vignette re-building failed.
   Execution halted
Error: Failed to install 'BaseSet' from GitHub:
  System command error, exit status: 1, stdout + stderr (last 10 lines):
E> --- finished re-building ‘basic.Rmd’
E> 
E> --- re-building ‘fuzzy.Rmd’ using rmarkdown
E> --- finished re-building ‘fuzzy.Rmd’
E> 
E> SUMMARY: processing the following file failed:
E>   ‘advanced.Rmd’
E> 
E> Error: Vignette re-building failed.
E> Execution halted

The reason this would not fail on your system is that you already have all the necessary packages installed, but reviewers or new contributors won't necessarily. The reason I think the Travis tests are passing is that you are testing on bioc-devel; I can't think of why else installation on Travis works while it doesn't for me. Testing on r: release feels like it would give you a more realistic picture of what manual install steps would be required at any point by developers/reviewers to be able to fully test and check the package.

Out of interest, you could try removing GO.db locally on your system and see what happens. For info, here's my session info too (i.e. I'm on a Mac, not Windows).

> session_info()
─ Session info ───────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.0 (2019-04-26)
 os       macOS Mojave 10.14.3        
 system   x86_64, darwin15.6.0        
 ui       RStudio                     
 language (EN)                        
 collate  en_GB.UTF-8                 
 ctype    en_GB.UTF-8                 
 tz       America/Los_Angeles         
 date     2020-01-30                  

─ Packages ───────────────────────────────────────────────────────────────────────────
 package     * version     date       lib source                             
 assertthat    0.2.1       2019-03-21 [1] CRAN (R 3.6.0)                     
 backports     1.1.5       2019-10-02 [1] CRAN (R 3.6.0)                     
 BiocManager   1.30.10     2019-11-16 [1] CRAN (R 3.6.0)                     
 callr         3.4.1       2020-01-24 [1] CRAN (R 3.6.0)                     
 cli           2.0.1       2020-01-08 [1] CRAN (R 3.6.0)                     
 clipr         0.7.0       2019-07-23 [1] CRAN (R 3.6.0)                     
 crayon        1.3.4       2017-09-16 [1] CRAN (R 3.6.0)                     
 curl          4.3         2019-12-02 [1] CRAN (R 3.6.0)                     
 desc          1.2.0       2018-05-01 [1] CRAN (R 3.6.0)                     
 devtools    * 2.2.1       2019-09-24 [1] CRAN (R 3.6.0)                     
 digest        0.6.23      2019-11-23 [1] CRAN (R 3.6.0)                     
 ellipsis      0.3.0       2019-09-20 [1] CRAN (R 3.6.0)                     
 evaluate      0.14        2019-05-28 [1] CRAN (R 3.6.0)                     
 fansi         0.4.1       2020-01-08 [1] CRAN (R 3.6.0)                     
 fs            1.3.1       2019-05-06 [1] CRAN (R 3.6.0)                     
 glue          1.3.1       2019-03-12 [1] CRAN (R 3.6.0)                     
 htmltools     0.4.0       2019-10-04 [1] CRAN (R 3.6.0)                     
 knitr         1.27        2020-01-16 [1] CRAN (R 3.6.0)                     
 magrittr      1.5         2014-11-22 [1] CRAN (R 3.6.0)                     
 memoise       1.1.0       2017-04-21 [1] CRAN (R 3.6.0)                     
 packrat       0.5.0       2018-11-14 [1] CRAN (R 3.6.0)                     
 pkgbuild      1.0.6       2019-10-09 [1] CRAN (R 3.6.0)                     
 pkgload       1.0.2       2018-10-29 [1] CRAN (R 3.6.0)                     
 prettyunits   1.1.1       2020-01-24 [1] CRAN (R 3.6.0)                     
 processx      3.4.1       2019-07-18 [1] CRAN (R 3.6.0)                     
 ps            1.3.0       2018-12-21 [1] CRAN (R 3.6.0)                     
 R6            2.4.1       2019-11-12 [1] CRAN (R 3.6.0)                     
 Rcpp          1.0.3       2019-11-08 [1] CRAN (R 3.6.0)                     
 remotes       2.1.0       2019-06-24 [1] CRAN (R 3.6.0)                     
 reprex      * 0.3.0       2019-05-16 [1] CRAN (R 3.6.0)                     
 rlang         0.4.3       2020-01-24 [1] CRAN (R 3.6.0)                     
 rmarkdown     2.1         2020-01-20 [1] CRAN (R 3.6.0)                     
 rprojroot     1.3-2       2018-01-03 [1] CRAN (R 3.6.0)                     
 rstudioapi    0.10.0-9000 2019-05-30 [1] Github (rstudio/rstudioapi@31d1afa)
 sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 3.6.0)                     
 testthat    * 2.3.1       2019-12-01 [1] CRAN (R 3.6.0)                     
 usethis     * 1.5.1.9000  2020-01-29 [1] Github (r-lib/usethis@4194fd6)     
 whisker       0.4         2019-08-28 [1] CRAN (R 3.6.0)                     
 withr         2.1.2       2018-03-15 [1] CRAN (R 3.6.0)                     
 xfun          0.12        2020-01-13 [1] CRAN (R 3.6.0)                     

[1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
llrs commented 4 years ago

I've updated the package as requested on your initial review.

However, I think that the .travis.yml configuration that better matches CRAN is using bioc-release, because otherwise it doesn't install the Bioconductor packages. I hope that with the new instructions for installing the Bioconductor packages, reviewers and users will be able to install all of its dependencies.

annakrystalli commented 4 years ago

Hi @llrs,

Sorry, I probably didn't make myself clear enough. For sure you should be testing on bioc-release (and perhaps on bioc-devel also). What I was suggesting is that you run a matrix of jobs that includes testing on r: release, with any manual setup code included during the build, to check both that any awkward dependencies required for a reviewer/collaborator to work with the source code locally have: a) been identified, and b) had their installation successfully specified.

Does that make sense? I've opened a PR to your repo with my attempt at implementing this. Let's see if it's successful!

llrs commented 4 years ago

Well, my impression was that CI should mimic the steps done on CRAN, not those done by users. The steps are different, given that CRAN installs Bioconductor packages in order to be able to build packages which depend on Bioconductor.
That's why I modified the .travis.yml to only test on bioc-release, which is also what users need in order to install BaseSet. But it makes sense also to test the specific steps that users need to follow.

I will try to modify the GitHub Actions to test also on macOS, Windows and Linux in both conditions, to be sure.

annakrystalli commented 4 years ago

Hello again @llrs 👋

I'm still looking for reviewers currently. I had a potential reviewer respond recently that sadly they did not have time at the moment but they also made the following comment:

the description in the Readme doesn't give me enough information to figure out what this package is about.

I generally agree with the comment and was imagining it would be picked up during review but I now think it's probably best to deal with it now as it might be making it difficult to find reviewers.

So could you add a little more detail to the README as well as the description in the DESCRIPTION file? In particular, the information you have included here (in the package submission issue) gives a good idea of the strengths and domains of the package which are not so clear currently in the README.

llrs commented 4 years ago

Hi @annakrystalli,

Thanks for sending the feedback, I appreciate it. I've edited both the DESCRIPTION and the README. Hope this way it will be easier to get reviewers.

llrs commented 4 years ago

@annakrystalli I've been experimenting with macOS builders, and using BiocManager::install('llrs/BaseSet', dependencies = TRUE, build_vignettes = TRUE) it could install BaseSet and all its dependencies from Bioconductor without problems.
The test was on macOS Catalina; CRAN checks the packages on El Capitan, and you are using Mojave. I'm not sure if I can find a CI with Mojave.

noamross commented 4 years ago

⚠️⚠️⚠️⚠️⚠️

In the interest of reducing load on reviewers and editors as we manage the COVID-19 crisis, rOpenSci is temporarily pausing new submissions for software peer review for 30 days (and possibly longer). Please check back here again after 17 April for updates.

In this period new submissions will not be handled, nor new reviewers assigned. Reviews and responses to reviews will be handled on a 'best effort' basis, but no follow-up reminders will be sent.

Other rOpenSci community activities continue. We express our continued great appreciation for the work of our authors and reviewers. Stay healthy and take care of one another.

The rOpenSci Editorial Board

⚠️⚠️⚠️⚠️⚠️

llrs commented 4 years ago

About the installation problems, there has been a discussion about problems installing packages from source and from binaries. See these messages from the Bioc-devel mailing list: https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016689.html. In summary, it seems that there is a bug in the logic of install.packages when mixing binary and source installs.

Just for future reference.

annakrystalli commented 4 years ago

⚠️⚠️⚠️⚠️⚠️

In the interest of reducing load on reviewers and editors as we manage the COVID-19 crisis, rOpenSci new submissions for software peer review are paused.

In this period new submissions will not be handled, nor new reviewers assigned. Reviews and responses to reviews will be handled on a 'best effort' basis, but no follow-up reminders will be sent. Other rOpenSci community activities continue.

Please check back here again after 25 May when we will be announcing plans to slowly start back up.

We express our continued great appreciation for the work of our authors and reviewers. Stay healthy and take care of one another.

The rOpenSci Editorial Board

⚠️⚠️⚠️⚠️⚠️

llrs commented 4 years ago

@annakrystalli It's been a week since the last message, and I haven't seen any plan announced here, on the website or on Twitter. Is there a plan to start looking for reviewers again? PS: I am considering submitting directly to CRAN without waiting for a review from the rOpenSci volunteers, if it fits the new scope.

annakrystalli commented 4 years ago

Hi @llrs,

Apologies for the slow motion on here. I had a reviewer agree to review before lockdown, but I'm still having trouble locating the second one. I should have communicated that here last week. Indeed, if you have any suggestions for potential reviewers (who would not have a conflict of interest), they would be gratefully received.

Regarding CRAN, I think it is fine to submit to CRAN independently of the rOpenSci review. You could always push a new release to CRAN once the rOpenSci review is done.

As for publicising our return to reviewing, we are currently just gauging submission rates before we make a full announcement next week.

llrs commented 4 years ago

Hi @annakrystalli Thanks for the quick reply, hope that everything is fine at your place.

I don't know how important it is to wait for the second reviewer; maybe the first one can start without waiting for the second? There are some authors of related packages whose feedback I would appreciate even if they have some conflict of interest with the package (their GitHub usernames): RaphaelS1, kevinrue.

annakrystalli commented 4 years ago

The conflict of interest concern is really for us (rOpenSci), to ensure we get a fair and unbiased review. In any case, I could probably start the review with just a single reviewer. Let me just run all this past the rest of the editors and get back to you.

annakrystalli commented 4 years ago

Success!!! We have now both our reviewers! 🥳🙌

Thanks so much for agreeing to review @arendsee and @j23414 👍


Reviewer: @arendsee
Reviewer: @j23414
Due date: 2020-06-29

llrs commented 4 years ago

Many thanks in advance @arendsee and @j23414. As you can see in this issue, you might find some difficulties installing the package. Let me know and I'll be happy to help with them.

arendsee commented 4 years ago

@llrs I installed the package from github and devtools::check() passed without unusual trouble (I'm on R4.0.0). So far so good!

j23414 commented 4 years ago

@llrs Same here, I used BiocManager::install, and library(BaseSet) looks good on MacOS, R4.0.0. I'll clone the repo and do a devtools::check

arendsee commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

One of the vignettes should discuss the theory behind the fuzzy set operators (see comments below).

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

Functionality

Currently check fails on my system (R4.0.0 on Linux).

Final approval (post-review)

Estimated hours spent reviewing:


Review Comments

BaseSet is a well-designed and neatly written package that has clear value within the bioinformatics community and beyond. The functions are well documented and the test coverage is quite good. I think BaseSet should be admitted to rOpenSci once the issues I detail below are addressed. Most of them are minor. There are two major points:

  1. check currently does not pass on my system. I can build the vignettes and run the examples, so this is likely a minor dependency issue, or something, but it needs to be fixed.

  2. The theory behind the package (fuzzy set theory) needs much more explanation and justification. As it stands, I think most bioinformaticians and statisticians will misinterpret what the package is actually doing.

Stylistic issues

Line-by-line comments

A quick scan of the code tells me it is generally clean and well organized. I read the vignettes and spot-checked a few files in more detail. Below I've listed a few minor issues with file names and line numbers relative to commit 7d0f6da.

spelling and grammar

miscellaneous

(Optional) in the advanced vignette, replace humans with yeast

The human genome, reactome, and GO dependencies of the advanced vignette take a very long time to download. Since one use case of a vignette is as a demo for a class or workshop, an hour long download is not ideal. If you want, you could replace the human data sets with smaller ones from yeast.

Clarify annotations of "..."

Below is a list of every "..." documentation string:

mutate.R:12:       #' @param ... The logical predicates in terms of the variables of the sets
power_set.R:8:     #' @param ... Other arguments passed down if possible
GMT.R:8:           #' @param ... Other arguments passed to strsplit
group.R:8:         #' @param ... A logical condition to subset some elements
complement.R:80:   #' @param ... Other arguments passed to either \code{\link{complement_set}} or
group_by.R:12:     #' @param ... The logical predicates in terms of the variables of the sets
list.R:17:         #' @param ... objects, possibly named (currently ignored).
select.R:14:       #' @param ... The name of the columns you want to keep, remove or rename.
droplevels.R:36:   #' @param ... Other arguments, currently ignored.
data_frame.R:22:   #' @param ... Other arguments currently ignored.
filter.R:13:       #' @param ... The logical predicates in terms of the variables of the sets
union.R:19:        #' @param ... Other arguments.
add.R:92:          #' @param ... Other arguments passed along.
cartesian.R:11:    #' @param ... Other arguments passed down if possible
AllGenerics.R:328: #' @param ... Other arguments.
AllGenerics.R:352: #' @param ... Other arguments.
AllGenerics.R:372: #' @param ... Other arguments.
AllGenerics.R:392: #' @param ... Other arguments.
AllGenerics.R:413: #' @param ... Other arguments.
size.R:5:          #' @param ... Other arguments to filter which size should be shown.
arrange.R:14:      #' @param ... Comma separated list of variables names or expressions
Most of your documentation for "..." is fine. For example, "Comma separated list of variables names or expressions" tells me exactly what I need to know. Also, "Other arguments, currently ignored" is perfectly acceptable. However, "Other arguments" is not helpful, and I would have to dig through the source code to figure out what parameters are supported (and I've had to do exactly this for a lot of packages).

Document and test the handling of empty sets

A TidySet is a set of elements

x <- list(a=c("A", "B"), b=character(0))
s <- tidySet(x)  # good - no errors are raised and the set looks right
elements(s)      # good 
sets(s)          # debatable - the empty set disappeared

The "right" behaviour of sets(s) is debatable, but whatever you decide, the behavior of edge cases should be documented and covered in the test suite. I have no problem with the empty set disappearing, though it could cause problems with downstream code if the user assumes every input set will be listed in sets(s).

In the sets documentation, I think the behavior would be clear if you just changed the Description to, "Given a TidySet, retrieve all non-empty sets or substitute them".
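For example, a minimal regression-test sketch for this edge case; it assumes sets() keeps returning a data frame with a sets column, so adjust the expectation to whatever behaviour you decide to document:

library(testthat)
library(BaseSet)

test_that("an empty input set is silently dropped", {
  x <- list(a = c("A", "B"), b = character(0))
  s <- tidySet(x)
  expect_true(methods::is(s, "TidySet"))
  # Pin down the documented behaviour: "b" should not appear among the sets.
  expect_false("b" %in% sets(s)$sets)
})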

Document and test the ordering of sets

Should users assume sets are unordered? If so, you should explicitly say that sets are unordered and the order in which they are printed may change between implementations and should not be relied on in production code.

What happens when I add an element to a set? Is a new row added to the end of the table?

The add_column function seems a bit dangerous since it assumes an order. Just something to think about, you don't need to change anything.

Fuzzy set math explanation

Consider the following borrowed from the fuzzy.Rmd vignette:

> set.seed(4567)
> relations <- data.frame(sets = c(rep("A", 5), "B", "C"),
                          elements = c(letters[seq_len(6)], letters[6]),
                          fuzzy = runif(7))
> fuzzy_set <- tidySet(relations)
> fuzzy_set
  elements sets     fuzzy
1        a    A 0.2309186
2        b    A 0.7412554
3        c    A 0.1833834
4        d    A 0.4221682
5        e    A 0.5996399
6        f    B 0.2773313
7        f    C 0.7307312
> BaseSet::union(fuzzy_set, sets=c("C", "B"))
  elements sets     fuzzy
1        f  C∪B 0.7307312

When I first saw this result, I thought the math was simply wrong. I would expect that the probability that f is in (C or B) would be greater than or equal to the probability that f is only in C. Why is it exactly equal? Statistically, this would only be true if B were a subset of C.

After a bit of research, though, I found you are indeed using the standard operator definitions for fuzzy sets. If the fuzziness were the same as probability, then the union operator is a function that returns the lower bound of the probability of membership in the larger set. Do sets of genes expressed in different tissues follow the assumptions the union operator is making?

Fuzzy sets are based on fuzzy logic. In fuzzy logic, the 0 to 1 values are not probabilities; they are truth values. They are measures of the vagueness of a term, not the chance of membership. You should describe the difference in both your README and the fuzzy vignette.

Most statisticians and bioinformaticians will not be familiar with fuzzy logic. Also, most programs in the field deal with classification probabilities, not fuzzy truth values. That said, it is possible that the values associated with tissue annotations are measures of the degree to which a gene is active in the specific tissue, in which case fuzzy logic may apply.

So, you definitely need to describe the math and statistics behind the operators. Is there any prior research on the application of fuzzy sets in bioinformatics? If so, you might want to add some citations. You also need to justify the choice to use fuzzy set theory rather than more conventional statistics.
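To make the contrast concrete, here is the f example from the vignette computed both ways in plain R (no BaseSet involved); the second form is only meaningful if the two fuzzy values are treated as independent membership probabilities:

f_in_B <- 0.2773313
f_in_C <- 0.7307312

# Standard (Zadeh) fuzzy union, the "max" default used by BaseSet::union():
max(f_in_B, f_in_C)
#> [1] 0.7307312

# Probabilistic union under independence, what a reader might expect for P(f in B or C):
f_in_B + f_in_C - f_in_B * f_in_C
#> [1] 0.8054079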

arendsee commented 4 years ago

About the failing check, for some reason, the first time I ran check everything passed. Possibly because of what I had defined in my R environment before running check. I'm not sure since I haven't been able to replicate the first successful check. I've attached the check result.

check-result.txt

And here is my session info:

─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.0.0 (2020-04-24)
 os       Manjaro Linux               
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Chicago             
 date     2020-06-17                  

─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
 backports     1.1.7   2020-05-13 [1] CRAN (R 4.0.0)
 callr         3.4.3   2020-03-28 [1] CRAN (R 4.0.0)
 cli           2.0.2   2020-02-28 [1] CRAN (R 4.0.0)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
 desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.0)
 devtools    * 2.3.0   2020-04-10 [1] CRAN (R 4.0.0)
 digest        0.6.25  2020-02-23 [1] CRAN (R 4.0.0)
 ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
 fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
 fs            1.4.1   2020-04-04 [1] CRAN (R 4.0.0)
 glue          1.4.1   2020-05-13 [1] CRAN (R 4.0.0)
 magrittr    * 1.5     2014-11-22 [1] CRAN (R 4.0.0)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.0)
 pkgbuild      1.0.8   2020-05-07 [1] CRAN (R 4.0.0)
 pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.0)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
 processx      3.4.2   2020-02-09 [1] CRAN (R 4.0.0)
 ps            1.3.3   2020-05-08 [1] CRAN (R 4.0.0)
 R6            2.4.1   2019-11-12 [1] CRAN (R 4.0.0)
 remotes       2.1.1   2020-02-15 [1] CRAN (R 4.0.0)
 rlang         0.4.6   2020-05-02 [1] CRAN (R 4.0.0)
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 4.0.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
 testthat      2.3.2   2020-03-02 [1] CRAN (R 4.0.0)
 usethis     * 1.6.1   2020-04-29 [1] CRAN (R 4.0.0)
 withr         2.2.0   2020-04-20 [1] CRAN (R 4.0.0)

[1] /home/z/R/library
[2] /home/z/R/x86_64-pc-linux-gnu-library/4.0
[3] /usr/lib/R/library
llrs commented 4 years ago

Many thanks for the fast, long and thoughtful review @arendsee! I will start fixing some of the issues you mentioned while I think about the theoretical question. Hope to reply to them in a couple of days.

llrs commented 4 years ago

I updated the package following your comments @arendsee. See below for some questions and comments that I still need to address.

Review Comments

Stylistic issues

Line-by-line comments

(Optional) in the advanced vignette, replace humans with yeast

Clarify annotations of "..."

Document and test the handling of empty sets

Document and test the ordering of sets

When a new element is added to a set via add_relation, it is first added to the end of the elements table and then added at the end of the relations table too.

The add_column function adds columns to the existing table in the given order. It can be dangerous if the expected order is not met, but I don't think it will be used much, as there are better alternatives to add a column and set some values on just some rows, such as using mutate(TidySet, column = case_when()).

I do not know enough about ordered sets to know whether I should provide specific support for them, or how they are usually used.

Fuzzy set math explanation

In this case both sets B and C have a single element, and both of them contain the element f. So taking the higher bound of the common values makes the probability of f equal to the one it has in C. Note that the union does not calculate the probability that f is in C or B, but the probability of the common elements of C and B. To know the probability of f being in C or B we need to use:

element_size(fuzzy_set, "f")
#   elements size probability
# 6        f    0   0.1945921
# 7        f    1   0.6027532
# 8        f    2   0.2026547

Here we can see that there is a 0.6 probability that f is in exactly one of them. (If f were in other sets we would need to first filter to only those: fuzzy_set %>% filter(sets %in% c("B", "C")) %>% element_size(element = "f"); in this case the result is the same.)
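For readers following along, those three values can be reproduced by hand when the two fuzzy values of f are treated as independent membership probabilities (plain R, matching the element_size output up to rounding):

p_B <- 0.2773313  # fuzzy value of f in B
p_C <- 0.7307312  # fuzzy value of f in C

c(size_0 = (1 - p_B) * (1 - p_C),               # f in neither set
  size_1 = p_B * (1 - p_C) + (1 - p_B) * p_C,   # f in exactly one set
  size_2 = p_B * p_C)                           # f in both sets
#>    size_0    size_1    size_2
#> 0.1945921 0.6027532 0.2026546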

The standard operator definitions for fuzzy sets are set via the union argument fun, which defaults to "max". However, the definition of this common operation is then not the same as implemented, since you define it as the "lower bound of the probability of membership in the larger set". In the case where both sets have the same number of elements involved, which one should be picked? This could be a corner case, but as you explained, corner cases must be well covered and documented. In addition, how is the union the "lower bound of the probability"? Wouldn't it be the higher bound of the probability?

Following your definition, what would be the expected result of the union of B and C in the following example? With the previous seed, in this example B has more elements than C but a lower truth value(?):

relations <- data.frame(sets = c(rep("A", 5), "B", "B", "C"),
                        elements = c(letters[seq_len(7)], letters[6]),
                        fuzzy = runif(8))
fuzzy_set <- tidySet(relations)
fuzzy_set

I'm not sure I understand the assumptions of the union operator. But the purpose of this package is to test whether fuzzy methods are helpful for set and pathway tests. I do not know beforehand whether fuzzy logic can explain any system better than other methodologies.

The theoretical background of fuzzy logic is broad and beyond my current comprehension. However, on the linked page there is an example (section 9) of how such systems arise that is pertinent:

Voting semantics is based on the idea that different agents (voters) may coherently judge the same proposition differently. The proportion of agents that accept a proposition φ as true may be seen as a truth value. Without further restrictions this does not lead to a truth functional semantics, but rather to an assignment of probabilities to propositions. But if one assigns a fixed level of skepticism to each agent and imposes some natural conditions that keep the judgments on logically complex statements consistent with those levels, then one can recover min, max, and 1−x as truth functions

In the case of pathways and genes there are several voters (scientists or algorithms) that provide their truth value for the association of a gene with a pathway. This is usually expressed as probabilities ("I'm 80% sure that gene X is in pathway Y") or with evidence codes in the case of ontologies like the Gene Ontology (e.g. IEA). This is why I don't make a distinction between vagueness and the probability of an association.

I understand that this logic has had little exposure in the bioinformatics, statistics and experimental fields. This lack of familiarity can slow the adoption of the package, but at the same time it is one of its strengths, as it provides a new method not much used previously in the field. There are other experimental methods to measure the degree to which a gene is active, mainly sequencing, but then the problem is distinguishing between tissue-specific and cell-type-specific expression. I built a webpage where I collected an example of the application of BaseSet to this kind of research: https://llrs.github.io/BaseSet_scRNAseq/AUCell.html

annakrystalli commented 4 years ago

Hello @llrs!

Great to see such responsiveness to @arendsee's thorough review. I think, however, that it's best to wait until both reviews are in, as @j23414 will otherwise end up reviewing a moving target. It will likely also be easier for you to address the comments from both reviews in one go. 👍

Apologies, I should have caught that when you first mentioned it.

llrs commented 4 years ago

Hi @annakrystalli, the changes are already pushed to master. I could create a new branch and move all the commits there, but I hope it doesn't cause problems.

If you, @j23414, are using pkgreviewr, I think you'll be able to review the package at the same commit that @arendsee reviewed:

pkgreviewr::pkgreview_create(pkg_repo = "llrs/BaseSet@7d0f6da", ...)

Otherwise let me know and I'll fix this on my end.

j23414 commented 4 years ago

Hi @llrs, I'll continue my review with your changed version. I wrote a script on my end when I saw you had new commits (it automatically pulls the new version and runs any Rmd files), but thanks for the pkgreview_create command.

I'm aiming to post by Friday.

annakrystalli commented 4 years ago

Thanks for the comment @llrs, and cool pkgreviewr trick! 😃

Looks like @j23414 is sorted so all good, carry on everyone 🙂

arendsee commented 4 years ago

Thanks for the response @llrs . I'm happy with all the changes. As for fuzzy theory, I just want to be sure that the vignettes document what the operators are doing, especially where the fuzzy operators deviate from the more familiar probabilistic operators. Specifically, in probability theory, P(A or B) = P(A) + P(B) - P(A and B). In contrast, the fuzzy operator F(A or B) = max(F(A), F(B)). Let's wait for @j23414 to finish her review then we can talk about it.

llrs commented 4 years ago

Ok, let's wait. I'm not sure I understand your example; what are A and B in your example? Elements or sets?

arendsee commented 4 years ago

P(A or B) is the probability that A or B is true. In this context, it is the probability that an element is in set A or set B.

j23414 commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Final approval (post-review)

Estimated hours spent reviewing:


Review Comments

BaseSet mostly passes my automated checks. Just some minor edits in the vignettes and the examples. (Expand sections above.) I was getting some strange results with the set_size method (see the Functionality section).

Creating sets that are Tidyverse-friendly makes sense. The classical-set operators seem clear enough and match what I've seen in the past. The advanced vignette is mainly focused on classical sets.

Fuzzy-set operators

For fuzzy-sets, alternative definitions for union/intersect exist. The probability function mentioned by @arendsee is also mentioned in Fuzzy Set Theory (Smithson, 2006) (link to chapter, see page 10, below Table 2.3) as a product operator, which is used extensively in probability theory (statistics). Documenting the fuzzy-set operators as min/max/etc with a link to a paper (Zadeh?) should be sufficient to avoid confusion.

The sets package (see ?sets::fuzzy_logic help file) lets the user choose Zadeh (min/max) or product forms of fuzzy logic.
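If I read the {sets} help page correctly, switching between these families is a one-liner; shown here only to illustrate the choice mentioned above (treat the family names as an assumption to check against ?sets::fuzzy_logic):

library(sets)
fuzzy_logic("Zadeh")     # min/max operators (the default)
fuzzy_logic("product")   # product/probabilistic-sum operators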

Just a side-note (not a change request): Instead of "probability", fuzzy-set papers seem to use the term membership or membership value. The Fuzzy vignette example was helpful, classifying cells to different cell types along with a fuzzy score, which sounds like a "confidence score" or "confidence level".

Other vignette comments

The vignette/advanced.Rmd mainly focused on classical sets (not fuzzy), where elements=genes and sets=GO terms or pathways. While I appreciated seeing a larger dataset, I could have used more discussion on why the BaseSet-generated results were important. In figure one, gene ontology is split into BP, CC, and MF. Spelling out their names as "biological process", "cellular component", and "molecular function" helps readability. A key takeaway from that first figure could be that BP "biological process" terms tend to be assigned using IBA evidence, which stands for "inferred from biological aspect of ancestor". After going through basic.Rmd and fuzzy.Rmd, I was expecting to see different evidence codes assigned different fuzzy values (a measure of confidence), and the maximal value used to assign a gene to a particular pathway. You don't need to add the fuzzy values (just my response to the reading). Counting up the number of genes per pathway or per gene ontology term was fine. The Basic and Fuzzy vignettes were fine.

My R Session Info

sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#>
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets
#> [8] methods   base
#>
#> other attached packages:
#>  [1] GO.db_3.11.4         org.Hs.eg.db_3.11.4  AnnotationDbi_1.50.0
#>  [4] IRanges_2.22.2       S4Vectors_0.26.1     Biobase_2.48.0
#>  [7] BiocGenerics_0.34.0  BaseSet_0.0.11       magrittr_1.5
#> [10] forcats_0.5.0        stringr_1.4.0        dplyr_1.0.0
#> [13] purrr_0.3.4          readr_1.3.1          tidyr_1.1.0
#> [16] tibble_3.0.1         ggplot2_3.3.2        tidyverse_1.3.0
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6     lubridate_1.7.9  lattice_0.20-41  assertthat_0.2.1
#>  [5] digest_0.6.25    R6_2.4.1         cellranger_1.1.0 backports_1.1.8
#>  [9] reprex_0.3.0     RSQLite_2.2.0    evaluate_0.14    httr_1.4.1
#> [13] pillar_1.4.4     rlang_0.4.6      readxl_1.3.1     rstudioapi_0.11
#> [17] blob_1.2.1       rmarkdown_2.3    bit_1.1-15.2     munsell_0.5.0
#> [21] broom_0.5.6      compiler_4.0.0   modelr_0.1.8     xfun_0.15
#> [25] pkgconfig_2.0.3  htmltools_0.5.0  tidyselect_1.1.0 fansi_0.4.1
#> [29] crayon_1.3.4     dbplyr_1.4.4     withr_2.2.0      grid_4.0.0
#> [33] nlme_3.1-148     jsonlite_1.6.1   gtable_0.3.0     lifecycle_0.2.0
#> [37] DBI_1.1.0        scales_1.1.1     cli_2.0.2        stringi_1.4.6
#> [41] fs_1.4.1         xml2_1.3.2       ellipsis_0.3.1   generics_0.0.2
#> [45] vctrs_0.3.1      tools_4.0.0      bit64_0.9-7      glue_1.4.1
#> [49] hms_0.5.3        yaml_2.2.1       colorspace_1.4-1 rvest_0.3.5
#> [53] memoise_1.1.0    knitr_1.29       haven_2.3.1
llrs commented 4 years ago

Thanks @j23414 for your detailed review, and apologies for modifying the package while you were trying to review it. Some of the modifications I made seem to be the cause of some of the errors you saw. I'll think more carefully about them and fix them. Many thanks for pointing out unclear explanations, and for your comments on the fuzzy-set operators and the vignettes.

j23414 commented 4 years ago

@llrs about modifying, no worries. : ) You were coding quickly

arendsee commented 4 years ago

The following statement in the README is not right: "On fuzzy sets, elements have a probability to belong to a set".

In a probabilistic framework, every element is in a set or not in it. We are uncertain which sets an element is in, but the reality is still binary. In a fuzzy framework, the reality is vague. For example, say we are interested in which genes are highly expressed in a tissue. We could transform RNA-seq data for expression levels to a 0-1 scale, where 1 indicates the gene is very highly expressed, 0.6 is fairly highly expressed, 0.3 is weakly expressed, etc. The reality here is not binary.

If you use fuzzy methods on probabilistic data, as you do in the README and the fuzzy vignette, the results will be wrong except in very special cases. The probability that a particular element is in A or B is P(A) + P(B) - P(A and B). This is a law of probability. Application of the fuzzy operator to probabilistic terms would imply that P(A or B) = max(P(A), P(B)), which is correct only when P(B) = P(A and B) or P(A) = P(A and B). In other words, it is true only when one set contains the other set.
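A quick numeric check of that nesting condition, in plain R with arbitrary values:

p_A <- 0.3
p_B <- 0.6

p_A + p_B - p_A * p_B        # independent sets: 0.72, larger than max(p_A, p_B)
p_A + p_B - min(p_A, p_B)    # one set contains the other, P(A and B) = min(P(A), P(B)): 0.6 = max(p_A, p_B)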

As @j23414 mentioned, the probability function (assuming independence) can be defined to work over fuzzy values (Smithson, 2006). Take the statement, "I am tall OR strong". Now say my tall stat is 0.3 and my strong stat is 0.6. The truth value of this statement is 0.6, using the max operator. This intuitively seems reasonable. Similarly, it makes intuitive sense that the statement, "I am tall AND strong" has a truth value of 0.3. We can do similar calculations using the probability function where, assuming independence, we get 0.6 + 0.3 - 0.3 * 0.6 = 0.72. This is a different truth value because we've redefined the fuzzy space by using a different operator. This different definition of fuzziness has some nice properties, so we may choose to use it for certain applications.

However, we don't get to define our own laws of probability. If we are given probabilistic data, then we have to use probabilistic operators. We can't freely change the data from membership probabilities to fuzzy truth values; these mean different things.

These distinctions need to be VERY clear in the README and vignette. If they aren't clear, a careless user could easily assume the operators are probabilistic and get completely wrong results.

annakrystalli commented 4 years ago

Thank you @arendsee and @j23414 for your super timely and thorough reviews! Over to you now @llrs.

llrs commented 4 years ago

@arendsee Perhaps that sentence on the README is not suited for some logics or types of fuzzy sets. I am struggling to introduce the topic for people not familiar with fuzzy sets and at the same time support and explain what the package can do to more expert users. I'll carefully review the wording around the subject.

I agree that working with membership functions (gene A at 0.3 RNA levels means low, at 0.6 means highly expressed) is different from working with probabilities (gene A is highly expressed in 60% of cases, whatever its value is). However, assigning a set from a measure is, nowadays, out of the scope of the package; this is either provided by the source of the data or something the user should do.

About probabilities of A or B, it is still unclear to me over what you calculate that probability, or when you would want to do this. When I post a longer reply I'll provide some examples and questions about this, as I tend to understand something better with a practical question or exercise. In your tall OR strong example, you don't use your previous definition that "the probability that an element is in set A or set B". What other elements are present in the height and strength sets (and what are their probabilities)? And why do you use the height and strength stats as a probability or truth value? They are different things; that's why membership functions are used. Or are they already the fuzzy values?

While users should not use different logics on the same data the package aims to allow them to use whatever logic they want according to the knowledge they have about the data. But I can't guide them or provide feedback.

Fuzzy set theory extends the probability framework, as explained in a source @j23414 provided:

... fuzzy researchers have gone to great pains to distance themselves from probability. But in so doing, many of them have lost track of another point, which is that the converse DOES hold: all probability distributions are fuzzy sets! As fuzzy sets and logic generalize Boolean sets and logic, they also generalize probability.


Hope to correct the points raised by @j23414 and to post about the set theory and fuzzy-set operators that both reviewers had questions about by the end of the week.

j23414 commented 4 years ago

Hmm, that quoted source is new to me (Bezdek, 1993)... I provided links to the set theory chapter (Smithson, 2006), the membership value definition, and the confidence value definition.

No matter : ) thanks for sharing the source! There definitely seems to be some murkiness about when and where it is appropriate to use fuzzy sets.

For @arendsee's tall OR strong example, he seems to be illustrating the different calculation and final value in:

Case 1) if tall (0.3) and strong (0.6) are treated as fuzzy truth values: then tall OR strong = max (tall, strong) = max(0.3, 0.6) = 0.6

Case 2) if tall (0.3) and strong (0.6) are treated as probabilities (and assume independence): ** then tall OR strong = P(tall) + P(strong) - P(tall and strong) = 0.3 + 0.6 - (0.3 x 0.6) = 0.72

**You can also think of this as a tall-ness gene where you are "more likely to be tall, but not guaranteed to be tall"

The union of probabilities (Case 2) is usually explained and illustrated as a venn diagram (link with explanation here).

arendsee commented 4 years ago

@arendsee Perhaps that sentence on the README is not suited for some logics or types of fuzzy sets. I am struggling to introduce the topic for people not familiar with fuzzy sets and at the same time support and explain what the package can do to more expert users. I'll carefully review the wording around the subject.

The first sentence in the README does not apply to the specific logic you are using in the package. That operator you are using will not return membership probabilities. Using that operator to infer union membership probability will give you a meaningless result.

About probabilities of A or B, it is still unclear to me over what do you calculate that probability or when would you like to do this.

I'm referring to the union operator as it is used in your fuzzy vignette:

set.seed(4567) # To be able to have exact replicates
relations <- data.frame(sets = c(rep("A", 5), "B", "C"),
                          elements = c(letters[seq_len(6)], letters[6]),
                          fuzzy = runif(7))
fuzzy_set <- tidySet(relations)
> fuzzy_set
  elements sets     fuzzy
1        a    A 0.2309186
2        b    A 0.7412554
3        c    A 0.1833834
4        d    A 0.4221682
5        e    A 0.5996399
6        f    B 0.2773313
7        f    C 0.7307312
> BaseSet::union(fuzzy_set, sets = c("C", "B"))
  elements sets     fuzzy
1        f  C∪B 0.7307312

The union function is intended to calculate the probability that f is in C or B.

P(f in B) = 0.277
P(f in C) = 0.731

Now, from the laws of probability:

P(f in (B or C)) = 0.277 + 0.731 - P(f in (B and C))

If we assume independence (which is VERY dicey), then P(f in (B or C)) = 0.277 + 0.731 - 0.277 * 0.731

Your union function calculates the wrong probability.

If you want to use the current logic, then you need to strongly state that the fuzzy values are NOT probabilities.

For @arendsee's tall OR strong example, he seems to be illustrating the different calculation and final value in:

Case 1) if tall (0.3) and strong (0.6) are treated as fuzzy truth values: then tall OR strong = max (tall, strong) = max(0.3, 0.6) = 0.6

Case 2) if tall (0.3) and strong (0.6) are treated as probabilities (and assume independence): ** then tall OR strong = P(tall) + P(strong) - P(tall and strong) = 0.3 + 0.6 - (0.3 x 0.6) = 0.72

@j23414 I think I didn't explain my point here clearly. The tall or strong example was purely fuzzy. The data is fuzzy and probabilistic methods do not apply. I was describing two options for operators over fuzzy data. The first is the "standard" fuzzy operator that is currently used in BaseSet. The second is another fuzzy operator that happens to be the same as the probability of independent events, but the interpretation is different. Specifically, 0.72 is not the probability that the statement, "I am tall or strong", is true, rather it is a measure of how true the statement is. These two cases show two artificial metrics for fuzziness.

My point was that there are different fuzzy union operators. One of the fuzzy operators looks like the probabilistic operator for independent events. But just because this one operator can be used for set membership (assuming independence), does not mean any fuzzy operator can be used. There are many fuzzy metrics, but only one correct probability.

We don't get to choose whether data we are given is probabilities or fuzzy truth values.

While users should not use different logics on the same data the package aims to allow them to use whatever logic they want according to the knowledge they have about the data. But I can't guide them or provide feedback.

The purpose of the README and vignettes is to guide them. But your examples, both in the README and vignettes, incorrectly use fuzzy operators to infer the probabilities of memberships in a set union.

The fuzzy set theory extends the probability framework as explained on a source @j23414 provided:

... fuzzy researchers have gone to great pains to distance themselves from probability. But in so doing, many of them have lost track of another point, which is that the converse DOES hold: all probability distributions are fuzzy sets! As fuzzy sets and logic generalize Boolean sets and logic, they also generalize probability.

Yes, fuzzy set theory is MORE general than probability. This means that probability is a special case of fuzzy logic. Thus a specific logic within fuzzy set theory applies to probabilistic systems. This does not mean that probability theory equals fuzzy theory. And it emphatically does not mean that any fuzzy logic applies to probability. Specifically, the fuzzy operator you are using for unions is not the same as the probability operator.

In probability theory: P(A or B) = P(A) + P(B) - P(A and B)
In standard fuzzy theory: F(A or B) = max(F(A), F(B))

P(A or B) is not equal to F(A or B), therefore you simply can't use that operator to calculate the probability of membership in a set union.

Maybe we should invite a statistician and/or a fuzzy set theorist to comment.

annakrystalli commented 4 years ago

Hello all. It looks like some guidance from a statistician might indeed be a good idea. This is technically not part of package review but I'll see what other editors think and let you know what we can do.

llrs commented 4 years ago

Review 2:

I'll only reply to comments or points that weren't checked or there were some substantial comments:

ecoli_sets %>%
  mutate(fuzzy = case_when(sets == "GL" ~ 0.2,    # Add fuzzy values
                           sets == "CF" ~ 0.8)) %>%
  set_size()

Yes, GL has a 0.04 probability of having two genes, so the glycolysis pathway probably doesn't have any genes. That is, if we assume that the fuzzy values are probabilities; if they were something else it wouldn't be accurate. I think this is different from cardinality, as cardinality is just a single number per set, while here I return several values for a single set. I added to the documentation how this is calculated (via length_set).

I provided a new method, cardinality, to calculate it and to allow providing a fuzzy logic function. The {sets} package mentioned previously defines cardinality as the number of elements in a set regardless of the fuzzy value. In this case the default is sum, as you found, but length (as in {sets}) or other functions can be provided.
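
As a toy sketch of the difference (not package code; the element names and memberships are made up), for three elements of one set:

memberships <- c(gene1 = 0.2, gene2 = 0.2, gene3 = 0.8)  # hypothetical fuzzy values
sum(memberships)     # default "sum" cardinality (sigma-count): 1.2
length(memberships)  # crisp count, as {sets} does: 3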

Changed the automatic names to the full names, and added some links and references about the data and how to interpret it. I tried to use some fuzzy values in the advanced vignette, but it would take too long to calculate for many sets and would make the package fail the checks.


Fuzzy logic & explanation

Many thanks for all the feedback on this; I realized that I assumed too many things. I modified or added explanations about the fuzzy values throughout the package.

I modified the README section about fuzzy sets; it now reads: "Fuzzy sets are similar to classical sets but there is some vagueness on the relationship between the element and the set.".

Similarly, I've added a new section to the README about why this package was developed and why it could be useful to others. I also introduced the topic in the fuzzy vignette, explaining what fuzzy sets are, with some links.

I do not want to force an interpretation of fuzzy theory, which sometimes translates into using different semantics: membership, probabilities, truth values, and others. I think that the fuzzy column name is the most neutral and best understood. There seems to be a movement in fuzzy research to differentiate it from probabilities, but at the same time to extend them (probabilities).

One can use any logic according to the specific framework one wishes to work in. This is encouraged throughout the package by allowing any logic to be used. To show the flexibility of the package, I've added some examples using the same data with a different logic, explaining when that logic could be applied. I hope it is now clear that the package does not force the use of a certain logic.

The default parameters of the package are meant to provide sensible results with fuzzy sets without giving surprises on classical sets. For the specific case of union I added an example in the documentation that I hope makes it clear how to use other logics beyond the default, for instance for probabilities: union(TS, sets = c("A", "B"), FUN = function(x){y <- sum(x)-prod(x); ifelse(length(x)==1, x, y)}).
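
For readability, the same call spread over several lines (TS is the TidySet object from the documentation example):

# Probabilistic union of sets A and B, assuming independence;
# when only one membership value is present, return it unchanged.
union(TS, sets = c("A", "B"),
      FUN = function(x) {
        y <- sum(x) - prod(x)
        ifelse(length(x) == 1, x, y)
      })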

Given all the discussion around how to interpret fuzzy sets, I would appreciate your thoughts on when fuzzy sets arise in practical cases. How do you imagine a case where multiple observations might or might not be related to a set? For instance, imagine I want to record some cars at an intersection; the first blue car goes right, the second blue car goes right, the third blue car goes left. Would you record this as a fuzzy/probability set (the sets here being right/left) where a blue car has a 0.66 relationship with going right and 0.33 with going left? Or as a classical set with all three cars, two of them in the right set and one in the left set? If it is a fuzzy set, then which logic would you apply?

arendsee commented 4 years ago

The term sum(x) - prod(x) is only equal to the probability of the union when x has length 2. Calculating the union probabilities is more complicated for more than 2 sets. For 3 sets it is P(A or B or C) = P(A) + P(B) + P(C) - P(A and B) - P(B and C) - P(C and A) + P(A and B and C). I recommend adding test cases for sets of 2, 3 and 4 (at least).

Think about the case of three sets with probability of membership for each being 1. The probability of the union would also be 1. But sum(x) - prod(x) == 3 - 1 == 2, which of course isn't a valid probability. In general, the union of any vector of numbers between 0 and 1 should also be between 0 and 1 (you could write this into a unit test).
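
A generic way to get the union probability for any number of independent events is through the complement of "none of them happen"; a minimal base-R sketch (not part of BaseSet), which stays within [0, 1] by construction:

# P(at least one of several independent events) = 1 - P(none of them)
union_prob_indep <- function(p) 1 - prod(1 - p)

union_prob_indep(c(0.277, 0.731))  # ~0.805, same as p1 + p2 - p1 * p2
union_prob_indep(c(1, 1, 1))       # 1, whereas sum(x) - prod(x) would give 2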

I don't think it is necessary for initial acceptance into rOpenSci, but you should eventually consider how to relax the independence assumption. I'd definitely recommend talking to a statistician (statisticians are great).

I think fuzzy sets have lots of applications within bioinformatics. For example, someone might want to know whether metabolite A and metabolite B both have high concentrations in a given tissue. You could scale the molar concentrations to values between 0 and 1 where 0.5 is the "normal" value.
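
For example, something along these lines (just a sketch; the function name, the logistic mapping and the scale are arbitrary choices):

# Hypothetical scaling: logistic transform of the log fold-change vs. the "normal" concentration
to_membership <- function(conc, normal, scale = 1) {
  plogis(log(conc / normal) / scale)
}
to_membership(c(0.25, 1, 4), normal = 1)  # below normal (~0.2), normal (0.5), above normal (~0.8)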

llrs commented 4 years ago

@arendsee I added a helper function union_probability to calculate the union probabilities for any number of sets (only tested for 3 and 4 though). However, I feel that this is out of the scope of the package.

The validation of the TidySet class includes a check that fuzzy values are between 0 and 1 (both included). There is no need to add a unit test, as it is automatically checked by each exported method (via validObject).

As previously said, I don't assume independence or dependence of sets in the package; the package doesn't even assume that the fuzzy values are probabilities! This is and will be the responsibility of the user, mainly because from the stored data the package can't make any educated guesses about the type of relationships between sets. I'll ask a statistician about relaxing independence between sets.

I agree that they have lots of applications within bioinformatics; that's why I wrote the package: to make them easier to use. Thank you for all the feedback to make it even easier to use.

What do you think of the questions on the last paragraph on my previous comment?

llrs commented 3 years ago

Hi @annakrystalli, @j23414, @arendsee. Is there anything else I can do to proceed forward?

annakrystalli commented 3 years ago

Hi @llrs I believe you had requested a response to one of your questions by @arendsee ?

If both @arendsee and @j23414 are happy with your responses to their reviews, I am happy to conclude this review.

llrs commented 3 years ago

Yes, during the review I asked several questions that I haven't gotten an answer to yet.

Also, I don't know what was decided about reaching out to a statistician (from above):

Hello all. It looks like some guidance from a statistician might indeed be a good idea. This is technically not part of package review but I'll see what other editors think and let you know what we can do.