ropensci / software-review

rOpenSci Software Peer Review.

presubmission: BaseSet #339

Closed llrs closed 5 years ago

llrs commented 5 years ago

Submitting Author: Lluís Revilla Sancho (@llrs)
Repository: llrs/BaseSet


Package: BaseSet
Title: Provides classes for working with sets
Version: 0.0.0.9003
Authors@R: 
    person(given = "Lluís ",
           family = "Revilla Sancho",
           role = c("aut", "cre"),
           email = "lluis.revilla@gmail.com")
Description: A set collection, while not "tidy" in itself, can
    be thought of as three tidy data frames describing sets, elements and
    relations respectively. 'BaseSet' provides an approach to manipulate,
    load and use these virtual data frames.
License: MIT + file LICENSE
URL: https://github.com/llrs/BaseSet
BugReports: https://github.com/llrs/BaseSet/issues
Depends: 
    R (>= 3.5.0)
Imports: 
    dplyr (>= 0.7.8),
    magrittr,
    methods,
    rlang,
    XML
Suggests: 
    BiocStyle,
    covr,
    forcats,
    ggplot2,
    GO.db,
    GSEABase,
    knitr,
    org.Hs.eg.db,
    reactome.db,
    rmarkdown,
    spelling,
    testthat (>= 2.1.0),
    Biobase
VignetteBuilder: 
    knitr
Encoding: UTF-8
Language: en-US
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 6.1.1
Collate: 
    'AllClasses.R'
    'AllGenerics.R'
    'GMT.R'
    'GeneSetCollection.R'
    'activate.R'
    'add.R'
    'add_column.R'
    'add_relation.R'
    'adjacency.R'
    'arrange.R'
    'basesets-package.R'
    'cartesian.R'
    'complement.R'
    'data_frame.R'
    'deactivate.R'
    'droplevels.R'
    'elements.R'
    'filter.R'
    'fuzzification.R'
    'group.R'
    'group_by.R'
    'head.R'
    'incidence.R'
    'independent.R'
    'operations.R'
    'intersection.R'
    'length.R'
    'list.R'
    'move_to.R'
    'mutate.R'
    'names.R'
    'naming.R'
    'nested.R'
    'obo.R'
    'power_set.R'
    'print.R'
    'pull.R'
    'relations.R'
    'remove.R'
    'remove_column.R'
    'rename.R'
    'select.R'
    'set.R'
    'size.R'
    'subtract.R'
    'tidy-set.R'
    'union.R'
    'utils-pipe.R'
    'xml.R'
    'zzz.R'

Scope

The package implements methods to work on sets, performing intersection, union, complement and other set operations in a "tidy" way. It also allows importing from several formats used in the life sciences, such as GMT, GAF, or the OBO file format for ontologies.
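To illustrate what importing such a format into a "tidy" layout means, here is a minimal sketch in Python. It is not BaseSet code and does not use its API; the function name `read_gmt` and the example data are hypothetical. It only assumes the usual GMT layout: one set per line, tab-separated, with the set name first, a description second, and the elements after that.

```python
import io


def read_gmt(handle):
    """Parse GMT lines into long-format (set, element) rows.

    Hypothetical helper for illustration only; not part of BaseSet.
    """
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines or lines without any element
        set_name, _description, *elements = fields
        # One row per (set, element) relation; ignore empty trailing fields.
        rows.extend((set_name, element) for element in elements if element)
    return rows


# Made-up example content standing in for a real GMT file.
example = io.StringIO("pathwayA\tna\tTP53\tBRCA1\npathwayB\tna\tTP53\n")
relations = read_gmt(example)
```

Each resulting row is one set-element relation, which is the long shape that set operations can then work on.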

I am not sure whether this can be considered data munging, since the package delegates methods such as filter, mutate, select, ... to dplyr. It does, however, deal with how the data is organized and how to keep working with it.

The idea is to use the package for working with sets and signatures of genes in scRNAseq, or in pathways and ontologies, but it might also be useful in other fields.

There is the sets package, which implements a more general approach: it can store functions or lists as elements of a set (mine only allows characters or factors), but it is harder to operate on in a tidy/long way. Its intersection and union operations also need to happen between two different objects, while a TidySet object (the class implemented in BaseSet) can store a single set or thousands of them. In BaseSet it is easier to operate on sets and to implement new fuzzy logic operations. It is also developed openly on GitHub, whereas I couldn't track how sets is being developed.
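As a rough illustration of keeping many sets in one long table and applying fuzzy operations to it, here is a small Python sketch. It is not the BaseSet implementation: the function names and data are hypothetical, and it assumes the standard fuzzy logic choice of maximum for union and minimum for intersection, which may differ from the package's defaults.

```python
# Long-format relations: (set, element, membership in [0, 1]).
# Several sets live in the same table. Data is made up for illustration.
relations = [
    ("A", "gene1", 1.0),
    ("A", "gene2", 0.4),
    ("B", "gene2", 0.7),
    ("B", "gene3", 0.2),
]


def memberships(relations, set_name):
    """Return {element: membership} for one set of the table."""
    return {e: m for s, e, m in relations if s == set_name}


def fuzzy_union(relations, x, y):
    """Union with the maximum rule (assumed here, not BaseSet's API)."""
    a, b = memberships(relations, x), memberships(relations, y)
    return {e: max(a.get(e, 0.0), b.get(e, 0.0)) for e in a.keys() | b.keys()}


def fuzzy_intersection(relations, x, y):
    """Intersection with the minimum rule (assumed here, not BaseSet's API)."""
    a, b = memberships(relations, x), memberships(relations, y)
    return {e: min(a[e], b[e]) for e in a.keys() & b.keys()}


union_ab = fuzzy_union(relations, "A", "B")
intersection_ab = fuzzy_intersection(relations, "A", "B")
```

Because both sets are rows of the same table, adding another fuzzy operation only means swapping the rule that combines the two membership values.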

GSEABase partially implements this, but it cannot store fuzzy sets, and it is also quite slow because it creates several classes to annotate each set.

There is also the hierarchicalSets package, which focuses on clustering sets that are inside other sets and on visualizations. BaseSet, in contrast, focuses on storing and manipulating sets, including hierarchical sets.

It has been developed in dialogue with the Bioconductor team and community through Slack, and it was presented as a work in progress at the Bioc2019 conference along with some other implementations. Among those, the Bioconductor team is also developing another package that follows some of the principles that guided the development of this package, which were shared through a Google document.

There is also the unisets package, which explored the same ideas, also following that document.

Before I submit it officially (if it is on topic), I'd like to do some minor cleanup: remove some code that I think is not needed (parsing XML files), change the labels, improve some documentation (including adding the code of conduct, which I thought it already had), and check for bottlenecks in some functions.

melvidoni commented 5 years ago

Hello @llrs, and thanks for submitting your enquiry!

We discussed this with the editors, and found this to be an edge case in data munging, but we lean towards accepting. Quoting the policy: "This area does not include broad data manipulations tools such as reshape2 or tidyr, but rather tools for handling data in specific scientific formats". Considering this, where do you think it falls?

Please answer here, and keep this in mind when you submit your package to rOpenSci. Looking forward to a full submission.

llrs commented 5 years ago

I think it falls under the policy of "Packages for processing data from formats above". But there isn't a consensus on which format should be used to store this kind of data so that it can be parsed or retrieved, so I just implemented several ways of processing the data.

melvidoni commented 5 years ago

That is great @llrs, thanks for the clarification. Go ahead and make a full submission then!