ropensci / BaseSet

Provides classes for working with sets
https://docs.ropensci.org/BaseSet
Other
10 stars 3 forks source link

rOpenSci reviewer 2: `set_size` was having weird behavior #46

Closed llrs closed 4 years ago

llrs commented 4 years ago

Created a toy example of E. coli.

ecoli_genes <- paste("gene", c(1:4401), sep = ":") # 4401 E. coli genes
ecoli_sets <-
list(GL = ecoli_genes[1:4000],     # Glycolysis
     CF = ecoli_genes[3000:4401],  # Carbon fixation
     AF = ecoli_genes[c(1:2000) * 2]) %>% # Ability to fly
tidySet(.)
head(ecoli_sets, n = 3)
#>   elements sets fuzzy
#> 1   gene:1   GL     1
#> 2   gene:2   GL     1
#> 3   gene:3   GL     1

Check union/intersect/negation functions - worked as expected

ecoli_sets %>%
union(., set = c("GL", "CF")) %>%
#  intersection(., set = c("GL", "CL", "AF")) %>%    # could be called intersect
#  subtract(., set_in = "GL", not_in = "AF") %>%
head(., n = 3)
#>   elements  sets fuzzy
#> 1   gene:1 GL∪CF     1
#> 2   gene:2 GL∪CF     1
#> 3   gene:3 GL∪CF     1

(Optional change) Could change parameter set = c("GL", "CL" to the alternative sets = ... and match the tidySet header elements, sets, fuzzy.

set_size(ecoli_sets)
#>   sets size probability
#> 1   GL 4000    1
#> 2   CF 1402    1
#> 3   AF 2000    1

Fuzzy sets - some weird behavior Noticed some weird behavior in set_size function.

# For the fuzzy test, use a smaller dataset
(ecoli_sets <- list(GL = ecoli_genes[1:2],
     CF = ecoli_genes[2:4]) %>%
tidySet(.))
#>   elements sets fuzzy
#> 1   gene:1   GL     1
#> 2   gene:2   GL     1
#> 3   gene:2   CF     1
#> 4   gene:3   CF     1
#> 5   gene:4   CF     1

set_size(ecoli_sets)  # Not fuzzy size
#>   sets size probability
#> 1   GL    2    1
#> 2   CF    3    1

(ecoli_sets %>%
      >  mutate(
   >    fuzzy = case_when(sets == "GL" ~ 0.2,    # Add fuzzy values
     > sets == "CF" ~ 0.8)
   >  ))
#>   elements sets fuzzy
#> 1   gene:1   GL   0.2
#> 2   gene:2   GL   0.2
#> 3   gene:2   CF   0.8
#> 4   gene:3   CF   0.8
#> 5   gene:4   CF   0.8

set_size(ecoli_sets)  # Fuzzy size
#>   sets size probability
#> 1   GL    2    1
#> 2   CF    3    1

Wait ... the fuzzy size shouldn't match the not-fuzzy size ... I'll create the same fuzzy set following the instructions in the README.md and check its size.

(ecoli_sets <- data.frame(sets = c(rep("GL", 2), rep("CL", 3)), 
  > elements = c(ecoli_genes[1:2], ecoli_genes[2:4]),
  > fuzzy = c(rep(0.2, 2), rep(0.8, 3))) %>%
  >tidySet(.))
#>   elements sets fuzzy
#> 1   gene:1   GL   0.2     # <=
#> 2   gene:2   GL   0.2     # <=
#> 3   gene:2   CL   0.8
#> 4   gene:3   CL   0.8
#> 5   gene:4   CL   0.8

set_size(ecoli_sets)   # Fuzzy size, again... this is what I expected
#>   sets size probability
#> 1   CL    00.008
#> 2   CL    10.096
#> 3   CL    20.384
#> 4   CL    30.512
#> 5   GL    00.640
#> 6   GL    10.320
#> 7   GL    20.040   # Note: 0.040 = 0.2 * 0.2 from above 

Any explanation appreciated. Looking at the table, the fuzzy values are being treated as probabilities. Last line of the table shows that GL (glycolysis) has 2 genes with a probability of 0.04 which is equal to 0.2 x 0.2. So the biological interpretation is that "gene:1" is classified as “Glycolysis” 0.2 (20%) of the time? And ditto for "gene:2"? Ergo the most likely outcome is that “Glycolysis” has 0 genes with a probability of 0.64 and “Carbon fixation” has 3 genes with probability of 0.512. Hmm, I'll think on the interpretation of that probability column, but seems to matches the example in the Fuzzy vignette. Searched for papers on fuzzy-set size/cardinality that used probabilities, but all searches redirected me to a sum of fuzzy values (0.2 + 0.2). Adding a description on how the probability column is calculated would be sufficient documentation to avoid confusion.

llrs commented 4 years ago

Your first try to modify the set and convert to a fuzzy set you'll need to assign the modified ecoli_set to be able to use the fuzzy values (or pipe it). The TidySet is of class S4, not S6, and the modifications are not in place.

ecoli_sets %>%
      mutate(
      fuzzy = case_when(sets == "GL" ~ 0.2,    # Add fuzzy values
      sets == "CF" ~ 0.8)) %>%
      set_size()

Yes, the GL has 0.04 probability to have two genes so the glycolysis pathway probably don't have any genes. I think this is different from cardinality, as cardinality it is just a number for a set, while here I return several values for a single set.

I'm not sure I follow on this cardinality. For example the {sets} package mentioned provides cardinality as the number of elements in a set regardless of the fuzzy value. However I will provide a new method to calculate cardinality and allow to provide a fuzzy logic function (lengths as in {sets} or sum as you found, or other functions).