ropensci / rsnps

Wrapper to a number of SNP web APIs
https://docs.ropensci.org/rsnps
Other
52 stars 22 forks source link

Unit test fails for allphenotypes() #72

Closed sinarueeger closed 4 years ago

sinarueeger commented 5 years ago

Last test fails for tests/testthat/test-allphenotypes.R (this happens on the master branch)

why does it fail? This is the test function:

expect_that(as.character(allphenotypes()[["ADHD"]][7,3]),
                            equals("Mthfr c677t"))

For some reason, the allphenotypes dataframe has shifted by one row:

allphenotypes()[["ADHD"]][7:8, 3]
## [[1]] ## this is [7]
## [1] "Diagnosed as not having but with some signs"

## [[2]] ## this is [8]
## [1] "Mthfr c677t"

devtools::check error

  Running ‘test-all.R’ [4s/72s]
Running the tests in ‘tests/test-all.R’ failed.
Last 13 lines of output:
  package 'testthat' was built under R version 3.5.2 
  > test_check("rsnps")
  Loading required package: rsnps
  ── 1. Failure: allphenotypes returns the correct value (@test-allpheno
  `x` not equal to `expected`.
  1/1 mismatches
  x[1]: "Diagnosed as not having but with some signs"
  y[1]: "Mthfr c677t"

  ══ testthat results  ═════════════════════════════════════════════════
  OK: 71 SKIPPED: 0 WARNINGS: 0 FAILED: 1
  1. Failure: allphenotypes returns the correct value (@test-allphenotypes.R#19) 

  Error: testthat unit tests failed

Solution

Find out why the order of the dataframe has changed. For example, if there has been a new entry to opensnp.

sinarueeger commented 5 years ago

Research on travis:

jooolia commented 5 years ago

Hi @sinarueeger, I will investigate. There seems to be something slightly bizarre going as I see with the commit history the builds fail fairly frequently. The test now fails in master.

jooolia commented 5 years ago

So is it a case that since the row of the dataframe is hardcoded, if there are new "known_variations" this test may fail? So the quick option would be to update the row as you alluded to above.

I have no idea how frequently these data are updated. If they are frequently updated perhaps we could do more something like: expect_true("Mthfr c677t" %in% allphenotypes()[["ADHD"]][,3]) What issues could there be with this? What are your thoughts?

sinarueeger commented 5 years ago

I tried to investigate this. allphenotypes()[["ADHD"]] provides all possible variations for ADHD. So I tried to figure out if any of the known variations in row 1 to 7 were added recently by new users. But I don't think so.

library(dplyr)
## 0. get the dataframe tested in tes-allphenotypes.R
tmp <- allphenotypes()[["ADHD"]]

## we are only interested in line 1:7 (8th line is "Mthfr c677t", the one we want to test)
tmp %<>% dplyr::rename(variation = known_variations) %>% mutate_if(is.list, unlist)

## 1. figure out, which of these known_variations are rare 
## (and maybe got added recently and messed up order)
adhd_id <- phenotypes_byid(phenotypeid=29, return_ = 'users')
adhd_variation <- adhd_id %>% dplyr::group_by(variation) %>% dplyr::tally() %>% ungroup()

tmp %>% left_join(adhd_variation)
## "Diagnosed as not having but with some signs" is the only one before "Mthfr.."
## that has low count. 

## 2. Investigate users with "Diagnosed as not having but with some signs"

adhd_sel <- adhd_id %>% filter(variation == "Diagnosed as not having but with some signs")

## takes long!!!
## and from these files we don't know WHEN the phenotypes were uplaoded
if (false) files <- purrr::map(adhd_sel$user_id, ~ download_users(id = .x))

## we can guess form the user ids that the user ids > 8000 recently joined. 
## see user ids here: https://opensnp.org/updates

## not really needed, but if we want to know the names of the three users
## adhd_sel$id, here they are: 
data <- users(df = TRUE)
data[[1]] %>% filter(id %in% adhd_sel$user_id)
data[[2]] %>% filter(id %in% adhd_sel$user_id)

To understand how opensnp designs the allphenotypes table we could figure out the mechanism by trial and error (essentially signing up and afterwards deleting our account), but that seems cumbersome to me (we could also contact them by email).

As a tests we could indeed check if any of the known variations are present (however, this can change if a user deletes their file). expect_true("Mthfr c677t" %in% allphenotypes()[["ADHD"]][,3])

We could also test a set of possible values:

test_that("allphenotypes returns set of common known_variations (common = in more than 5 individuals)", {

    expect_true(all(c("False", "True", "Undiagnosed, but probably true", "No", "Yes", 
                    "Not diagnosed", "Mthfr c677t") %in% as.character(allphenotypes()[["ADHD"]][,3]))) 

})

(I am sure this can be done more elegantly...) What do you think is better @jooolia?

Will make a PR with this new test, but we can also replace it with the test you proposed.

jooolia commented 4 years ago

Fixed by in PR #74