Create subset for large data sources in the cdm

carmenOT commented 7 months ago

Hello team, as we talked about during the megastudy call this morning, I was wondering if you have any recommendations on how to create a random subset for a large data source to run the study. Additionally, I would like to know how you would ensure the random selection is reproducible in the future and if you could do all the steps directly at cdm connection level.

Thanks

tiozab commented 7 months ago

@ablack3 @edward-burn how did we create our 100k subset?

tiozab commented 7 months ago

Samples can be created in the CDM with cdmSample: https://darwin-eu.github.io/CDMConnector/reference/cdmSample.html

@ablack3 how about making this reproducible through a seed? @edward-burn

yl3613 commented 6 months ago

Hi @tiozab Thank you for the suggestion. I am collaborating with Carmen on this study, and I am able to use cdmSample to select a random sample of our data source. Are there any updates on setting the seed? The seed is randomly picked each time and I couldn't find a way to set it. I tried setting it before running cdmSample but it didn't work.

tiozab commented 6 months ago

@yl3613 the seed is only available as of CDMConnector version 1.4. As for how we do our 100k subset: we save those tables in a separate schema, so we only create them once. You can do the same, or create the subset tables anew every time before running a study (but that also takes time).

@carmenOT CDMConnector 1.4 required updated versions of other packages, which produced slightly different DUS outputs that I had to amend first. So, just for you (and potentially some others, let's see ;-)), here are an updated renv.lock and updated DUS code. The output will be compatible with the output from the others (I checked that :-)): DUS_1point4.zip. You can now go ahead and run the DUS (in a "seeded" subset :-)).

ablack3 commented 5 months ago

Hi @tiozab,

Here is an idea that should work with older versions of CDMConnector.

library(CDMConnector)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir("synthea-covid19-200k"))
cdm <- cdm_from_con(con, "main", "main")

all_person_ids <- cdm$person %>% pull(person_id) %>% sort()

set.seed(1)
sampled_person_ids <- sample(all_person_ids, 1000)

# resetting the seed and running sample should give the same result
set.seed(1)
sampled_person_ids2 <- sample(all_person_ids, 1000)

all.equal(sampled_person_ids, sampled_person_ids2)
#> [1] TRUE

# subset the cdm with the sampled person ids
cdm <- cdmSubset(cdm, personId = sampled_person_ids)

cdmDisconnect(cdm)

Created on 2024-06-10 with reprex v2.1.0
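
A note on the sort() call in the snippet above: a database does not guarantee row order, so pull(person_id) may return the IDs in a different order from one run to the next. sample() picks positions in the vector, so the same seed only selects the same people if the input is in the same order each time. Below is a minimal base-R sketch of that point, using made-up IDs in place of cdm$person; the draw_sample helper is hypothetical and not part of CDMConnector.

```r
# Hypothetical helper (not a CDMConnector function): sort first, then sample with a fixed seed.
draw_sample <- function(person_ids, n, seed) {
  ids <- sort(person_ids)  # fixes the order regardless of how the database returned the rows
  set.seed(seed)
  sample(ids, n)
}

# Made-up person IDs standing in for cdm$person %>% pull(person_id)
ids_run1 <- 1:5000
ids_run2 <- rev(ids_run1)  # same people, different row order

sample_run1 <- draw_sample(ids_run1, n = 100, seed = 1)
sample_run2 <- draw_sample(ids_run2, n = 100, seed = 1)
identical(sample_run1, sample_run2)
#> [1] TRUE

# Without sorting, the same seed picks the same positions but different people
set.seed(1); unsorted1 <- sample(ids_run1, 100)
set.seed(1); unsorted2 <- sample(ids_run2, 100)
identical(unsorted1, unsorted2)
#> [1] FALSE
```

The result of sample() for a given seed can also differ between R versions (the default sampling algorithm changed in R 3.6.0), so for long-term reproducibility it may be safer to store the sampled person IDs, or the subset tables themselves, rather than rely on the seed alone.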

steven-opc commented 5 months ago

I've been trying to run a cdm subset using the new DUS_1point4.zip, but when running renv::restore() I get a gcc error compiling the old 1.5.1 version of igraph due to libxml2 include errors. igraph seems to be pulled in by a dependency, so I couldn't exclude it in the restore or hydrate a newer version. igraph doesn't seem to be included in the original DUS renv lockfile.

I also tried the alternative method above for older versions of CDMConnector on the original DUS code + environment and got:

Generating cohort (1/1) - covid_19) [1h 10m 24.6s]
Error in validateGeneratedCohortSet() at omopgenerics/R/classCohortTable.R:53:3:
! 2440136 observations outside observation period.

Any ideas on how to progress on one of these two paths would be greatly appreciated.

tiozab commented 5 months ago

@steven-opc yes, thank you. DiagrammeR is a dependency of the new package CohortCharacteristics used in the new code. @edward-burn, who is the maintainer of CohortCharacteristics? DiagrammeR has igraph 1.5.1 as a dependency, but it throws errors with this data partner (SQL Server) when restoring the library. Any suggestions?

@steven-opc with regard to the option of using the original version and the cdm subset as suggested by @ablack3, how big is your random sample? 2.4 million observations outside the observation period is a lot of people.

I am just trying to figure out right now where to investigate further.

steven-opc commented 5 months ago

@tiozab thanks, I was trying with n=1000 so it seems likely that I'm doing something wrong here. Last line is where I get the error.

all_person_ids <- cdm$person %>% pull(person_id) %>% sort()
set.seed(1)
sampled_person_ids <- sample(all_person_ids, 1000)
cdm <- cdmSubset(cdm, personId = sampled_person_ids)

covid <- CDMConnector::readCohortSet(
  here::here("json_cohort"))

cdm <- CDMConnector::generateCohortSet(cdm,
                                       covid,
                                       name = "covid",
                                       overwrite = TRUE)

In terms of the overall output, I'm trying to run on a sample to cover the gap of not being able to generate the penicillin v drug cohort on the full database. Everything else worked when I told it to skip generating that particular cohort in the prevalent drug cohorts step. Trying to generate the penicillin v cohort would always give me an R out-of-memory error plus a SQL communication link failure and drop the SQL connection, making the cdm object unusable, regardless of increasing the analysis virtual machine's RAM or running on the database server itself. So I can submit all of the other results based on the full database.

tiozab commented 5 months ago

Thanks @steven-opc. I would suggest uploading the results (even without the penicillin v), and maybe going back to the mapping of that drug. I really cannot make sense of it either.

oxford-pharmacoepi / MegaStudy

Create subset for large data sources in the cdm #46