@ablack3 @edward-burn how did we create our 100k subset?
Samples can be created from the CDM with cdmSample: https://darwin-eu.github.io/CDMConnector/reference/cdmSample.html. @ablack3 how about making this reproducible through a seed? @edward-burn
Hi @tiozab, thank you for the suggestion. I am collaborating with Carmen on this study, and I am able to use cdmSample to select a random sample of our data source. Are there any updates on setting the seed? The seed is picked randomly each time and I couldn't find a way to set it; I tried setting it before running cdmSample, but that didn't work.
@yl3613 the seed is only available as of CDMConnector version 1.4. To explain how we created our 100k subset: we save the subset tables in a separate schema, so we only create them once. You can do the same, or create the subset tables anew every time before running a study (but that also takes time).
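Purely as an illustration of those two options, here is a minimal sketch against the Eunomia test database used later in this thread. It assumes cdmSample() gained a seed argument in CDMConnector 1.4 and that copyCdmTo() is available for persisting the sample; the schema name "sample_schema" and the seed value are placeholders.
library(CDMConnector)

con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir("synthea-covid19-200k"))
cdm <- cdm_from_con(con, cdm_schema = "main", write_schema = "main")

# Draw a reproducible 100k-person sample; re-running with the same seed should
# return the same persons (assuming the seed argument available as of 1.4).
cdm_sampled <- cdmSample(cdm, n = 100000, seed = 123)

# Optionally persist the sampled tables to a separate schema so the subset is
# only built once and can be reused across study runs.
# (copyCdmTo() is assumed to be available in your CDMConnector version;
#  "sample_schema" stands in for a schema you have write access to.)
cdm_persisted <- copyCdmTo(con, cdm_sampled, schema = "sample_schema")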
@carmenOT CDMConnector 1.4 required updated versions of other packages, which produced slightly different DUS outputs that I had to amend first. Therefore, now just for you (and potentially some others, let's see ;-)), here you find an updated renv.lock and updated DUS code. It will be compatible with output from the others (I checked that :-)): DUS_1point4.zip. Thus, you can go ahead and run the DUS (in a "seeded" subset :-)).
Hi @tiozab,
Here is an idea that should work with older versions of CDMConnector.
library(CDMConnector)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir("synthea-covid19-200k"))
cdm <- cdm_from_con(con, "main", "main")
all_person_ids <- cdm$person %>% pull(person_id) %>% sort()
set.seed(1)
sampled_person_ids <- sample(all_person_ids, 1000)
# resetting the seed and running sample should give the same result
set.seed(1)
sampled_person_ids2 <- sample(all_person_ids, 1000)
all.equal(sampled_person_ids, sampled_person_ids2)
#> [1] TRUE
# subset the cdm with the sampled person ids
cdm <- cdmSubset(cdm, personId = sampled_person_ids)
cdmDisconnect(cdm)
Created on 2024-06-10 with reprex v2.1.0
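As a small follow-up to the reprex (not part of the original), a quick way to confirm the subset took effect, run before cdmDisconnect():
# Hypothetical check: after cdmSubset() the person table should only contain
# the 1000 sampled person ids.
cdm$person %>% tally()                                      # expect n = 1000
setdiff(cdm$person %>% pull(person_id), sampled_person_ids) # expect an empty result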
I've been trying to run a cdm subset using the new DUS_1point4.zip, but when running renv::restore() I get a gcc error compiling the old 1.5.1 version of igraph due to libxml2 include errors. igraph seems to be pulled in as a dependency, so I couldn't exclude it from the restore or hydrate a newer version. igraph doesn't appear in the original DUS renv lockfile.
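Not to pre-empt the maintainers, but one hedged note on that libxml2 error: building igraph from source needs the libxml2 development headers on the machine doing the compiling, so installing them at the system level is often enough to let the restore finish. The shell command below assumes a Debian/Ubuntu host and is shown as a comment because it runs outside R.
# System prerequisite (run in a shell, not in R); assumes Debian/Ubuntu:
#   sudo apt-get install libxml2-dev
# (on RHEL/CentOS the package is typically libxml2-devel)

# With the headers available, restoring the lockfile should be able to
# compile igraph 1.5.1 from source:
renv::restore()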
I also tried the alternative method above for older versions of CDMConnector on the original DUS code + environment and got:
Generating cohort (1/1) - covid_19) [1h 10m 24.6s]
Error in validateGeneratedCohortSet()
at omopgenerics/R/classCohortTable.R:53:3:
! 2440136 observations outside observation period.
Any ideas on how to progress on one of these two paths would be greatly appreciated.
@steven-opc yes, thank you. DiagrammeR is a dependency of the new package CohortCharacteristics used in the new code. @edward-burn, who maintains CohortCharacteristics: DiagrammeR has igraph 1.5.1 as a dependency, but it throws errors with this data partner (SQL Server) when restoring the library. Any suggestions?
@steven-opc with regard to the option of using the original version and the cdm sample as suggested by @ablack3: how big is your random sample? 2.4 million observations outside the observation period is a lot of people.
I am just trying to figure out right now where to investigate further.
@tiozab thanks, I was trying with n = 1000, so it seems likely that I'm doing something wrong here. The last line is where I get the error.
library(CDMConnector)
library(dplyr)

# Sample 1000 person ids reproducibly and subset the cdm to them
all_person_ids <- cdm$person %>% pull(person_id) %>% sort()
set.seed(1)
sampled_person_ids <- sample(all_person_ids, 1000)
cdm <- cdmSubset(cdm, personId = sampled_person_ids)

# Read the cohort definition and generate it against the subsetted cdm;
# the generateCohortSet() call is where the error is thrown
covid <- CDMConnector::readCohortSet(here::here("json_cohort"))
cdm <- CDMConnector::generateCohortSet(cdm,
                                       covid,
                                       name = "covid",
                                       overwrite = TRUE)
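For what it is worth, here are a couple of hedged sanity checks (not part of the study code) that could help narrow down where the out-of-observation-period records come from; they are meant to be run after cdmSubset() and before generateCohortSet():
# Confirm the subset actually applied: the person table should now hold only
# the 1000 sampled persons.
cdm$person %>% dplyr::tally()

# Sampled persons with no observation_period record at all; any cohort entry
# for these would fall outside an observation period by definition.
cdm$person %>%
  dplyr::anti_join(cdm$observation_period, by = "person_id") %>%
  dplyr::tally()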
In terms of the overall output, I'm trying to run on a sample to cover the gap of not being able to generate the penicillin V drug cohort on the full database. Everything else worked when I told it to skip that particular cohort in the prevalent drug cohorts step. Trying to run the penicillin V cohort would always give me an R out-of-memory error plus a SQL communication link failure that dropped the SQL connection and made the cdm object unusable, regardless of increasing the analysis virtual machine's RAM or running on the database server itself. So I can submit all of the other results based on the full database.
Thanks @steven-opc. I would suggest uploading the results (even without penicillin V) and maybe going back to the mapping of that drug. I really cannot make sense of it either.
Hello team, as we talked about during the megastudy call this morning, I was wondering if you have any recommendations on how to create a random subset of a large data source to run the study. Additionally, I would like to know how you would ensure the random selection is reproducible in the future, and whether all the steps can be done directly at the cdm connection level.
Thanks