steven-opc commented 6 months ago

Hi, I'm getting an error when trying to run the DUS code on our dataset. The feasibility and incidence/prevalence code ran without issue. SQL server, CDM 5.4

v 123 candidate concepts identified
Time taken: 2 minutes and 11 seconds
i Generating 1 cohort
v Generating cohort (1/1) - covid_19) [3h 4m 1.8s]
Created a temporary table named #dbplyr_001
Created a temporary table named #dbplyr_002
Created a temporary table named #dbplyr_003
Error: nanodbc/nanodbc.cpp:1769: 01000: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Adding a value to a 'date' column caused an overflow. [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The statement has been terminated.

'SELECT * INTO "results"."ehden_megastudyinfection" FROM (SELECT "cohort_definition_id", "subject_id", TRY_CAST("cohort_start_date" AS DATE) AS "cohort_start_date", TRY_CAST("cohort_end_date" AS DATE) AS "cohort_end_date" FROM ( SELECT "cohort_definition_id", "subject_id", MIN("cohort_start_date") AS "cohort_start_date", MAX("cohort_end_date") AS "cohort_end_date" FROM ( SELECT "q01".*, SUM(CASE WHEN ("prev_start" <= "cohort_start_date" AND "cohort_start_date" <= "prev_end") THEN 0 ELSE 1 END) OVER (PARTITION BY "cohort_definition_id", "subject_id" ORDER BY "cohort_start_date", "cohort_end_date" ROWS UNBOUNDED PRECEDING) AS "groups" FROM ( SELECT "q01".*, COALESCE(min("cohort_

tiozab commented 6 months ago

@steven-opc thank you. Have you used the renv for DUS? If yes, can you send me the log from the "storage" folder and let me know your database name? Is it very big? It is quite strange that the covid cohort took 3 h.

steven-opc commented 6 months ago

yes, I used the renv. database name is OPCRD, 25m patients, about 6TB

log file is:
INFO [2024-04-29 15:29:34] CREATE CDM OBJECT
INFO [2024-04-29 15:29:59] CREATE SNAPSHOT
INFO [2024-04-29 15:30:03] GENERATE INDICATION CONCEPTS
INFO [2024-04-29 16:51:00] GENERATE INDICATION COHORTS
INFO [2024-04-29 19:55:18] neutropenia
INFO [2024-04-29 20:11:55] bacteraemia
INFO [2024-04-29 20:12:39] infection

tiozab commented 6 months ago

@steven-opc yes, that is quite a large DB. Infection is also the biggest indication cohort. Let's try to remove it and see whether the rest runs smoothly. In the script DUS.R, comment out line 63:

"infection",

steven-opc commented 6 months ago

thanks I will try that and let you know

steven-opc commented 6 months ago

it failed again with the same error at the urogenital_surgery stage. Rather than keep excluding cohorts from being generated I went back to the data to try and resolve issues with it. We had some individual event dates recorded in source as 9999-12-31 that I've now changed to 9000-12-31. They're outside the observation period either way. That looks like it's resolved the issue for now as it's generated all of the indication cohorts and has continued, currently on the Generate Incident Drug Cohorts stage. Have fed the problematic records back to our internal data quality manager for a proper fix.
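For anyone hitting the same error, here is a minimal sketch of why a 9999-12-31 placeholder can trigger it. It is written in Python purely for illustration (the study code itself is R and SQL), and it assumes, going only by the wording of the error message, that some step in the cohort-generation SQL adds days to a date. SQL Server's date type tops out at 9999-12-31, as does Python's datetime.date, so adding even one day to that value has nowhere to go. The helper name add_days is hypothetical.

```python
from datetime import date, timedelta

# Python's datetime.date has the same upper bound as SQL Server's date type.
MAX_DATE = date.max  # 9999-12-31


def add_days(d, days):
    """Return d + days, or None if the result falls outside the date range."""
    try:
        return d + timedelta(days=days)
    except OverflowError:
        return None


sentinel = date(9999, 12, 31)  # placeholder value as recorded in the source
patched = date(9000, 12, 31)   # the workaround value described above

overflow_result = add_days(sentinel, 1)  # None: no valid date after the maximum
patched_result = add_days(patched, 1)    # still a valid date
```

The patched value leaves nearly a thousand years of headroom, which is why the same arithmetic stops failing after the change.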

tiozab commented 6 months ago

@steven-opc absolutely, so it was not due to the file getting too big, but to dates that do not make sense: these big cohorts likely included patients with bad dates. Good catch! If you can identify the patients without good dates, you can create a new cdm by subsetting to the patient ids with good dates: https://darwin-eu.github.io/CDMConnector/reference/cdmSubset.html

steven-opc commented 6 months ago

Typically the patients have one bad record and the rest with good dates so I'm not sure if you'd want to exclude them?
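To make the distinction concrete, here is a rough sketch of flagging bad records rather than excluding whole patients. It is in Python for illustration only and is not part of the study code. The rows are made-up data, and the PLAUSIBLE_MAX cutoff is a hypothetical threshold that would need to be chosen per database.

```python
from collections import Counter
from datetime import date

# Hypothetical cutoff: anything later than this is treated as a placeholder date.
PLAUSIBLE_MAX = date(2100, 1, 1)

# Made-up example rows: (person_id, event_date)
records = [
    (1, date(2019, 3, 14)),
    (1, date(9999, 12, 31)),  # the one bad record for person 1
    (1, date(2021, 7, 2)),
    (2, date(2020, 11, 30)),
]

# Split at record level instead of dropping every row for an affected person.
bad = [r for r in records if r[1] > PLAUSIBLE_MAX]
good = [r for r in records if r[1] <= PLAUSIBLE_MAX]

# How many bad records each person has, for feeding back to data quality.
bad_per_person = Counter(person_id for person_id, _ in bad)

# Persons that still have usable records after the bad rows are set aside.
persons_kept = {person_id for person_id, _ in good}
```

In this toy data person 1 stays in the analysis with two good records, whereas a person-level subset would drop all three.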

Unfortunately R has errored out with a different error. Is there some way to resume it from where it left off or do I need to restart from the beginning again?

Getting incidence for analysis 1 of 2
Getting incidence for analysis 2 of 2
Error in eval(ei, envir) : nanodbc/nanodbc.cpp:1769: 42000: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]There is insufficient system memory in resource pool 'internal' to run this query.
In addition: There were 16 warnings (use warnings() to see them)

last lines in the log file:
INFO [2024-05-02 22:06:17] GENERATE INCIDENT DRUG COHORTS
INFO [2024-05-02 22:06:17] anakinra
INFO [2024-05-02 22:53:59] etanercept
INFO [2024-05-02 23:33:12] abatacept
INFO [2024-05-03 00:06:36] urokinase
INFO [2024-05-03 00:45:21] daunorubicin
INFO [2024-05-03 01:22:55] arsenic_trioxide
INFO [2024-05-03 02:12:58] alteplase
INFO [2024-05-03 02:44:21] imiglucerase
INFO [2024-05-03 03:19:17] upadacitinib
INFO [2024-05-03 03:51:16] bevacizumab
INFO [2024-05-03 04:27:57] cetrorelix
INFO [2024-05-03 05:14:12] baricitinib
INFO [2024-05-03 06:09:13] agalsidase_beta
INFO [2024-05-03 06:50:07] ganirelix
INFO [2024-05-03 07:37:48] sarilumab
INFO [2024-05-03 08:19:01] meropenem
INFO [2024-05-03 08:59:12] amoxicillin

tiozab commented 6 months ago

@steven-opc true, the sampling by person is not ideal. I will pass that on. I know it can be quite cumbersome to re-run everything, but I think it is probably faster than taking the code apart, saving steps here and there, and making sure to stitch everything back together in the end. Since your DB is also quite large, we usually suggest running on a subset first to see whether everything runs (however, some errors are only found when running on the whole database, especially if they have to do with rare weird dates). See https://github.com/oxford-pharmacoepi/MegaStudy/issues/46. We have a 100k subset with which we test everything before we run on the full data (also around 20-30M).
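As a rough illustration of the subset-first idea, the sketch below draws a reproducible sample of person ids. It is in Python for illustration only, and it is an assumption about how such a subset could be drawn, not how the 100k subset mentioned here was built. The function name sample_person_ids, the seed, and the id range are all hypothetical.

```python
import random


def sample_person_ids(person_ids, n, seed=42):
    """Return a reproducible random sample of up to n person ids, sorted."""
    ids = sorted(set(person_ids))
    rng = random.Random(seed)  # fixed seed so repeated runs give the same subset
    return sorted(rng.sample(ids, min(n, len(ids))))


all_ids = range(1, 1_000_001)  # stand-in for the ids in the person table
subset_ids = sample_person_ids(all_ids, 100_000)
```

A fixed seed keeps the test subset stable between runs, but a random sample can easily miss the handful of records with placeholder dates, so a clean run on the subset does not rule out the overflow on the full data.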

tiozab commented 4 months ago

@steven-opc has this been resolved?

steven-opc commented 4 months ago

the original issue was resolved yes

oxford-pharmacoepi / MegaStudy

date overflow when running DUS #50