ohdsi-studies / Covid19CharacterizationCharybdis

Characterizing Health Associated Risks, and Your Baseline Disease In SARS-COV-2 (CHARYBDIS) - a study to describe the baseline demographic, clinical characteristics, treatments and outcomes of interest among individuals tested for SARS-CoV-2 and/or diagnosed with COVID-19 compared to Influenza patients from 2017-2018
Apache License 2.0
19 stars 26 forks source link

Execution runStudy with cohort #70

Closed CartWill closed 4 years ago

CartWill commented 4 years ago

I tried to execute the program but keep getting stuck

**********************************************************
 ---- Copy cohorts to main table ---- 
**********************************************************
Copy and censor cohorts to main analysis table
  |                                                                    |   0%
Error: Error executing SQL:
Not enough resources for query planning - too many subqueries or query is too complex.

The underlying SQL looks like it's trying to do 1,275 union all statements in a CTE. Apparently BigQuery has a limit of 1000 for unions within a cte.
If I re-write the query so all of the unions are inside a big "from" (i.e.: SELECT * INTO #TempTable FROM (the_long_list_of_unions) a), then I can get it to complete. I'm not sure how to fold that into the program though.

anthonysena commented 4 years ago

Hi @CartWill - I think I need to pull in the BigQuery SMEs on this one. Let me reach out to them to see if they can provide some guidance. I seem to recall they had some approaches to deal with this limitation.

konstjar commented 4 years ago

Hi @anthonysena @CartWill It looks like you met this type of limits from this page:

Maximum number of tables referenced per query — 1,000

But I may be wrong. I asked this question to GCP team.

konstjar commented 4 years ago

@anthonysena I got confirmation from GCP team that this is a hard limit -- for the query engine. This was not part of the DDL limit removal. Suggestion it to break this up into separate jobs and then fuse the results.

@CartWill is it a problem you meet in this analysis only?

CartWill commented 4 years ago

@konstjar @anthonysena It just happens with the runStudy with the "strata" group. I also tried with the "feature" group and it also fails. It looks like the SQL being generated is invalid:

Error executing query job. Message: Syntax error: Expected ")" but got keyword SELECT at [5:1]

SQL:
create table temp_covid_scratch.kivdo23ntarget_strata_xref 
           as with data as (

select na target_id, na strata_id, na cohort_id, 'NA' cohort_type
select  target_id,  strata_id,  cohort_id, '' cohort_type
          ) 
           select target_id,strata_id,cohort_id,cohort_type
           from data
anthonysena commented 4 years ago

@CartWill - something seems a bit off in that I would have expected the inner query to contain the list of targetStrataXref contents as a list of UNION'ed SELECT statemented per the code here.

Could you share the parameters you are using to call the runStudy function? I'm particularly interested if you are overriding these default parameters:

cohortIdsToExcludeFromExecution = c(),
cohortIdsToExcludeFromResultsExport = NULL,
cohortGroups = getUserSelectableCohortGroups(),
CartWill commented 4 years ago

@anthonysena These are the parameters I'm using. It's basically just the standard params with just our local DBMS information filled in.

oracleTempSchema <- "temp_covid_scratch"

# Details specific to the database:
databaseId <- "UCHealth_OMOP"
databaseName <- "OMOP_COVID19"
databaseDescription <- "COVID19_UCHealth_OMOP"

# Details for connecting to the CDM and storing the results
outputFolder <- file.path("/srv/covid/Charybdis/Output")
cdmDatabaseSchema <- "omop_covid19"
cohortDatabaseSchema <- "omop_covid19_cohort"
cohortTable <- paste0("AS_CHARYBDIS_", databaseId)
cohortStagingTable <- paste0(cohortTable, "_stg")
featureSummaryTable <- paste0(cohortTable, "_smry")
minCellCount <- 10
maxCores <- 4
useBulkCharacterization <- TRUE
cohortIdsToExcludeFromExecution <- c()
cohortIdsToExcludeFromResultsExport <- NULL

      runStudy(connectionDetails = connectionDetails,
               cdmDatabaseSchema = cdmDatabaseSchema,
               cohortDatabaseSchema = cohortDatabaseSchema,
               cohortStagingTable = cohortStagingTable,
               cohortTable = cohortTable,
               featureSummaryTable = featureSummaryTable,
               oracleTempSchema = oracleTempSchema,
               exportFolder = outputFolder,
               databaseId = databaseId,
               databaseName = databaseName,
               databaseDescription = databaseDescription,
               cohortIdsToExcludeFromExecution = cohortIdsToExcludeFromExecution,
               cohortIdsToExcludeFromResultsExport = cohortIdsToExcludeFromResultsExport,
               incremental = TRUE,
               useBulkCharacterization = useBulkCharacterization,
               minCellCount = minCellCount) 
anthonysena commented 4 years ago

@CartWill - I'm going to close this out since you were able to run the package. Not sure if there are any details you'd like to share on this thread in case others using GBQ run into this issue but feel free to do so. If the issue is still a problem, we can re-open and review. Thanks!