Hi @CartWill - I think I need to pull in the BigQuery SMEs on this one. Let me reach out to them to see if they can provide some guidance. I seem to recall they had some approaches to deal with this limitation.
Hi @anthonysena @CartWill It looks like you have hit one of the limits from this page:
Maximum number of tables referenced per query — 1,000
But I may be wrong. I have asked the GCP team about this.
@anthonysena I got confirmation from the GCP team that this is a hard limit for the query engine; it was not part of the DDL limit removal. Their suggestion is to break this up into separate jobs and then fuse the results.
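A minimal sketch of that batching idea, assuming a hypothetical staging table and illustrative IDs (the staging table name and cohort ID values below are placeholders, not from the package):

-- Load the xref rows in batches of fewer than 1,000 UNION'ed SELECTs per job
create table temp_covid_scratch.xref_staging (target_id int64, strata_id int64, cohort_id int64, cohort_type string);

-- Job 1: first batch
insert into temp_covid_scratch.xref_staging (target_id, strata_id, cohort_id, cohort_type)
select 101, 201, 301, 'T' union all
select 102, 202, 302, 'T';

-- Job 2: next batch (repeat until all rows are loaded)
insert into temp_covid_scratch.xref_staging (target_id, strata_id, cohort_id, cohort_type)
select 103, 203, 303, 'S' union all
select 104, 204, 304, 'S';

-- Fuse the results into the final table
create table temp_covid_scratch.target_strata_xref as
select target_id, strata_id, cohort_id, cohort_type
from temp_covid_scratch.xref_staging;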
@CartWill is this a problem you see in this analysis only?
@konstjar @anthonysena It happens with runStudy for the "strata" group. I also tried the "feature" group and it fails as well. It looks like the SQL being generated is invalid:
Error executing query job. Message: Syntax error: Expected ")" but got keyword SELECT at [5:1]
SQL:
create table temp_covid_scratch.kivdo23ntarget_strata_xref
as with data as (
select na target_id, na strata_id, na cohort_id, 'NA' cohort_type
select target_id, strata_id, cohort_id, '' cohort_type
)
select target_id,strata_id,cohort_id,cohort_type
from data
@CartWill - something seems a bit off in that I would have expected the inner query to contain the targetStrataXref contents as a list of UNION'ed SELECT statements per the code here.
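For reference, a well-formed version of that query would look something like this (the ID values here are just illustrative placeholders):

create table temp_covid_scratch.kivdo23ntarget_strata_xref
as with data as (
select 101 target_id, 201 strata_id, 301 cohort_id, 'T' cohort_type union all
select 102 target_id, 202 strata_id, 302 cohort_id, 'S' cohort_type
)
select target_id, strata_id, cohort_id, cohort_type
from data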
Could you share the parameters you are using to call the runStudy function? I'm particularly interested to know if you are overriding these default parameters:
cohortIdsToExcludeFromExecution = c(),
cohortIdsToExcludeFromResultsExport = NULL,
cohortGroups = getUserSelectableCohortGroups(),
@anthonysena These are the parameters I'm using. It's basically just the standard params with just our local DBMS information filled in.
oracleTempSchema <- "temp_covid_scratch"
# Details specific to the database:
databaseId <- "UCHealth_OMOP"
databaseName <- "OMOP_COVID19"
databaseDescription <- "COVID19_UCHealth_OMOP"
# Details for connecting to the CDM and storing the results
outputFolder <- file.path("/srv/covid/Charybdis/Output")
cdmDatabaseSchema <- "omop_covid19"
cohortDatabaseSchema <- "omop_covid19_cohort"
cohortTable <- paste0("AS_CHARYBDIS_", databaseId)
cohortStagingTable <- paste0(cohortTable, "_stg")
featureSummaryTable <- paste0(cohortTable, "_smry")
minCellCount <- 10
maxCores <- 4
useBulkCharacterization <- TRUE
cohortIdsToExcludeFromExecution <- c()
cohortIdsToExcludeFromResultsExport <- NULL
runStudy(connectionDetails = connectionDetails,
cdmDatabaseSchema = cdmDatabaseSchema,
cohortDatabaseSchema = cohortDatabaseSchema,
cohortStagingTable = cohortStagingTable,
cohortTable = cohortTable,
featureSummaryTable = featureSummaryTable,
oracleTempSchema = oracleTempSchema,
exportFolder = outputFolder,
databaseId = databaseId,
databaseName = databaseName,
databaseDescription = databaseDescription,
cohortIdsToExcludeFromExecution = cohortIdsToExcludeFromExecution,
cohortIdsToExcludeFromResultsExport = cohortIdsToExcludeFromResultsExport,
incremental = TRUE,
useBulkCharacterization = useBulkCharacterization,
minCellCount = minCellCount)
@CartWill - I'm going to close this out since you were able to run the package. Not sure if there are any details you'd like to share on this thread in case others using GBQ run into this issue but feel free to do so. If the issue is still a problem, we can re-open and review. Thanks!
I tried to execute the program but keep getting stuck.
The underlying SQL looks like it's trying to do 1,275 UNION ALL statements in a CTE. Apparently BigQuery has a limit of 1,000 unions within a CTE.
If I re-write the query so all of the unions are inside a big "from" (i.e. SELECT * INTO #TempTable FROM (the_long_list_of_unions) a), then I can get it to complete. I'm not sure how to fold that into the program, though.
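As a sketch of that rewrite, with placeholder rows standing in for the generated list of unions:

-- Workaround: wrap the long list of UNION ALLs in a derived table instead of a CTE
SELECT *
INTO #TempTable
FROM (
  SELECT 101 target_id, 201 strata_id, 301 cohort_id, 'T' cohort_type UNION ALL
  SELECT 102 target_id, 202 strata_id, 302 cohort_id, 'S' cohort_type
  -- ... remaining UNION ALL SELECTs from the generated query ...
) a;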