ohdsi-studies / LegendT2dm


Quota limits in GBQ while generating LEGEND cohorts #20

Closed eminty closed 2 years ago

eminty commented 2 years ago
Error:
java.sql.SQLException: [Simba][BigQueryJDBCDriver](100032) Error executing query job. Message: Quota exceeded: Your table exceeded quota for imports or query appends per table. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas

SQL:
create table j6p09yq2codesets (
  codeset_id INT64 not null,
  concept_id INT64 not null
)

Have benefitted from discussions with @jdposada, @msuchard, and others. May be of relevance to @chrisknoll and @mgkahn.

This is from an attempt using this docker image (v 0.6) deployed in a GCP VM.

From looking at the assessment.log, this was on the 112th cohort instantiation (cohort id 201201000). Each cohort is generated using ATLAS-generated SQL; e.g., the one where the error was hit is this one.

Each of those SQL scripts has a series of insert statements. Some of these act on a newly created #codesets table (generated for each cohort); others insert the cohort into the cumulative study cohort table and track inclusion / summary stats. Each of those is preceded by a delete (although this started with a clean project, so there shouldn't have been a need to delete tables).

All of this is relevant because the relevant limit seems to be either table operations per day per table (1,500, which includes inserts and deletes) or possibly tables.insert method requests per 10 seconds (and tables.insert here possibly includes CREATE? Wondering because of the snippet that's blamed).
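To get a rough sense of how quickly those operations add up against a 1,500-per-day limit, here is a back-of-the-envelope count in R (the path is the one inside the deployed container, and the pattern is a naive text match rather than a SQL parser, so treat the numbers as illustrative only):

# Rough, illustrative count of statements touching the Codesets table in each
# ATLAS-generated cohort script (path assumed from the deployed container)
sqlDir <- "/workdir/LegendT2dm/inst/sql/sql_server/class"
sqlFiles <- list.files(sqlDir, pattern = "\\.sql$", full.names = TRUE)

countCodesetOps <- function(file) {
  sql <- paste(readLines(file, warn = FALSE), collapse = "\n")
  pattern <- "(CREATE TABLE|INSERT INTO|DELETE FROM|TRUNCATE TABLE|DROP TABLE)[^;]*Codesets"
  sum(lengths(regmatches(sql, gregexpr(pattern, sql, ignore.case = TRUE))))
}

opsPerCohort <- vapply(sqlFiles, countCodesetOps, integer(1))
# If roughly 1500 operations/day are allowed on a single table, this gives a
# feel for how many cohorts can be instantiated per day before the quota is hit
summary(opsPerCohort)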

Some solutions being explored include:

1) Raising the relevant quota (preferred, but it doesn't necessarily address such problems in the future).
2) Staging the execution somehow, for example batching as sketched below. (It doesn't look like inclusion stats can be deferred, since they are implicit in the SQL for each cohort, but perhaps cohort creation can proceed in batches sized to the quota limit.)
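For option 2, a rough sketch of batching in R (instantiateCohorts is a hypothetical stand-in for whatever function actually generates one batch of cohorts in the study package; batch size and wait time would depend on the quota):

# Hypothetical sketch only: instantiate cohorts in batches sized to stay under
# a daily per-table operations quota, pausing between batches
runInBatches <- function(cohortIds, batchSize, waitHours = 24, instantiateCohorts) {
  batches <- split(cohortIds, ceiling(seq_along(cohortIds) / batchSize))
  for (i in seq_along(batches)) {
    message("Instantiating batch ", i, " of ", length(batches))
    instantiateCohorts(batches[[i]])
    if (i < length(batches)) {
      Sys.sleep(waitHours * 3600)  # wait for the quota window to reset
    }
  }
}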

Discussion as to whether there's a longer-term SqlRender approach to this problem is deferred to those more knowledgeable than I :)

errorReportR.txt ErrorReportSql.txt assessmentLog.txt

konstjar commented 2 years ago

@eminty

Is it possible to check what version of the BigQuery JDBC driver you are using? Taking into account that you have DatabaseConnector v4.0.5, I can assume the versions of the OHDSI R packages are quite old. Can you try to run the analysis on the latest HADES version using the latest version of the JDBC driver?
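For example, a quick way to confirm both from R on the VM (the driver path below is just a placeholder; use whatever folder pathToDriver points at):

# Version of DatabaseConnector (and, analogously, other OHDSI packages) in the
# library that renv is using
packageVersion("DatabaseConnector")

# The Simba BigQuery driver version is usually visible in the jar file names
list.files("/path/to/jdbc/drivers", pattern = "\\.jar$")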

eminty commented 2 years ago

Thanks @konstjar ,

The BQ JDBC driver I used was 1.2.21.1025, which was current as of January. I see they are up to 1.2.22.1026; I could try that if we think there have been some material changes.

DatabaseConnector v4.0.2 is used as per this renv.lock, which is based closely on the original and has the other OHDSI dependencies / versions. (The only difference is that it references a branch of CohortDiagnostics to address a BQ permissions issue with getTableNames() when applied to the CDM schema.)

@msuchard would know better than I whether there are reasons for using this particular version of HADES in LEGEND.

I could look to run with the latest HADES version and driver. It pushes us a bit further away from reproducibility in the code base used for the study, and I don't know if it's going to generate other dependency issues, given that they are also specified in the lock file.

When you say 'latest version of HADES', is there a specification (esp. in the form of an renv.lock file) of all those packages and dependencies that can be referenced?
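If such a lock file is published, restoring directly from it would be the least ambiguous way to match versions, e.g.:

# Assuming a lock file pinning the recommended HADES package versions is
# provided, restoring from it reproduces that exact dependency set
renv::restore(lockfile = "renv.lock")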

I gather from your suggestion that you don't expect there is actually a quota issue here?

Happy to have the opportunity to learn from you on this. Thanks.

jdposada commented 2 years ago

Hi @konstjar,

Echoing what @eminty said, it would be better if you could provide an renv.lock file to use instead of the one in this GitHub repository. That would significantly reduce the guesswork and the potential back and forth over slight dependency changes and compatibility issues.

Thanks a lot for your help

eminty commented 2 years ago

As a (delayed) follow-up to this, I wound up engineering a workaround for Google's quota by substituting the generic 'Codesets' table name with 'CodesetsB' in half of the ATLAS-generated SQL scripts in the inst/sql/sql_server/class folder (per the snippet below). This prevents the quota limit from being triggered by table operations on 'Codesets' alone.

I did this by way of a Perl one-liner called from R. Paths are appropriate to a deployed LegendT2dm container. I suggest backing up the original SQL files first. Then:

# List the ATLAS-generated cohort SQL files and take the first half to rewrite
cohortSQL <- list.files("/workdir/LegendT2dm/inst/sql/sql_server/class")
n <- trunc(length(cohortSQL) / 2)
cohortSQL_toChange <- cohortSQL[1:n]

setwd("/workdir/LegendT2dm/inst/sql/sql_server/class")
for (i in seq_along(cohortSQL_toChange)) {
  # In-place substitution of the table name; -i.bak keeps a .bak copy of each file
  perl_script <- paste0("perl -pi.bak -e 's/Codesets/CodesetsB/g' ", cohortSQL_toChange[i])
  system(perl_script)
}

(Note that this also creates an additional backup of each original SQL file with a .bak extension; modified from this example.)
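As a quick sanity check (using the same objects and paths as the snippet above), one can confirm that none of the modified files still reference the original table name:

# Confirm the rewritten scripts now only reference CodesetsB
changedFiles <- file.path("/workdir/LegendT2dm/inst/sql/sql_server/class", cohortSQL_toChange)
stillHasOld <- vapply(changedFiles, function(f) {
  any(grepl("\\bCodesets\\b", readLines(f, warn = FALSE)))
}, logical(1))
changedFiles[stillHasOld]  # should be empty; CodesetsB does not match \bCodesets\b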

While it is not ideal to create a divergence in the code base across sites (including in the cohort SQL files), this did allow us to run the cohort characterization phase of LegendT2dm.

I suspect that, rather than needing to do this for other computational epidemiology experiments at large scale (an important part of the OHDSI mission!), we would want to discuss this quota limit with those at Google interested in seeing OHDSI thrive on a BigQuery DBMS. If that doesn't seem possible, those more knowledgeable than I could alternatively consider how ATLAS-generated SQL might be translated for BQ so as to conform to the limit.
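For reference, the dialect-specific rewriting already happens in SqlRender's translate step, so that is presumably where any BigQuery-specific handling of the Codesets operations would have to hook in. A minimal illustration of that call (not a fix, just to show where the dialect logic lives; a temp-emulation/scratch schema is typically also supplied for BigQuery):

library(SqlRender)

# ATLAS/OHDSI SQL is authored in SQL Server dialect and translated per dbms;
# "bigquery" is one of the supported target dialects
sql <- "CREATE TABLE #Codesets (codeset_id INT NOT NULL, concept_id BIGINT NOT NULL);"
translate(sql, targetDialect = "bigquery")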

mgkahn commented 2 years ago

@eminty I've forwarded your posting above to my GBQ contacts.......