Closed by @eminty 2 years ago
Is it possible to check which version of the BigQuery JDBC driver you are using? Given that you have DatabaseConnector v4.0.5, I assume the versions of the OHDSI R packages are quite old. Can you try running the analysis on the latest HADES version with the latest version of the JDBC driver?
Thanks @konstjar ,
The BQ JDBC driver I used was 1.2.21.1025, which was current as of January. I see they are up to 1.2.22.1026; I could try that if we think there have been some material changes.
DatabaseConnector v4.0.2 is used as per this renv.lock, based closely on the original; it pins the other OHDSI dependencies and their versions. (The only difference is that it references a branch of CohortDiagnostics to address a BQ permissions issue with getTableNames() when applied to the CDM schema.)
@msuchard would know better than I whether there are reasons for using this particular version of HADES in LEGEND.
I could try running with the latest HADES version and driver. That pushes us a bit further from reproducibility of the code base used for the study, and I don't know whether it will generate other dependency issues, given that the dependencies are also pinned in the lock file.
When you say 'latest version of HADES', is there a specification (esp. in the form of an renv.lock file) of all those packages and dependencies that can be referenced?
I gather from your suggestion that you don't expect there is actually a quota issue here?
Happy to have the opportunity to learn from you on this. Thanks.
hi @konstjar,
Echoing what @eminty said: it would help if you could provide the renv.lock file that should be used instead of the one in this GitHub repository. That would significantly reduce the guesswork and the potential back and forth over small dependency and compatibility changes.
Thanks a lot for your help
As a (delayed) follow-up to this, I ended up engineering a workaround for Google's quota by substituting the generic 'Codesets' table name in the ATLAS-generated SQL in the inst/sql/sql_server/class folder with 'CodesetsB'. This prevents the quota limit from being triggered by table operations on 'Codesets'.
I did this with a perl one-liner invoked from R. The paths below are appropriate to a deployed LegendT2dm container. I suggest backing up the original SQL files first. Then:
```r
# List the ATLAS-generated cohort SQL files
cohortSQL <- list.files("/workdir/LegendT2dm/inst/sql/sql_server/class")

# Modify only the first half of the files, so table operations are split
# between Codesets (unmodified files) and CodesetsB (modified files)
n <- trunc(length(cohortSQL) / 2)
cohortSQL_toChange <- cohortSQL[1:n]

setwd("/workdir/LegendT2dm/inst/sql/sql_server/class")
for (i in seq_along(cohortSQL_toChange)) {
  # perl -pi.bak edits the file in place, keeping a .bak backup
  perl_script <- paste0("perl -pi.bak -e 's/Codesets/CodesetsB/g' ",
                        cohortSQL_toChange[i])
  system(perl_script)
}
```
(Note that this also creates an additional backup of each original SQL file with a .bak extension; modified from this example.)
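If the substitution ever needs to be undone, the .bak copies that `perl -pi.bak` leaves behind can simply be copied back over the modified files. A minimal sketch, demonstrated here on a temporary directory rather than the container path above:

```r
# Sketch of restoring the .bak backups left by `perl -pi.bak`.
# Uses a temporary directory for illustration; substitute the container's
# inst/sql/sql_server/class path when running for real.
sql_dir <- file.path(tempdir(), "class")
dir.create(sql_dir, showWarnings = FALSE)

# Simulate one modified file plus the backup of its original
writeLines("CREATE TABLE CodesetsB (...);", file.path(sql_dir, "cohort.sql"))
writeLines("CREATE TABLE Codesets (...);", file.path(sql_dir, "cohort.sql.bak"))

# Restore every .bak over its modified counterpart
bak_files <- list.files(sql_dir, pattern = "\\.bak$", full.names = TRUE)
for (f in bak_files) {
  file.copy(f, sub("\\.bak$", "", f), overwrite = TRUE)
}

readLines(file.path(sql_dir, "cohort.sql"))  # original content is back
```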
While it is not ideal to create a divergence in the code base across sites (including the cohort SQL files), this did allow us to run the cohort characterization phase of LEGENDT2DM.
I suspect that, rather than needing to do this for other large-scale computational epidemiology experiments (an important part of the OHDSI mission!), we would want to discuss this quota limit with those at Google interested in seeing OHDSI thrive on a BigQuery dbms. If that doesn't seem possible, those more knowledgeable than I could instead consider how ATLAS-generated SQL might be translated for BQ so as to conform to the limit.
@eminty I've forwarded your posting above to my GBQ contacts.
I have benefited from discussions with @jdposada, @msuchard, and others. This may be of relevance to @chrisknoll and @mgkahn.
This is from an attempt using this docker image (v 0.6) deployed in a GCP VM.
From looking at the assessment.log, the failure occurred on the 112th cohort instantiation (cohort id 201201000). Each cohort is generated using ATLAS-generated SQL; e.g., the one where the error was hit is this one.
Each of those SQL scripts contains a series of insert statements. Some act on a newly created #codesets table (generated for each cohort), while others insert the cohort into the cumulative study cohort table and track inclusion/summary stats. Each insert is preceded by a delete (although this run started with a clean project, so there should not have been a need to delete tables).
All of this is relevant because the binding limit appears to be either table operations per table per day (1,500, which includes inserts and deletes) or possibly tables.insert method requests per 10 seconds (and tables.insert here possibly includes CREATE? I wonder because of the snippet that's blamed).
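To make the arithmetic concrete: if each cohort's SQL issues on the order of a dozen table operations against the same table, the 1,500 operations/day quota is exhausted after roughly a hundred cohorts, which lines up with failing on the 112th. A back-of-envelope sketch (the per-cohort count here is an illustrative guess, not measured from the actual scripts):

```r
# Back-of-envelope estimate: how many cohorts fit under BigQuery's
# per-table daily operation quota. ops_per_cohort is an illustrative
# assumption, not a count taken from the ATLAS SQL.
table_ops_quota <- 1500   # BigQuery table operations per table per day
ops_per_cohort  <- 13     # assumed inserts + deletes touching one table

cohorts_before_quota <- table_ops_quota %/% ops_per_cohort
cohorts_before_quota  # ~115, in the same range as the observed failure
```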
Some solutions being explored include:

1. Looking into raising the relevant quota (preferred, but it doesn't necessarily address such problems in the future).
2. Staging the execution somehow (it doesn't look like inclusion stats can be deferred, as they are implicit in the SQL for each cohort, but perhaps cohort creation can proceed in batches sized to the quota limit).
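The second option could look something like the following: partition the cohort ids into batches small enough to stay under the daily quota, and instantiate one batch per day. A sketch only; `batch_cohorts` and its parameter values are hypothetical and not part of LegendT2dm:

```r
# Hypothetical batching of cohort instantiation to respect a per-table
# daily operation quota. This function does not exist in LegendT2dm;
# it only illustrates the staging idea.
batch_cohorts <- function(cohort_ids, ops_per_cohort, daily_quota = 1500) {
  per_batch <- max(1, daily_quota %/% ops_per_cohort)
  split(cohort_ids, ceiling(seq_along(cohort_ids) / per_batch))
}

batches <- batch_cohorts(seq_len(300), ops_per_cohort = 13)
length(batches)      # number of daily batches needed
lengths(batches)[1]  # cohorts instantiated per batch
```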
Discussion of whether there is a longer-term SqlRender approach to this problem is deferred to those more knowledgeable than I :)
errorReportR.txt ErrorReportSql.txt assessmentLog.txt