presagia-analytics / ctrialsgov

Query Data from ClinicalTrials.gov
https://presagia-analytics.github.io/ctrialsgov/
Other
13 stars 3 forks source link

Separate connection for creating database from reading from it #17

Closed kaneplusplus closed 2 years ago

kaneplusplus commented 2 years ago

I tend to write code like the following, which writes the raw database to duckdb and then writes the derived database.

library(ctrialsgov)
  ctgov_create_duckdb(
    basedir = dirname,
    dbdir = "ctgov.duckdb"
  ) |> ctgov_create_data('ctgov-derived.duckdb')

Does it make sense to create separate "write connections" for ctgov_create_duckdb() and ctgov_create_data(), which are explicitly closed after the write operation is complete? Other connections could then use readonly connections.

I'm mostly asking because I know that if all connections to a database are read only. Read-locks run with almost no overhead (unless there a write is happening). Also, I'm not sure it's a good idea for users to be able to fish through .volatiles, and use the connection to make changes to the database.

statsmaths commented 2 years ago

I updated the codebase to only pass around readonly connections. The only exception is the .volatiles$memory connection that is used a workspace for the query function and needs to remain open. Everything there is temporary, though. As a future reminder to myself, some dbConnect signatures take a read_only argument, but for duckdb this needs to be done at the driver level.

One remaining issue, though, is that in your example you are passing the connection directly to ctgov_create_data. This will produce once of the warnings about the connection not begin closed when it's garbage collected (since you have no reference to the connection, there is no way to close it). There are a few options here:

  1. The function ctgov_create_duckdb could just create the database, but not return a connection. Then, we just pass the dbdir location as a string to ctgov_create_data. The latter can close the connection once it is finished. Since we are closing and reopening the connection to make it read-only, there is no efficiency lost here.
  2. If we wrap the return from ctgov_create_duckdb in an environment, we can call reg.finalizer on it to cleanup the connection.
  3. The function ctgov_create_data can close whatever connection is given to it.
  4. Don't use pipes in this situation and hold the user accountable to closing the connection.

Preferences? None are particularly difficult to code.