richfitz / storr

:package: Object cacher for R
http://richfitz.github.io/storr

Error: NOT NULL constraint failed: datatable.value - Storing large (~1.1GB) dataframe in SQLite cache #120

Open strazto opened 4 years ago

strazto commented 4 years ago

Originally posted here: https://github.com/ropensci/drake/issues/1201

Overview

Within my drake workflow, when trying to store a particular dataframe, I keep getting the same error:

Error: NOT NULL constraint failed: datatable.value

This error always occurs at the stage of attempting to store my dataframe.

With some digging, I found that "datatable" is the table name used by my cache:

cache$driver$tbl_data
#> [1] "datatable"

After reviewing the source for driver_dbi, I can see where this NOT NULL constraint is set, as well as the definitions for set_object.

Does anyone have any idea how this datatable.value might end up being NULL?

Lines

NOT NULL constraint set here:

https://github.com/richfitz/storr/blob/0c64f1efb0574b46059a962de9d264a74d448843/R/driver_dbi.R#L171-L173

Defining set_object SQL Call:

https://github.com/richfitz/storr/blob/0c64f1efb0574b46059a962de9d264a74d448843/R/driver_dbi.R#L514-L515

Defining set_object method for driver_dbi:

https://github.com/richfitz/storr/blob/0c64f1efb0574b46059a962de9d264a74d448843/R/driver_dbi.R#L305-L312
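For reference, the linked pieces amount to roughly the following. This is a sketch of the schema and the insert, not the exact storr source; the table and column names are taken from the error message:

```r
# Sketch (assumption: this mirrors the linked driver_dbi.R code,
# it is not the exact storr implementation).
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")

# The data table is created with a NOT NULL constraint on `value`.
dbExecute(con, "CREATE TABLE datatable (hash TEXT PRIMARY KEY NOT NULL, value BLOB NOT NULL)")

# set_object binds (hash, serialized object) into an INSERT like this one.
value <- serialize(data.frame(x = 1), connection = NULL)
dbExecute(con, "INSERT INTO datatable (hash, value) VALUES (:hash, :value)",
          params = list(hash = "abc123", value = list(value)))

# If the bound blob ends up NULL for any reason, SQLite raises
# "NOT NULL constraint failed: datatable.value".
dbDisconnect(con)
```

So the question reduces to how a non-NULL R object can arrive at the bind step as a NULL blob.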

wlandau commented 4 years ago

What happens when you run your data frame on the minimal example below?

library(RSQLite)
library(storr)
# Set up the DBI storr.
drv <- dbDriver("SQLite")
con <- dbConnect(drv, "db_file.db")
s <- storr::storr_dbi(
  "tbl_data",
  "tbl_keys",
  con
)
# Load or create your data.
data <- data.frame(x = 1) # Different for you.
s$set("data", data) # Should reproduce the error.
s$get("data")
#>   x
#> 1 1

Created on 2020-03-05 by the reprex package (v0.3.0)

strazto commented 4 years ago

What happens when you run your data frame on the minimal example below?

I'll give this a try now, thanks @wlandau

Update

As you said, the error was reproduced.

library(RSQLite)
library(storr)
# Set up the DBI storr.
drv <- dbDriver("SQLite")
con <- dbConnect(drv, "db_file.db")
s <- storr::storr_dbi(
  "tbl_data",
  "tbl_keys",
  con
)
# Load or create your data.
data <- readr::read_rds("~/Documents/workflows/bulk_journey_analysis_base_combo.rds")
s$set("data", data) # Should reproduce the error.
s$get("data")
#> Error: NOT NULL constraint failed: tbl_data.value

strazto commented 4 years ago

I dug into the stack trace the other day and investigated this. I'll have to do it again, since some time has passed. From what I saw, the dataframe seemed to pass through storr relatively intact; it was at some DBI method dispatch that things suddenly stopped being as they should have been.

I could swear I raised an issue or documented this somewhere, but now I can't find it. I may have to dig it up again.

strazto commented 4 years ago

@wlandau
In https://github.com/ropensci/drake/issues/1201#issuecomment-595398007 you asked:

Also, is there a reason you commented out format = "fst"? Seems like it would avoid the error and run much faster.

Getting fst to compile on my HPC has proved painful.

strazto commented 4 years ago

Okay, debugging the stack trace now:

traceback()
19: stop(list(message = "NOT NULL constraint failed: tbl_data.value", 
        call = NULL, cppstack = NULL))
18: result_bind(res@ptr, params)
17: db_bind(res, as.list(params), ..., allow_named_superset = FALSE)
16: dbBind(rs, params)
15: dbBind(rs, params)
14: .local(conn, statement, ...)
13: dbSendQuery(conn, statement, ...)
12: dbSendQuery(conn, statement, ...)
11: dbSendStatement(conn, statement, ...)
10: dbSendStatement(conn, statement, ...)
9: DBI::dbExecute(self$con, self$sql$set_object, dat)
8: DBI::dbExecute(self$con, self$sql$set_object, dat)
7: self$driver$set_object(hash, value_send)
6: self$set_value(value, use_cache)
5: s$set("data", data) at test_cache_break.R#13
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("~/work/repos/SPEED-EXTRACT.pipeline/inst/investigations/test_cache_break.R", 
       echo = TRUE)

At the point of RSQLite::result_bind, I can perform the following:

ls()
#> [1] "params" "res"   

params is a large list containing a small hash and what appears to be a serialized version of my dataframe (a raw vector).

str(params)
#> List of 2
#> $ : chr "ac42c7b24b2ed64c449dc6de4e1590a2"
#> $ :List of 1
#>  ..$ : raw [1:1485753874] 58 0a 00 00 ...

res is an externalptr with some value.

The only statement in result_bind really just forwards to a C++ call:

invisible(.Call(`_RSQLite_result_bind`, res, params))
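One observation about the numbers above: the serialized payload is 1,485,753,874 bytes, while SQLite's compile-time default for the largest BLOB it will accept (SQLITE_MAX_LENGTH) is 1,000,000,000 bytes. If the binding layer cannot store a blob over that limit, the column could plausibly end up NULL and trip the constraint. This is an assumption worth checking, not a confirmed diagnosis:

```r
# Assumption: the bind fails because the blob exceeds SQLite's default
# SQLITE_MAX_LENGTH of 1e9 bytes. Quick arithmetic check against the
# size reported by str(params) above.
sqlite_max_length <- 1e9        # SQLite's compile-time default BLOB limit
payload_bytes <- 1485753874     # length of the raw vector seen in str(params)
payload_bytes > sqlite_max_length
#> [1] TRUE
```

If that is the cause, the ~1.1 GB dataframe would fail for size reasons well before any storr-level logic goes wrong.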