qsbase / qs

Quick serialization of R objects

Saving big list: Error in c_qsave(x, file, ...): bad binding access #50

Closed AdrianAntico closed 3 years ago

AdrianAntico commented 3 years ago

Hi qs team,

I'm looking to save a list of 14 elements and I'm running into this error: "Error in c_qsave(x, file, preset, algorithm, compress_level, shuffle_control, : bad binding access"

The list contains several data.tables, a model object, a list of plots, individual plots, and at times even NULL elements. When I use save() it saves without issue. If I'm using qs::qsave() inappropriately, my apologies ahead of time. Below are my computer specs and some code to recreate the error. Let me know if you need anything else to help troubleshoot.
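One way to narrow down an error like this is to write each top-level element of the list separately and record which ones fail. The following is a minimal sketch, not part of qs: `find_failing_elements` and `picky_writer` are hypothetical names, and `writer` is assumed to be any function that takes an object and a file path (for example `qs::qsave`). The example calls use base `saveRDS` so that they run without extra packages.

```r
# Hypothetical helper (not part of qs): try to write each top-level element
# of a list on its own and return the labels of the elements that fail.
find_failing_elements <- function(x, writer = saveRDS) {
  if (is.null(names(x))) names(x) <- rep("", length(x))
  # Fall back to a positional label for unnamed elements.
  labels <- ifelse(names(x) == "", paste0("[[", seq_along(x), "]]"), names(x))
  ok <- vapply(seq_along(x), function(i) {
    tmp <- tempfile()
    on.exit(unlink(tmp), add = TRUE)
    # x[[i]] may be NULL; the writer is still called on it.
    tryCatch({
      writer(x[[i]], tmp)
      TRUE
    }, error = function(e) FALSE)
  }, logical(1))
  labels[!ok]
}

# Toy stand-in for the real output list (assumption: not the actual TestModel).
example_list <- list(
  table = data.frame(a = 1:3),
  plots = list(),
  empty = NULL
)
failing_base <- find_failing_elements(example_list)

# Illustrative writer that rejects functions, to show a failure being reported.
picky_writer <- function(obj, file) {
  if (is.function(obj)) stop("cannot write functions")
  saveRDS(obj, file)
}
failing_picky <- find_failing_elements(list(a = 1, f = mean), picky_writer)
```

To test qs itself, pass `writer = qs::qsave` if the package is installed. Checking elements one at a time only localizes the problem; it would not explain why the serializer fails on that element.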

PS Great package!

I'm working on a windows machine and here is the session info:

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] timeDate_3043.102

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5          compiler_4.0.3      pillar_1.4.6
 [4] qs_0.23.5           iterators_1.0.12    tools_4.0.3
 [7] catboost_0.24.3     digest_0.6.25       viridisLite_0.3.0
[10] lubridate_1.7.9     jsonlite_1.7.2      lifecycle_0.2.0
[13] tibble_3.0.4        gtable_0.3.0        lattice_0.20-41
[16] pkgconfig_2.0.3     rlang_0.4.7         Matrix_1.2-18
[19] foreach_1.5.0       rstudioapi_0.11     crosstalk_1.1.0.1
[22] yaml_2.2.1          parallel_4.0.3      RemixAutoML_0.3.3
[25] httr_1.4.1          dplyr_1.0.2         generics_0.0.2
[28] arules_1.6-6        vctrs_0.3.2         htmlwidgets_1.5.1
[31] grid_4.0.3          tidyselect_1.1.0    RApiSerialize_0.1.0
[34] glue_1.4.1          data.table_1.13.2   R6_2.4.1
[37] plotly_4.9.2.1      farver_2.0.3        tidyr_1.1.2
[40] ggplot2_3.3.2       purrr_0.3.4         magrittr_1.5
[43] scales_1.1.1        codetools_0.2-16    ellipsis_0.3.1
[46] htmltools_0.5.0     colorspace_1.4-1    labeling_0.3
[49] stringfish_0.14.2   RcppParallel_5.0.2  lazyeval_0.2.2
[52] doParallel_1.0.15   munsell_0.5.0       crayon_1.3.4

Code to recreate the error is below:

# Load data
data <- data.table::fread("https://www.dropbox.com/s/2str3ek4f4cheqi/walmart_train.csv?dl=1")

# Set negative numbers to 0
data <- data[, Weekly_Sales := data.table::fifelse(Weekly_Sales < 0, 0, Weekly_Sales)]

# Remove IsHoliday column
data[, IsHoliday := NULL]

# Change data types
data[, ":=" (Store = as.character(Store), Dept = as.character(Dept))]

# Fill gaps
data <- RemixAutoML::TimeSeriesFill(
  data,
  DateColumnName = "Date",
  GroupVariables = c("Store","Dept"),
  TimeUnit = "weeks",
  FillType = "maxmax",
  MaxMissingPercent = 0.25,
  SimpleImpute = TRUE)

# Shrink data for example
data <- data[Store %in% c(1:3)]

# Shrink data rows
data <- data[Date < "2012-03-09"]

# Build model
TestModel <- RemixAutoML::AutoCatBoostCARMA(

  # data args
  data = data,
  TimeWeights = 0.9999,
  TargetColumnName = "Weekly_Sales",
  DateColumnName = "Date",
  HierarchGroups = NULL,
  GroupVariables = c("Store","Dept"),
  TimeUnit = "weeks",
  TimeGroups = c("weeks","months"),

  # Production args
  TrainOnFull = FALSE,
  SplitRatios = c(1 - 10 / 110, 10 / 110),
  PartitionType = "random",
  FC_Periods = 33,
  TaskType = "GPU",
  NumGPU = 1,
  Timer = TRUE,
  DebugMode = TRUE,

  # Target variable transformations
  TargetTransformation = FALSE,
  Methods = c("YeoJohnson", "BoxCox", "Asinh", "Log", "LogPlus1", "Sqrt", "Asin", "Logit"),
  Difference = FALSE,
  NonNegativePred = TRUE,
  RoundPreds = FALSE,

  # Calendar-related features
  CalendarVariables = c("week","wom","month","quarter"),
  HolidayVariable = c("USPublicHolidays"),
  HolidayLags = c(1,2,3),
  HolidayMovingAverages = c(2,3),

  # Lags, moving averages, and other rolling stats
  Lags = list("weeks" = c(1,2,3,4,5,8,9,12,13,51,52,53), "months" = c(1,2,6,12)),
  MA_Periods = list("weeks" = c(2,3,4,5,8,9,12,13,51,52,53), "months" = c(2,6,12)),
  SD_Periods = NULL,
  Skew_Periods = NULL,
  Kurt_Periods = NULL,
  Quantile_Periods = NULL,
  Quantiles_Selected = NULL,

  # Bonus features
  AnomalyDetection = NULL,
  XREGS = NULL,
  FourierTerms = 0,
  TimeTrendVariable = TRUE,
  ZeroPadSeries = NULL,
  DataTruncate = FALSE,

  # ML grid tuning args
  GridTune = FALSE,
  PassInGrid = NULL,
  ModelCount = 5,
  MaxRunsWithoutNewWinner = 50,
  MaxRunMinutes = 60*60,

  # ML evaluation output
  PDFOutputPath = NULL,
  SaveDataPath = NULL,
  NumOfParDepPlots = 0L,

  # ML loss functions
  EvalMetric = "RMSE",
  EvalMetricValue = 1,
  LossFunction = "RMSE",
  LossFunctionValue = 1,

  # ML tuning args
  NTrees = 50L,
  Depth = 6L,
  L2_Leaf_Reg = NULL,
  LearningRate = NULL,
  Langevin = FALSE,
  DiffusionTemperature = 10000,
  RandomStrength = 1,
  BorderCount = 254,
  RSM = NULL,
  GrowPolicy = "SymmetricTree",
  BootStrapType = "Bayesian",
  ModelSizeReg = 0.5,
  FeatureBorderType = "GreedyLogSum",
  SamplingUnit = "Group",
  SubSample = NULL,
  ScoreFunction = "Cosine",
  MinDataInLeaf = 1)

# Save output (Error on this step)
qs::qsave(TestModel, file = file.path(getwd(), "Insights.Rdata"))

# Comparison (this works)
save(TestModel, file = file.path(getwd(), "Insights.Rdata"))

traversc commented 3 years ago

Thanks for the report. I believe I know how to fix the issue. Could you help me run the example?

Running the model I get "Error in catboost::catboost.train(learn_pool = TrainPool, test_pool = TestPool, : catboost/private/libs/algo/tensor_search_helpers.cpp:455: No groups in dataset. Please disable sampling or use per object sampling"

(Also I switched to CPU if that makes a difference)

AdrianAntico commented 3 years ago

@traversc sorry about that. I forgot to NULL out the GroupVariables argument...

# Load data
data <- data.table::fread("https://www.dropbox.com/s/2str3ek4f4cheqi/walmart_train.csv?dl=1")

# Set negative numbers to 0
data <- data[, Weekly_Sales := data.table::fifelse(Weekly_Sales < 0, 0, Weekly_Sales)]

# Remove IsHoliday column
data[, IsHoliday := NULL]

# Change data types
data[, ":=" (Store = as.character(Store), Dept = as.character(Dept))]

# Fill gaps
data <- RemixAutoML::TimeSeriesFill(
  data,
  DateColumnName = "Date",
  GroupVariables = c("Store","Dept"),
  TimeUnit = "weeks",
  FillType = "maxmax",
  MaxMissingPercent = 0.25,
  SimpleImpute = TRUE)

# Shrink data for example
data <- data[Store %in% c(1:3)]

# Shrink data rows
data <- data[Date < "2012-03-09"]

# Build model
TestModel <- RemixAutoML::AutoCatBoostCARMA(

  # data args
  data = data,
  TimeWeights = 0.9999,
  TargetColumnName = "Weekly_Sales",
  DateColumnName = "Date",
  HierarchGroups = NULL,
  GroupVariables = NULL,
  TimeUnit = "weeks",
  TimeGroups = c("weeks","months"),

  # Production args
  TrainOnFull = FALSE,
  SplitRatios = c(1 - 10 / 110, 10 / 110),
  PartitionType = "random",
  FC_Periods = 33,
  TaskType = "GPU",
  NumGPU = 1,
  Timer = TRUE,
  DebugMode = TRUE,

  # Target variable transformations
  TargetTransformation = FALSE,
  Methods = c("YeoJohnson", "BoxCox", "Asinh", "Log", "LogPlus1", "Sqrt", "Asin", "Logit"),
  Difference = FALSE,
  NonNegativePred = TRUE,
  RoundPreds = FALSE,

  # Calendar-related features
  CalendarVariables = c("week","wom","month","quarter"),
  HolidayVariable = c("USPublicHolidays"),
  HolidayLags = c(1,2,3),
  HolidayMovingAverages = c(2,3),

  # Lags, moving averages, and other rolling stats
  Lags = list("weeks" = c(1,2,3,4,5,8,9,12,13,51,52,53), "months" = c(1,2,6,12)),
  MA_Periods = list("weeks" = c(2,3,4,5,8,9,12,13,51,52,53), "months" = c(2,6,12)),
  SD_Periods = NULL,
  Skew_Periods = NULL,
  Kurt_Periods = NULL,
  Quantile_Periods = NULL,
  Quantiles_Selected = NULL,

  # Bonus features
  AnomalyDetection = NULL,
  XREGS = NULL,
  FourierTerms = 0,
  TimeTrendVariable = TRUE,
  ZeroPadSeries = NULL,
  DataTruncate = FALSE,

  # ML grid tuning args
  GridTune = FALSE,
  PassInGrid = NULL,
  ModelCount = 5,
  MaxRunsWithoutNewWinner = 50,
  MaxRunMinutes = 60*60,

  # ML evaluation output
  PDFOutputPath = NULL,
  SaveDataPath = NULL,
  NumOfParDepPlots = 0L,

  # ML loss functions
  EvalMetric = "RMSE",
  EvalMetricValue = 1,
  LossFunction = "RMSE",
  LossFunctionValue = 1,

  # ML tuning args
  NTrees = 50L,
  Depth = 6L,
  L2_Leaf_Reg = NULL,
  LearningRate = NULL,
  Langevin = FALSE,
  DiffusionTemperature = 10000,
  RandomStrength = 1,
  BorderCount = 254,
  RSM = NULL,
  GrowPolicy = "SymmetricTree",
  BootStrapType = "Bayesian",
  ModelSizeReg = 0.5,
  FeatureBorderType = "GreedyLogSum",
  SamplingUnit = "Group",
  SubSample = NULL,
  ScoreFunction = "Cosine",
  MinDataInLeaf = 1)

# Save output (Error on this step)
qs::qsave(TestModel, file = file.path(getwd(), "Insights.Rdata"))

# Comparison (this works)
save(TestModel, file = file.path(getwd(), "Insights.Rdata"))

traversc commented 3 years ago

Hi Adrian, I think I fixed the issue. Could you try it out?

devtools::install_github("traversc/qs@5e29db0db2a2c605dd878d18f9e6fe55e7a4027c")

Then run your example.

AdrianAntico commented 3 years ago

Hi @traversc I just tested it out and it worked!
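A `qsave` call that finishes without error does not by itself confirm that the object can be restored, so a round-trip check can be a useful follow-up. The sketch below is an illustration only: `roundtrip_equal` is a hypothetical helper, the toy list stands in for `TestModel`, and the comparison uses `all.equal`, which may not be meaningful for objects that hold external pointers (as a fitted model might). It relies on `qs::qsave(x, file)` and `qs::qread(file)`, and falls back to `saveRDS`/`readRDS` when qs is not installed.

```r
# Hypothetical helper: write an object, read it back, and compare.
roundtrip_equal <- function(x, writer, reader) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writer(x, tmp)
  isTRUE(all.equal(x, reader(tmp)))
}

# Use qs when it is available; otherwise fall back to base R serialization.
if (requireNamespace("qs", quietly = TRUE)) {
  writer <- function(x, file) qs::qsave(x, file)
  reader <- function(file) qs::qread(file)
} else {
  writer <- function(x, file) saveRDS(x, file)
  reader <- function(file) readRDS(file)
}

# Toy stand-in for TestModel (assumption: the real object is far more complex).
toy <- list(
  table = data.frame(a = 1:3, b = c("x", "y", "z")),
  empty = NULL,
  nested = list(1, "a")
)
result <- roundtrip_equal(toy, writer, reader)
```

For the real object, `roundtrip_equal(TestModel, writer, reader)` would be the analogous call, with the caveat about external pointers above.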

AdrianAntico commented 3 years ago

@traversc Here's a benchmark on the example I posted (just the saving-to-file part):

Unit: milliseconds
expr                                                                    min        lq      mean    median        uq       max neval
qs::qsave(TestModel, file = file.path(getwd(), "Insights.Rdata"))  527.2452  535.0064  608.3629  548.0963  699.4605  710.1596    30
save(TestModel, file = file.path(getwd(), "Insights.Rdata"))      3854.6459 3899.4460 3901.8063 3903.3161 3906.4496 3917.0938    30
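The medians reported here imply that `qs::qsave` was roughly 7 times faster than `save` on this object (about 3903.3 ms versus 548.1 ms). A comparison of this kind could be sketched as follows. This is an assumption about how such timings might be gathered, not the code that produced the numbers in this thread: `time_writers` is a hypothetical helper built on base `system.time`, the toy list stands in for `TestModel`, and the qs writer is only included when the package is installed.

```r
# Hypothetical helper: median elapsed time (in milliseconds) for each writer.
time_writers <- function(obj, writers, times = 30) {
  vapply(writers, function(write_fn) {
    tmp <- tempfile()
    on.exit(unlink(tmp), add = TRUE)
    elapsed <- vapply(seq_len(times), function(i) {
      system.time(write_fn(obj, tmp))[["elapsed"]]
    }, numeric(1))
    median(elapsed) * 1000
  }, numeric(1))
}

# Writers to compare; each takes an object and a file path.
writers <- list(save = function(obj, file) save(obj, file = file))
if (requireNamespace("qs", quietly = TRUE)) {
  writers$qsave <- function(obj, file) qs::qsave(obj, file)
}

# Toy stand-in for TestModel (assumption: much smaller than the real object).
toy <- list(table = data.frame(a = runif(1e4)), empty = NULL)
medians_ms <- time_writers(toy, writers, times = 5)

# Speedup implied by the medians reported in the thread.
reported_speedup <- 3903.3161 / 548.0963
```

Timings on a small toy object will not reproduce the reported gap; the ratio depends on object size, compression settings, and disk speed.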