oxford-pharmacoepi / DatabaseCharacterisation

Apache License 2.0
0 stars 0 forks source link

Problem running CodeToRun.R #2

Open dandedman opened 2 months ago

dandedman commented 2 months ago

Hi @catalamarti Getting an error running the code on our 50 practice database. I see several log files - attaching the latest one only: log_2024_05_02_19_44_40.txt

Here's the console output:

> minCellCount <- 5
> # Run the study
> source(here("RunStudy.R"))
Error in `checkNA()`:
! `variable_name` must not contain NA.
Run `rlang::last_trace()` to see where the error occurred.
Warning messages:
1: The following columns will be overwritten: provider_id, care_site_id 
2: The following columns will be overwritten: provider_id 
3: The following columns will be overwritten: provider_id 
4: The following columns will be overwritten: provider_id 
5: The following columns will be overwritten: provider_id 
6: The following columns will be overwritten: provider_id 
7: The following columns will be overwritten: provider_id 
> rlang::last_trace()
<error/rlang_error>
Error in `checkNA()`:
! `variable_name` must not contain NA.
---
Backtrace:
     x
  1. +-base::source(here("RunStudy.R"))
  2. | +-base::withVisible(eval(ei, envir))
  3. | \-base::eval(ei, envir)
  4. |   \-base::eval(ei, envir)
  5. +-base::source(here("Analyses", "1-SummariseClinicalTables.R")) at DatabaseCharacterisation/Study/RunStudy.R:50:0
  6. | +-base::withVisible(eval(ei, envir))
  7. | \-base::eval(ei, envir)
  8. |   \-base::eval(ei, envir)
  9. +-omopgenerics::bind(...) at Study/Analyses/1-SummariseClinicalTables.R:59:2
 10. +-omopgenerics:::bind.summarised_result(...)
 11. \-global summaryCodeCounts(cdm[[table]], ageGroups) at Study/Analyses/1-SummariseClinicalTables.R:59:2
 12.   \-omopgenerics::newSummarisedResult(...) at Study/Analyses/functions.R:653:2
 13.     \-omopgenerics:::validateSummariseResult(x)
 14.       \-omopgenerics:::checkNA(x = x, "summarised_result")
Run rlang::last_trace(drop = FALSE) to see 2 hidden frames.
catalamarti commented 2 months ago

hi @dandedman look there was good progress in the log file. I think that this error is because there are NA in death table. I can amend the code to make sure that this is allowed. To check if I did the correct diagnostics can you run the following code?

x <- cdm$death |> dplyr::summarise(dplyr::across(dplyr::everything(), ~ sum(dplyr::if_else(is.na(.x), 1, 0)))) |> dplyr::collect()
x
dandedman commented 2 months ago

HI @catalamarti :- yes - we do not have cause of death in the primary care so these fields are left null. (do you do something different with GOLD?)

> x <- cdm$death |> dplyr::summarise(dplyr::across(dplyr::everything(), ~ sum(dplyr::if_else(is.na(.x), 1, 0)))) |> dplyr::collect()
Warning message:
Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours. 
> x
# A tibble: 1 x 7
  person_id death_date death_datetime death_type_concept_id cause_concept_id cause_source_value cause_source_concept_id
      <dbl>      <dbl>          <dbl>                 <dbl>            <dbl>              <dbl>                   <dbl>
1         0          0              0                     0           101139             101139                  101139
> 
catalamarti commented 2 months ago

@dandedman so in principle cause_concept_id and source_concept_id should be 0 (instead of NA), but hopefully the new version (https://github.com/oxford-pharmacoepi/DatabaseCharacterisation/commit/a1bc89cadad9663675c15c74c4fb9642fccfb045) should make it work

dandedman commented 2 months ago

Hi @catalamarti: the new version failed with same error

re the use of Null vs 0 in concept_id fields I see there is an issue (plus link to a long and testy exchange in the OHDSI forums) in the Themis repository https://github.com/OHDSI/Themis/issues/11. We will update cause_concept_id and source_concept_id to use 0 in a subsequent ETL - though I note the recommendation is not yet implemented in the CDM definition (which still allows null for these fields), so this problem could recur for the moment.

catalamarti commented 2 months ago

hi @dandedman I've manually corrected the NA to 0. https://github.com/oxford-pharmacoepi/DatabaseCharacterisation/commit/a508848c5e6fc75f3dc0bb5110b4af066cae6822 I tested with an small database with NA and it worked, could you try again with the new version? Thank you very much!

dandedman commented 1 month ago

Hi @catalamarti unfortunately I'm in Mallorca and I couldn't fit the laptop in my bag. Will try it first thing Monday!

dandedman commented 1 month ago

Hi @catalamarti

It failed at the same point again.

Are the results generated to this point still useful, and if so is it worth me running this on the full db so at least we have something to look at tomorrow?

log_2024_05_13_09_48_15.txt

Run the study

source(here("RunStudy.R")) Error in checkNA(): ! variable_name must not contain NA. Run rlang::last_trace() to see where the error occurred. Warning messages: 1: The following columns will be overwritten: provider_id, care_site_id 2: The following columns will be overwritten: provider_id 3: The following columns will be overwritten: provider_id 4: The following columns will be overwritten: provider_id 5: The following columns will be overwritten: provider_id 6: The following columns will be overwritten: provider_id 7: The following columns will be overwritten: provider_id rlang::last_trace() <error/rlang_error> Error in checkNA(): ! variable_name must not contain NA.

Backtrace: x

  1. +-base::source(...)
  2. | +-base::withVisible(eval(ei, envir))
  3. | -base::eval(ei, envir)
  4. | -base::eval(ei, envir)
  5. +-base::source(here("RunStudy.R")) at DatabaseCharacterisation/Study/CodeToRun_Aurum50.R:49:0
  6. | +-base::withVisible(eval(ei, envir))
  7. | -base::eval(ei, envir)
  8. | -base::eval(ei, envir)
  9. +-base::source(here("Analyses", "1-SummariseClinicalTables.R")) at DatabaseCharacterisation/Study/RunStudy.R:50:0
  10. | +-base::withVisible(eval(ei, envir))
  11. | -base::eval(ei, envir)
  12. | -base::eval(ei, envir)
  13. +-omopgenerics::bind(...) at Study/Analyses/1-SummariseClinicalTables.R:59:2
  14. +-omopgenerics:::bind.summarised_result(...)
  15. -global summaryCodeCounts(cdm[[table]], ageGroups) at Study/Analyses/1-SummariseClinicalTables.R:59:2
  16. -omopgenerics::newSummarisedResult(...) at Study/Analyses/functions.R:657:2
  17. -omopgenerics:::validateSummariseResult(x)
  18. -omopgenerics:::checkNA(x = x, "summarised_result") Run rlang::last_trace(drop = FALSE) to see 2 hidden frames.
catalamarti commented 1 month ago

hi @dandedman, in fact the code run way more than previously. The problem is that concept_id was used twice and I only corrected in the first one. Could you please check if the new update (https://github.com/oxford-pharmacoepi/DatabaseCharacterisation/commit/b11d84ff361d166f5e66891308b4016ba94dd39b) fixed it?

dandedman commented 1 month ago

@catalamarti : great, that worked. Will try on full db now. Thanks for the help

dandedman commented 1 month ago

@catalamarti: there are almost 6 billion rows in visit_occurrence - hence the overflow with the SQL COUNT(). Needs COUNT_BIG() maybe?

source(here("RunStudy.R")) Error in dplyr::collect(): ! Failed to collect lazy table. Caused by error: ! nanodbc/nanodbc.cpp:1771: 22003 [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Arithmetic overflow error converting expression to data type int.

'SELECT COUNT(*) AS "number_records-count", COUNT(DISTINCT "person_id") AS "number_subjects-count" FROM "CDM"."visit_occurrence"' Run `rlang::last_trace()` to see where the error occurred. rlang::last_trace() Error in `dplyr::collect()`: ! Failed to collect lazy table. Caused by error: ! nanodbc/nanodbc.cpp:1771: 22003 [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Arithmetic overflow error converting expression to data type int. 'SELECT COUNT(*) AS "number_records-count", COUNT(DISTINCT "person_id") AS "number_subjects-count" FROM "CDM"."visit_occurrence"' --- Backtrace: x 1. +-base::source(...) 2. | +-base::withVisible(eval(ei, envir)) 3. | \-base::eval(ei, envir) 4. | \-base::eval(ei, envir) 5. +-base::source(here("RunStudy.R")) at DatabaseCharacterisation/Study/CodeToRun_Aurum2309.R:49:0 6. | +-base::withVisible(eval(ei, envir)) 7. | \-base::eval(ei, envir) 8. | \-base::eval(ei, envir) 9. +-base::source(here("Analyses", "1-SummariseClinicalTables.R")) at DatabaseCharacterisation/Study/RunStudy.R:50:0 10. | +-base::withVisible(eval(ei, envir)) 11. | \-base::eval(ei, envir) 12. | \-base::eval(ei, envir) 13. +-omopgenerics::bind(qualityChecks, summaryQuality(cdm[[table]])) at Study/Analyses/1-SummariseClinicalTables.R:13:2 14. +-omopgenerics:::bind.summarised_result(qualityChecks, summaryQuality(cdm[[table]])) 15. +-global summaryQuality(cdm[[table]]) at Study/Analyses/1-SummariseClinicalTables.R:13:2 16. | +-dplyr::mutate(...) at Study/Analyses/functions.R:182:2 17. | +-tidyr::pivot_longer(...) at Study/Analyses/functions.R:182:2 18. | +-dplyr::mutate(...) at Study/Analyses/functions.R:182:2 19. | +-dplyr::mutate(...) at Study/Analyses/functions.R:182:2 20. | +-dplyr::collect(...) at Study/Analyses/functions.R:182:2 21. | \-omopgenerics:::collect.cdm_table(...) 22. | +-dplyr::collect(removeClass(x, "cdm_table")) 23. | \-dbplyr:::collect.tbl_sql(removeClass(x, "cdm_table")) 24. | +-base::withCallingHandlers(...) 25. | +-dbplyr::db_collect(...) 26. | \-dbplyr:::db_collect.DBIConnection(...) 27. | +-DBI::dbSendQuery(con, sql) 28. | \-odbc::dbSendQuery(con, sql) 29. | \-odbc (local) .local(conn, statement, ...) 30. | \-odbc:::OdbcResult(...) 31. | \-odbc:::new_result(p = connection@ptr, sql = statement, immediate = immediate) 32. \-base::stop(``) Run rlang::last_trace(drop = FALSE) to see 3 hidden frames.
catalamarti commented 1 month ago

hi @dandedman so I think I have an idea to make this work, but first I would like to check something. Can you try to run this:

cdm$visit_occurrence |>
  dplyr::summarise(n = dplyr::n())

cdm$visit_occurrence |>
  dplyr::summarise(n = as.character(dplyr::n()))

If my fix works, the first code should crash, and the second should work.

dandedman commented 1 month ago

HI @catalamarti

first one fails:

> cdm$visit_occurrence |>
+     dplyr::summarise(n = dplyr::n())
Error in `dplyr::collect()`:
! Failed to collect lazy table.
Caused by error:
! nanodbc/nanodbc.cpp:1771: 22003
[Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Arithmetic overflow error converting expression to data type int. 
<SQL> 'SELECT TOP 11 COUNT(*) AS "n"
FROM "CDM"."visit_occurrence"'
Run `rlang::last_trace()` to see where the error occurred.

but so does second version:

> cdm$visit_occurrence |>
+     dplyr::summarise(n = as.character(dplyr::n()))
Error in `dplyr::collect()`:
! Failed to collect lazy table.
Caused by error:
! nanodbc/nanodbc.cpp:1771: 22003
[Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Arithmetic overflow error converting expression to data type int. 
<SQL> 'SELECT TOP 11 TRY_CAST(COUNT(*) AS VARCHAR(MAX)) AS "n"
FROM "CDM"."visit_occurrence"'
Run `rlang::last_trace()` to see where the error occurred.

hmm! I am sat in a cafe in Oxford atm waiting for our workshop to start - are you going to be there? Could we meet there at 12:30?

dandedman commented 1 month ago

I can see that SQL will fail though - the COUNT(*) will cause an overflow error in each case

catalamarti commented 1 month ago

Probably easily if we do a live debug when we see each other at 12:30

dandedman commented 1 month ago

are you at the venue already? I am about to get thrown out of here for not buying enough coffee!

catalamarti commented 1 month ago

we are in waterstones cafe, if you want to come

edward-burn commented 1 month ago

@dandedman can you try installing this fork https://github.com/oxford-pharmacoepi/dbplyr/tree/sql_server_count_big

remotes::install_github("oxford-pharmacoepi/dbplyr@sql_server_count_big")