Error in ParallelLogger::clusterApply()

SamuelFan1215 commented 1 month ago

Hello! I faced an error while trying to run the study. I have successfully runrenv:restore() and have results for CohortGeneratorModule, CohortIncidenceModule, and CharacterizationModule. However, I faced this error while executing the CohortMethodModule step.

Error in ParallelLogger::clusterApply(cluster, objectsToCreate, doCreateCmDataObject) : 
  Error(s) when calling function 'fun', see earlier messages for details

And here is the detailed warning message:

Error:
java.sql.SQLException: [Databricks][JDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: 22018, Query: CREATE TAB***, Error message from Server: org.apache.hive.service.cli.HiveSQLException: Error running query: [CAST_INVALID_INPUT] org.apache.spark.SparkDateTimeException: [CAST_INVALID_INPUT] The value '20171201' of the type "STRING" cannot be cast to "DATE" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. If necessary set "ansi_mode" to "false" to bypass this error. SQLSTATE: 22018
== SQL (line 63, position 26) ==
AND cohort_start_date >= CAST('20171201' AS DATE) 
                         ^^^^^^^^^^^^^^^^^^^^^^^^

Any thought on this is helpful. Thank you!

anthonysena commented 1 month ago

Hi can you confirm the DB platform you are using? I see from the error that you are using [Databricks][JDBCDriver] but I'm having a bit of trouble reproducing this error on a DataBricks environment I have access to.

SamuelFan1215 commented 1 month ago

Hi Anthony! This is Ruochong Fan from WashU. We are using Databricks and spark SQL. Hope this gives more information. Thank you!

anthonysena commented 1 month ago

Hi Ruochong! Thanks for confirming. I'm not really sure why you are encountering this error but I am seeing some differences in the way that dates may be handled in Databricks on my side. For example:

select  CAST('20171201' AS DATE) test1,  CAST('2017-12-01' AS DATE) test2;

returns:

Noting that this is on DataBricks version 3.5.0 per this query:

select version();

Perhaps you could run the same queries on your side and we can compare?

SamuelFan1215 commented 1 month ago

Hi Anthony! I ran the sample queries and below are the results: The codes select CAST('20171201' AS DATE) test1, CAST('2017-12-01' AS DATE) test2; Actually returns an error shown below.

[CAST_INVALID_INPUT] The value '20171201' of the type "STRING" cannot be cast to "DATE" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. If necessary set "ansi_mode" to "false" to bypass this error. SQLSTATE: 22018
== SQL (line 1, position 9) ==
select  CAST('20171201' AS DATE) test1,  CAST('2017-12-01' AS DATE) te...
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To get the correct date in spark SQL, I think I have to do this select CAST('20171201' AS DATE) test1, CAST('2017-12-01' AS DATE) test2;

The Databricks version seems to be the same select version();

Thank you so much for the help!

anthonysena commented 1 month ago

Thanks for sharing this info. Interesting that we're seeing different behavior when issuing the query. I did have one question - how did you change the query to return the 2 valid dates? From your previous message:

To get the correct date in spark SQL, I think I have to do this select CAST('20171201' AS DATE) test1, CAST('2017-12-01' AS DATE) test2;

This query appears to be identical to what I had originally sent so just wanted to ask.

SamuelFan1215 commented 1 month ago

Sorry for the confusion! I pasted the wrong query when writing back to you... The query I used to get the valid dates is SELECT TO_DATE('20171201', 'yyyyMMdd') AS test1, CAST('2017-12-01' AS DATE) AS test2;. The sample query you provided generates an error (not even null). I think the problem is CAST('20171201' AS DATE) test1 part.

Also, I tried to change start and end dates in the fullStudyAnalysisSpecification.JSON file and execute the analyses. I changed the start and end dates in studyWindow and getDbCohortMethodDataArgs: (eg. from "20171201" to "2017-12-01"), but the codes still return an error:

Thread 1 returns error: "Date: 2017-12-01 (2017--1-2-) is not valid" when using argument(s): x = list(connectionDetails = list(dbms = "spark", extraSettings = NULL, oracleDriver = "thin", pathToDriver = "C:/Users/fanr/Desktop/Project/DatabricksJDBC42-2.6.36.1062/DatabricksJDBC42-2.6.36.1062", user = function () nrlang::eval_tidy(userExpre..., x = C:\Users\fanr\Git_Project\SemaglutideNaion\results\WUSM_Datalake\strategusWork\CohortMethodModule/CmData_l1_t17793_c17796.zip

Thank you for your careful and detailed answers!

anthonysena commented 1 month ago

Thanks @SamuelFan1215 for this information it is very helpful! Good attempt to hack the JSON but CohortMethod is going to prevent you from passing that the date in as you saw. I think what is needed is to create a patch for this in SqlRender and then I will provide you with some instructions to install this in your environment. Give me some time to put this together and will update this issue. Thanks!

anthonysena commented 1 month ago

OK - the patch is available and seems to work well from my quick testing. To install this patch, run the following in your Semaglutide R project session:

remotes::install_github("OHDSI/SqlRender", ref="issue-370-databricks-date-cast")

This will replace the version of SqlRender with this patched version. You can then re-attempt to run CohortMethod and hopefully this will allow you to get past the error. Thanks!

SamuelFan1215 commented 1 month ago

Thank you so much Anthony! It solves the date error!

ohdsi-studies / SemaglutideNaion

Error in ParallelLogger::clusterApply() #2