Eligibility seed file not loading in Databricks because of datetime data type issue

tuva-health / tuva_demo

A starter dbt project and synthetic claims dataset for trying out the Tuva Project.

https://thetuvaproject.com/

Apache License 2.0


Eligibility seed file not loading in Databricks because of datetime data type issue #43

Open cocozuloaga opened 3 months ago

cocozuloaga commented 3 months ago

Describe the bug: The eligibility seed file does not load in Databricks because the DATETIME data type is not supported.

To Reproduce: Run Tuva Demo on Databricks.

Screenshot 2024-07-29 at 10 26 45 AM

yubinmimi commented 3 months ago

console_output (1).txt

I changed input_datetime to the timestamp type in the yaml file, but I'm still getting the error. The log file is attached. Thanks so much!

yubinmimi commented 3 months ago

It seems like this line in the log has something to do with it, but I can't figure out much more:

: [CSV_ENFORCE_SCHEMA_NOT_SUPPORTED] The CSV option enforceSchema cannot be set when using rescuedDataColumn or failOnUnknownFields, as columns are read by name rather than ordinal. SQLSTATE: 0A000

yubinmimi commented 3 months ago

So... @cocozuloaga I made it work with some hacks, but I guess this is just a temporary solution.

I edited the_tuva_project.macros.load_seed.sql and set 'enforceSchema' = 'false' there (line 194). I also changed the datetime column types for the other tables, e.g. lab_result and observation, so that they detect "databricks" and use timestamp instead.
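For anyone hitting the same error, the type change might look roughly like the fragment below. This is only a sketch under assumptions, not the fix the maintainers will ship: the project key `my_project` and the seed name `eligibility` are placeholders, and the column name is the one mentioned earlier in this thread, so adjust all three to match your project. It relies on dbt's `+column_types` seed config and the `target.type` value, which dbt renders when it parses the project file.

```yaml
# dbt_project.yml (fragment): project key and seed name are hypothetical
seeds:
  my_project:
    eligibility:
      +column_types:
        # Databricks rejects DATETIME, so use TIMESTAMP on that target only
        input_datetime: "{{ 'timestamp' if target.type == 'databricks' else 'datetime' }}"
```

Keeping the condition on `target.type` leaves other warehouses on their original type, so the change should only affect Databricks runs. It does not address the enforceSchema error, which still needed the macro edit described above.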

After that, all the synthetic datasets loaded into Databricks beautifully.

cocozuloaga commented 3 months ago

@yubinmimi glad that worked and you were able to load the seeds! We'll test in our Databricks environment as soon as we have it ready and push a fix to tuva_demo so that the hack you described is not necessary. Thanks for the feedback!