I've received confirmation from Dataiku that this is a known limitation and there is no timeline to resolve it. Here's what was said:
We confirm that DSS does not try to interpret the new Parquet logical types, so reads the int32 and int64 as int32 and int64. You can convert the int32 version to a date with a prepare recipe using 86400 * the_int32_column and parsing this as a Unix timestamp. The int64 is likely a Unix timestamp (in milliseconds) too that you can also parse with a prepare recipe.
This is also tracked in Jira as AOBS-524.
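For reference, here is the arithmetic that suggested Prepare step performs, written as a small Python sketch (column names and values are made up for illustration; in DSS this would be done with Prepare processors rather than code):

```python
from datetime import datetime, timezone

the_int32_column = 18262          # hypothetical DATE value: days since 1970-01-01
the_int64_column = 1577836800000  # hypothetical TIMESTAMP value: milliseconds since epoch

# 86400 * days-since-epoch gives seconds since epoch, i.e. a Unix timestamp
date_value = datetime.fromtimestamp(the_int32_column * 86400, tz=timezone.utc)
# the int64 is already a Unix timestamp, just in milliseconds
timestamp_value = datetime.fromtimestamp(the_int64_column / 1000, tz=timezone.utc)

print(date_value.date())   # 2020-01-01
print(timestamp_value)     # 2020-01-01 00:00:00+00:00
```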
Snowflake tables with date and timestamp columns (with timezone or not), when synced back to Dataiku via `sync_snowflake_to_hdfs`, are imported as `int` and `bigint`, respectively. This appears to be an issue with how Dataiku reads Parquet files rather than with the plugin itself. In particular, it appears DSS expects dates and times will only be stored using the legacy and now deprecated `int96` representation. Its deprecation is described in PARQUET-323 and implemented in https://github.com/apache/parquet-format/pull/86. Snowflake's Parquet implementation outputs dates as annotated `int32` and timestamps as `int64` (as LogicalTypes). Additional information about dates, times, and Parquet is described in this article.
Snowflake's Parquet schema:
DSS's Parquet schema:
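Either schema can be dumped with `parquet-tools schema` (as shown in the replication steps below) or with PyArrow; a minimal sketch, with a placeholder file path:

```python
# Inspect a Parquet file's schema: physical types plus logical-type annotations,
# and the corresponding Arrow-level view.
import pyarrow.parquet as pq

pf = pq.ParquetFile("part_0_0_0.snappy.parquet")  # placeholder path
print(pf.schema)        # Parquet schema: int32/int64 columns with their DATE/TIMESTAMP annotations
print(pf.schema_arrow)  # Arrow view: date32 / timestamp columns
```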
Requested DSS Fix
Even if DSS continues to output `int96`, at least make sure the importer recognises and supports the more modern `int32` and `int64` representations.

Update: this is on Dataiku's backlog but not a high priority.
Possible Workarounds
Under the current implementation, a user could create a Prepare recipe that, in Dataiku's words, multiplies the `int32` column by 86400 and parses the result as a Unix timestamp (see the quoted response above).
A potential workaround would be to use PyArrow to read in Snowflake's Parquet files and write out new files using the `use_deprecated_int96_timestamps` flag on `pyarrow.parquet.write_table` (a sketch is included below).

We could also choose to convert any date/time columns to strings and then rely on users to add a Prepare recipe that parses them back to date/time.

We could also fall back to CSV rather than Parquet, but that's undesirable for many reasons. (And defeats the whole point of this plugin...)
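A minimal sketch of that PyArrow rewrite, assuming the Snowflake file name from the attached zip (the output path and compression choice are illustrative). Note that the flag only changes how timestamp columns are encoded, so `DATE` columns would still come out as annotated `int32`:

```python
# Hypothetical rewrite step: read the Snowflake-produced file and write a copy
# whose timestamp columns use the legacy int96 encoding that DSS understands.
import pyarrow.parquet as pq

table = pq.read_table("part_0_0_0.snappy.parquet")    # Snowflake output (int64 timestamps)
pq.write_table(
    table,
    "part_0_0_0.int96.snappy.parquet",                # rewritten copy for DSS
    compression="snappy",
    use_deprecated_int96_timestamps=True,             # store timestamps as int96
)
```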
How to replicate
Prereqs:
- a `STAGE` or other location to `COPY INTO` from Snowflake
- `parquet-tools` (on macOS, `brew install parquet-tools`)

You can either use Snowflake and DSS to create example Parquet files, or use the two files in the attached parquet_examples.zip file:

- `part-r-00000.snappy.parquet` -- file from Dataiku with `int96`
- `part_0_0_0.snappy.parquet` -- file from Snowflake with `int32` and `int64`
First, create a sample parquet file in Snowflake:
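One way to do this, as a sketch using snowflake-connector-python (connection parameters, table, and stage names are all placeholders):

```python
# Sketch: create a small table with DATE and TIMESTAMP columns, then unload it
# to a stage as Parquet. All identifiers and credentials below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="PUBLIC",
)
cur = conn.cursor()
cur.execute("CREATE OR REPLACE TABLE date_ts_demo (d DATE, ts TIMESTAMP_NTZ)")
cur.execute("INSERT INTO date_ts_demo VALUES ('2020-01-01', '2020-01-01 12:34:56')")
cur.execute(
    "COPY INTO @my_stage/parquet_demo/ FROM date_ts_demo "
    "FILE_FORMAT = (TYPE = PARQUET) HEADER = TRUE"
)
conn.close()
```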
Copy the file locally and inspect the schema using `parquet-tools schema`:

Also, create a "Hadoop HDFS dataset" in Dataiku that points to this file. Note that the schema gets loaded as `int` for dates and `bigint` for timestamps:

NB: we've tried picking various flavours of Parquet (Hive, Pig, and Spark) and manually setting the schema's data type to `date`, with no success.

Now, in Dataiku, create a dummy dataset with the same types. It can be Snowflake, or anything else. Example:
Add a Sync recipe with "Parquet" as the Format, and run it:
Copy the file locally and again run `parquet-tools schema` (this output also appears in the Activity Log of the sync execution):

As you can see from the Parquet schema, Dataiku outputs using `int96` for date values.