splitgraph / seafowl

Analytical database for data-driven Web applications 🪶
https://seafowl.io
Apache License 2.0
390 stars 9 forks source link

CREATE EXTERNAL TABLE with empty Parquet file fails with `Invalid data type for Delta Lake: Null` #423

Closed mildbyte closed 11 months ago

mildbyte commented 1 year ago

Using this Parquet file: issue_reactions.parquet.tar.gz (from a Splitgraph DDN export of an empty query result)

CREATE EXTERNAL TABLE tmp_pipboubhthuymqqp STORED AS PARQUET LOCATION 'path/to/issue_reactions.parquet';

results in this error:

Arrow error: Schema error: Invalid data type for Delta Lake: Null')
gruuya commented 1 year ago

Note that CREATE EXTERNAL TABLE statement goes through, the error occurs when trying to create an actual delta table (i.e. materialize it from the external table):

$ curl -i -H "Content-Type: application/json" http://127.0.0.1:8080/q -d@- <<EOF
{"query": "CREATE EXTERNAL TABLE test STORED AS PARQUET LOCATION '/Users/markogrujic/Downloads/issue_reactions.parquet'"}
EOF
HTTP/1.1 200 OK
content-type: application/octet-stream
vary: Authorization, Content-Type, Origin, X-Seafowl-Query
content-length: 0
date: Tue, 30 May 2023 08:23:21 GMT

$ curl -i -H "Content-Type: application/json" http://127.0.0.1:8080/q -d@- <<EOF
{"query": "CREATE TABLE test AS SELECT * FROM staging.test"}
EOF
HTTP/1.1 400 Bad Request
vary: Authorization, Content-Type, Origin, X-Seafowl-Query
content-length: 65
date: Tue, 30 May 2023 08:23:46 GMT

Arrow error: Schema error: Invalid data type for Delta Lake: Null

The reason is that delta-rs does not handle arrow_schema::datatype::DataType::Null; in fact, a proper type for this case seems to be lacking in the Delta protocol itself: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#schema-serialization-format

milesrichardson commented 1 year ago

It sounds like this might not be easily fixable. But is it at least detectable? I think it would be helpful if the error message contained some specific indication that the error was due to the table being empty.

Alternatively/additionally (and separately to any issue here), we could have Splitgraph forward the same error to the user to indicate the reason (atm you just see a failed export and need to check the logs). Or Splitgraph could return an error if the query is empty, before attempting to import the empty result into Seafowl.

Maybe the ideal behavior would be to replace the import with a CREATE TABLE statement that creates the table with a schema matching the (expected?) schema of the (empty) query result, since we presumably have that available at the time of exporting from Splitgraph (we have the schema for all tables, and either we're exporting a table itself with SELECT *, or we're exporting the result of some query that references tables for which we know the schema).

milesrichardson commented 1 year ago

Note that CREATE EXTERNAL TABLE statement goes through, the error occurs when trying to create an actual delta table (i.e. materialize it from the external table):

Or maybe the fix is to simply short-circuit the import process to stop after CREATE EXTERNAL TABLE, and not return an error, since technically the table was created with the correct schema, and the error happened when trying to load (non-existent) rows into it. (IIUC)

I guess this might also be a fix that should be applied to Splitgraph export code (which calls CREATE EXTERNAL TABLE) rather than to Seafowl itself?

gruuya commented 1 year ago

But is it at least detectable?

It seems it is; I could make the external table creation itself fail in such a case.

Though come to think of it, I think there's some weird compatibility problem here, since this particular file was created through pandas Python lib:

$ parquet meta /Users/markogrujic/Downloads/issue_reactions.parquet

File path:  /Users/markogrujic/Downloads/issue_reactions.parquet
Created by: parquet-cpp-arrow version 11.0.0
Properties:
        pandas: {"index_columns": [], "column_indexes": [], "columns": [{"name": "id", "field_name": "id", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "node_id", "field_name": "node_id", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "content", "field_name": "content", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "created_at", "field_name": "created_at", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "user", "field_name": "user", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "repository", "field_name": "repository", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "issue_number", "field_name": "issue_number", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "_airbyte_ab_id", "field_name": "_airbyte_ab_id", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "_airbyte_emitted_at", "field_name": "_airbyte_emitted_at", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "_airbyte_normalized_at", "field_name": "_airbyte_normalized_at", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "_airbyte_issue_reactions_hashid", "field_name": "_airbyte_issue_reactions_hashid", "pandas_type": "empty", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "11.0.0"}, "pandas_version": "1.5.3"}
  ARROW:schema: /////4AIAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAAwGAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAANUFAAB7ImluZGV4X2NvbHVtbnMiOiBbXSwgImNvbHVtbl9pbmRleGVzIjogW10sICJjb2x1bW5zIjogW3sibmFtZSI6ICJpZCIsICJmaWVsZF9uYW1lIjogImlkIiwgInBhbmRhc190eXBlIjogImVtcHR5IiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJub2RlX2lkIiwgImZpZWxkX25hbWUiOiAibm9kZV9pZCIsICJwYW5kYXNfdHlwZSI6ICJlbXB0eSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiY29udGVudCIsICJmaWVsZF9uYW1lIjogImNvbnRlbnQiLCAicGFuZGFzX3R5cGUiOiAiZW1wdHkiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogImNyZWF0ZWRfYXQiLCAiZmllbGRfbmFtZSI6ICJjcmVhdGVkX2F0IiwgInBhbmRhc190eXBlIjogImVtcHR5IiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJ1c2VyIiwgImZpZWxkX25hbWUiOiAidXNlciIsICJwYW5kYXNfdHlwZSI6ICJlbXB0eSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAicmVwb3NpdG9yeSIsICJmaWVsZF9uYW1lIjogInJlcG9zaXRvcnkiLCAicGFuZGFzX3R5cGUiOiAiZW1wdHkiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogImlzc3VlX251bWJlciIsICJmaWVsZF9uYW1lIjogImlzc3VlX251bWJlciIsICJwYW5kYXNfdHlwZSI6ICJlbXB0eSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiX2FpcmJ5dGVfYWJfaWQiLCAiZmllbGRfbmFtZSI6ICJfYWlyYnl0ZV9hYl9pZCIsICJwYW5kYXNfdHlwZSI6ICJlbXB0eSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiX2FpcmJ5dGVfZW1pdHRlZF9hdCIsICJmaWVsZF9uYW1lIjogIl9haXJieXRlX2VtaXR0ZWRfYXQiLCAicGFuZGFzX3R5cGUiOiAiZW1wdHkiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIl9haXJieXRlX25vcm1hbGl6ZWRfYXQiLCAiZmllbGRfbmFtZSI6ICJfYWlyYnl0ZV9ub3JtYWxpemVkX2F0IiwgInBhbmRhc190eXBlIjogImVtcHR5IiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJfYWlyYnl0ZV9pc3N1ZV9yZWFjdGlvbnNfaGFzaGlkIiwgImZpZWxkX25hbWUiOiAiX2FpcmJ5dGVfaXNzdWVfcmVhY3Rpb25zX2hhc2hpZCIsICJwYW5kYXNfdHlwZSI6ICJlbXB0eSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjExLjAuMCJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMS41LjMifQAAAAsAAAAYAgAA3AEAALABAACAAQAAVAEAACQBAADwAAAAvAAAAIQAAABIAAAABAAAACT+//8AAAEBEAAAADAAAAAEAAAAAAAAAB8AAABfYWlyYnl0ZV9pc3N1ZV9yZWFjdGlvbnNfaGFzaGlkADD+//9k/v//AAABARAAAAAoAAAABAAAAAAAAAAWAAAAX2FpcmJ5dGVfbm9ybWFsaXplZF9hdAAAaP7//5z+//8AAAEBEAAAACQAAAAEAAAAAAAAABMAAABfYWlyYnl0ZV9lbWl0dGVkX2F0AJz+///Q/v//AAABARAAAAAgAAAABAAAAAAAAAAOAAAAX2FpcmJ5dGVfYWJfaWQAAMz+//8A////AAABARAAAAAgAAAABAAAAAAAAAAMAAAAaXNzdWVfbnVtYmVyAAAAAPz+//8w////AAABARAAAAAcAAAABAAAAAAAAAAKAAAAcmVwb3NpdG9yeQAAKP///1z///8AAAEBEAAAABgAAAAEAAAAAAAAAAQAAAB1c2VyAAAAAFD///+E////AAABARAAAAAcAAAABAAAAAAAAAAKAAAAY3JlYXRlZF9hdAAAfP///7D///8AAAEBEAAAABgAAAAEAAAAAAAAAAcAAABjb250ZW50AKT////Y////AAABARAAAAAYAAAABAAAAAAAAAAHAAAAbm9kZV9pZADM////EAAUAAgABgAHAAwAAAAQABAAAAAAAAEBEAAAABgAAAAEAAAAAAAAAAIAAABpZAAABAAEAAQAAAAAAAAA
Schema:
message schema {
  optional int32 id;
  optional int32 node_id;
  optional int32 content;
  optional int32 created_at;
  optional int32 user;
  optional int32 repository;
  optional int32 issue_number;
  optional int32 _airbyte_ab_id;
  optional int32 _airbyte_emitted_at;
  optional int32 _airbyte_normalized_at;
  optional int32 _airbyte_issue_reactions_hashid;
}

Row group 0:  count: 0  Infinity TB records  start: 0  total(compressed): 253 B total(uncompressed):154 B
--------------------------------------------------------------------------------
                                 type      encodings count     avg size   nulls   min / max
id                               INT32     Z _       0         Infinity TB
node_id                          INT32     Z _       0         Infinity TB
content                          INT32     Z _       0         Infinity TB
created_at                       INT32     Z _       0         Infinity TB
user                             INT32     Z _       0         Infinity TB
repository                       INT32     Z _       0         Infinity TB
issue_number                     INT32     Z _       0         Infinity TB
_airbyte_ab_id                   INT32     Z _       0         Infinity TB
_airbyte_emitted_at              INT32     Z _       0         Infinity TB
_airbyte_normalized_at           INT32     Z _       0         Infinity TB
_airbyte_issue_reactions_hashid  INT32     Z _       0         Infinity TB

Note that the Parquet schema here is present, and the columns are not NULL types. The "pandas_type": "empty" may be introducing some confusion (maybe at arrow-rs level).

gruuya commented 1 year ago

Just debuged in a more detailed manner; what happens is that the ARROW:schema: gets base64 decoded by parquet crate, and it basically contains the same data as shown in pandas: above it. In turn arrow-ipc decides that the column type is Null, which is the route of the issue.

Maybe pandas makes an erroneous encoding somehow (worth seeing whether we can bump the version in Splitgraph). Alternatively, arrow-ipc/parquet should use other information in such case to determine the column type (parquet has access to an object that has the proper schema but it never uses it).

milesrichardson commented 1 year ago

Thanks for debugging that. Your comment makes sense.

I'm in favor of bumping the version of pandas in Splitgraph and seeing if the bug is still present.

gruuya commented 11 months ago

Mitigated by back-porting a fix from pandas 2.x over at Splitgraph.