Closed mildbyte closed 11 months ago
Note that CREATE EXTERNAL TABLE
statement goes through, the error occurs when trying to create an actual delta table (i.e. materialize it from the external table):
$ curl -i -H "Content-Type: application/json" http://127.0.0.1:8080/q -d@- <<EOF
{"query": "CREATE EXTERNAL TABLE test STORED AS PARQUET LOCATION '/Users/markogrujic/Downloads/issue_reactions.parquet'"}
EOF
HTTP/1.1 200 OK
content-type: application/octet-stream
vary: Authorization, Content-Type, Origin, X-Seafowl-Query
content-length: 0
date: Tue, 30 May 2023 08:23:21 GMT
$ curl -i -H "Content-Type: application/json" http://127.0.0.1:8080/q -d@- <<EOF
{"query": "CREATE TABLE test AS SELECT * FROM staging.test"}
EOF
HTTP/1.1 400 Bad Request
vary: Authorization, Content-Type, Origin, X-Seafowl-Query
content-length: 65
date: Tue, 30 May 2023 08:23:46 GMT
Arrow error: Schema error: Invalid data type for Delta Lake: Null
The reason is that delta-rs does not handle arrow_schema::datatype::DataType::Null
; in fact, a proper type for this case seems to be lacking in the Delta protocol itself: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#schema-serialization-format
It sounds like this might not be easily fixable. But is it at least detectable? I think it would be helpful if the error message contained some specific indication that the error was due to the table being empty.
Alternatively/additionally (and separately to any issue here), we could have Splitgraph forward the same error to the user to indicate the reason (atm you just see a failed export and need to check the logs). Or Splitgraph could return an error if the query is empty, before attempting to import the empty result into Seafowl.
Maybe the ideal behavior would be to replace the import with a CREATE TABLE
statement that creates the table with a schema matching the (expected?) schema of the (empty) query result, since we presumably have that available at the time of exporting from Splitgraph (we have the schema for all tables, and either we're exporting a table itself with SELECT *
, or we're exporting the result of some query that references tables for which we know the schema).
Note that CREATE EXTERNAL TABLE statement goes through, the error occurs when trying to create an actual delta table (i.e. materialize it from the external table):
Or maybe the fix is to simply short-circuit the import process to stop after CREATE EXTERNAL TABLE
, and not return an error, since technically the table was created with the correct schema, and the error happened when trying to load (non-existent) rows into it. (IIUC)
I guess this might also be a fix that should be applied to Splitgraph export code (which calls CREATE EXTERNAL TABLE
) rather than to Seafowl itself?
But is it at least detectable?
It seems it is; I could make the external table creation itself fail in such a case.
Though come to think of it, I think there's some weird compatibility problem here, since this particular file was created through pandas Python lib:
$ parquet meta /Users/markogrujic/Downloads/issue_reactions.parquet
File path: /Users/markogrujic/Downloads/issue_reactions.parquet
Created by: parquet-cpp-arrow version 11.0.0
Properties:
pandas: {"index_columns": [], "column_indexes": [], "columns": [{"name": "id", "field_name": "id", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "node_id", "field_name": "node_id", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "content", "field_name": "content", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "created_at", "field_name": "created_at", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "user", "field_name": "user", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "repository", "field_name": "repository", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "issue_number", "field_name": "issue_number", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "_airbyte_ab_id", "field_name": "_airbyte_ab_id", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "_airbyte_emitted_at", "field_name": "_airbyte_emitted_at", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "_airbyte_normalized_at", "field_name": "_airbyte_normalized_at", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "_airbyte_issue_reactions_hashid", "field_name": "_airbyte_issue_reactions_hashid", "pandas_type": "empty", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "11.0.0"}, "pandas_version": "1.5.3"}
ARROW:schema: /////4AIAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAAwGAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAANUFAAB7ImluZGV4X2NvbHVtbnMiOiBbXSwgImNvbHVtbl9pbmRleGVzIjogW10sICJjb2x1bW5zIjogW3sibmFtZSI6ICJpZCIsICJmaWVsZF9uYW1lIjogImlkIiwgInBhbmRhc190eXBlIjogImVtcHR5IiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJub2RlX2lkIiwgImZpZWxkX25hbWUiOiAibm9kZV9pZCIsICJwYW5kYXNfdHlwZSI6ICJlbXB0eSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiY29udGVudCIsICJmaWVsZF9uYW1lIjogImNvbnRlbnQiLCAicGFuZGFzX3R5cGUiOiAiZW1wdHkiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogImNyZWF0ZWRfYXQiLCAiZmllbGRfbmFtZSI6ICJjcmVhdGVkX2F0IiwgInBhbmRhc190eXBlIjogImVtcHR5IiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJ1c2VyIiwgImZpZWxkX25hbWUiOiAidXNlciIsICJwYW5kYXNfdHlwZSI6ICJlbXB0eSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAicmVwb3NpdG9yeSIsICJmaWVsZF9uYW1lIjogInJlcG9zaXRvcnkiLCAicGFuZGFzX3R5cGUiOiAiZW1wdHkiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogImlzc3VlX251bWJlciIsICJmaWVsZF9uYW1lIjogImlzc3VlX251bWJlciIsICJwYW5kYXNfdHlwZSI6ICJlbXB0eSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiX2FpcmJ5dGVfYWJfaWQiLCAiZmllbGRfbmFtZSI6ICJfYWlyYnl0ZV9hYl9pZCIsICJwYW5kYXNfdHlwZSI6ICJlbXB0eSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiX2FpcmJ5dGVfZW1pdHRlZF9hdCIsICJmaWVsZF9uYW1lIjogIl9haXJieXRlX2VtaXR0ZWRfYXQiLCAicGFuZGFzX3R5cGUiOiAiZW1wdHkiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIl9haXJieXRlX25vcm1hbGl6ZWRfYXQiLCAiZmllbGRfbmFtZSI6ICJfYWlyYnl0ZV9ub3JtYWxpemVkX2F0IiwgInBhbmRhc190eXBlIjogImVtcHR5IiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJfYWlyYnl0ZV9pc3N1ZV9yZWFjdGlvbnNfaGFzaGlkIiwgImZpZWxkX25hbWUiOiAiX2FpcmJ5dGVfaXNzdWVfcmVhY3Rpb25zX2hhc2hpZCIsICJwYW5kYXNfdHlwZSI6ICJlbXB0eSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjExLjAuMCJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMS41LjMifQAAAAsAAAAYAgAA3AEAALABAACAAQAAVAEAACQBAADwAAAAvAAAAIQAAABIAAAABAAAACT+//8AAAEBEAAAADAAAAAEAAAAAAAAAB8AAABfYWlyYnl0ZV9pc3N1ZV9yZWFjdGlvbnNfaGFzaGlkADD+//9k/v//AAABARAAAAAoAAAABAAAAAAAAAAWAAAAX2FpcmJ5dGVfbm9ybWFsaXplZF9hdAAAaP7//5z+//8AAAEBEAAAACQAAAAEAAAAAAAAABMAAABfYWlyYnl0ZV9lbWl0dGVkX2F0AJz+///Q/v//AAABARAAAAAgAAAABAAAAAAAAAAOAAAAX2FpcmJ5dGVfYWJfaWQAAMz+//8A////AAABARAAAAAgAAAABAAAAAAAAAAMAAAAaXNzdWVfbnVtYmVyAAAAAPz+//8w////AAABARAAAAAcAAAABAAAAAAAAAAKAAAAcmVwb3NpdG9yeQAAKP///1z///8AAAEBEAAAABgAAAAEAAAAAAAAAAQAAAB1c2VyAAAAAFD///+E////AAABARAAAAAcAAAABAAAAAAAAAAKAAAAY3JlYXRlZF9hdAAAfP///7D///8AAAEBEAAAABgAAAAEAAAAAAAAAAcAAABjb250ZW50AKT////Y////AAABARAAAAAYAAAABAAAAAAAAAAHAAAAbm9kZV9pZADM////EAAUAAgABgAHAAwAAAAQABAAAAAAAAEBEAAAABgAAAAEAAAAAAAAAAIAAABpZAAABAAEAAQAAAAAAAAA
Schema:
message schema {
optional int32 id;
optional int32 node_id;
optional int32 content;
optional int32 created_at;
optional int32 user;
optional int32 repository;
optional int32 issue_number;
optional int32 _airbyte_ab_id;
optional int32 _airbyte_emitted_at;
optional int32 _airbyte_normalized_at;
optional int32 _airbyte_issue_reactions_hashid;
}
Row group 0: count: 0 Infinity TB records start: 0 total(compressed): 253 B total(uncompressed):154 B
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
id INT32 Z _ 0 Infinity TB
node_id INT32 Z _ 0 Infinity TB
content INT32 Z _ 0 Infinity TB
created_at INT32 Z _ 0 Infinity TB
user INT32 Z _ 0 Infinity TB
repository INT32 Z _ 0 Infinity TB
issue_number INT32 Z _ 0 Infinity TB
_airbyte_ab_id INT32 Z _ 0 Infinity TB
_airbyte_emitted_at INT32 Z _ 0 Infinity TB
_airbyte_normalized_at INT32 Z _ 0 Infinity TB
_airbyte_issue_reactions_hashid INT32 Z _ 0 Infinity TB
Note that the Parquet schema here is present, and the columns are not NULL types. The "pandas_type": "empty"
may be introducing some confusion (maybe at arrow-rs
level).
Just debuged in a more detailed manner; what happens is that the ARROW:schema:
gets base64 decoded by parquet
crate, and it basically contains the same data as shown in pandas:
above it. In turn arrow-ipc
decides that the column type is Null, which is the route of the issue.
Maybe pandas makes an erroneous encoding somehow (worth seeing whether we can bump the version in Splitgraph). Alternatively, arrow-ipc
/parquet
should use other information in such case to determine the column type (parquet
has access to an object that has the proper schema but it never uses it).
Thanks for debugging that. Your comment makes sense.
I'm in favor of bumping the version of pandas in Splitgraph and seeing if the bug is still present.
Using this Parquet file: issue_reactions.parquet.tar.gz (from a Splitgraph DDN export of an empty query result)
results in this error: