philszep opened this issue 11 months ago
I ran into the same issue today and made an upstream issue in the delta-rs repo: https://github.com/delta-io/delta-rs/issues/1528
I thought about posting an issue in delta-rs as well, but I thought I saw some issues there about adding support for arrow `LargeUtf8` and other data types, so I assumed they are at least thinking about addressing it already. I also felt it is the duty of the application writing the data to ensure schema consistency on read/write.
The delta transaction protocol doesn't distinguish between `Utf8` and `LargeUtf8` types -- string columns in parquet files are just byte arrays of arbitrary size anyway. So the issue is with reading and writing a delta table to/from arrow format, which does distinguish `Utf8` vs `LargeUtf8`.

I'm not familiar enough with the delta-rs implementation, but perhaps there is a solution in which delta-rs requires an explicit schema when translating a delta table to arrow format, so that `Utf8` and `LargeUtf8` are both treated as aliases for the delta table's `string` type. The files defining the delta table then don't themselves distinguish `Utf8` vs `LargeUtf8` (both are just the `string` type), but the application can still specify on read which arrow type it needs -- this way the delta tables remain consistent with the protocol. As it stands, I don't think the delta-rs library supports reading delta tables that contain fields requiring the arrow `LargeUtf8` datatype.
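To make that concrete, here is a minimal sketch (my illustration, assuming the deltalake Python bindings, where `Schema.from_pyarrow` and `to_json` are available) showing that both arrow string flavours collapse to the same protocol-level `string` type:

```python
import pyarrow as pa
from deltalake import Schema  # deltalake = the delta-rs Python bindings

# Two arrow schemas that differ only in the string flavour.
small = Schema.from_pyarrow(pa.schema([pa.field("col", pa.string())]))
large = Schema.from_pyarrow(pa.schema([pa.field("col", pa.large_string())]))

# Both serialize to the same Delta schema: the "col" field is plain "string".
print(small.to_json())
print(large.to_json())
assert small.to_json() == large.to_json()
```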
I have encountered the same issue. I first wrote a delta table to S3 with the following params:
data_to_write.write_delta(
    target=s3_location,
    mode="error",
    storage_options={
        "AWS_REGION": self.region_name,
        "AWS_ACCESS_KEY_ID": self.boto_session.get_credentials().access_key,
        "AWS_SECRET_ACCESS_KEY": self.boto_session.get_credentials().secret_key,
        "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    },
    overwrite_schema=True,
    delta_write_options={
        "partition_by": [
            "ingested_at_year",
            "ingested_at_month",
            "ingested_at_day",
            "ingested_at_hour",
        ],
        "name": "raw_events",
        "description": "Events loaded from source bucket",
    },
)
On the next run, it fails with the following error:
E ValueError: Schema of data does not match table schema
E Table schema:
E obj_key: large_string
E data: large_string
E ingested_at: timestamp[us, tz=UTC]
E ingested_at_year: int32
E ingested_at_month: uint32
E ingested_at_day: uint32
E ingested_at_hour: uint32
E ingested_at_minute: uint32
E ingested_at_second: uint32
E Data Schema:
E obj_key: string
E data: string
E ingested_at: timestamp[us]
E ingested_at_year: int32
E ingested_at_month: int32
E ingested_at_day: int32
E ingested_at_hour: int32
E ingested_at_minute: int32
E ingested_at_second: int32
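The failing comparison can be reproduced with pyarrow alone. A minimal sketch (purely illustrative; the actual check lives inside polars/deltalake, and the sample value is made up) showing two of the mismatching fields, and that an explicit cast aligns them:

```python
from datetime import datetime

import pyarrow as pa

# Schema recorded for the existing table (large_string, tz-aware timestamp).
table_schema = pa.schema(
    [
        pa.field("obj_key", pa.large_string()),
        pa.field("ingested_at", pa.timestamp("us", tz="UTC")),
    ]
)

# Schema of the new batch (plain string, naive timestamp).
data = pa.table(
    {"obj_key": ["events/1.json"], "ingested_at": [datetime(2024, 1, 1)]},
    schema=pa.schema(
        [pa.field("obj_key", pa.string()), pa.field("ingested_at", pa.timestamp("us"))]
    ),
)

print(data.schema.equals(table_schema))                     # False -> "Schema of data does not match"
print(data.cast(table_schema).schema.equals(table_schema))  # True after an explicit cast
```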
I haven't found a solution so far.
@philszep I think you can close it here. It's going to be fixed upstream.
Same issue here, btw. Do you know when it will be fixed upstream?
@edgBR Actually this is a different issue. Can you create one upstream? Then I will look at it; it's probably a trivial fix.
Ran into a similar issue implementing write support for Iceberg (#15018)
Example to reproduce: a simple Polars dataframe:
import polars as pl

df = pl.DataFrame(
    {
        "foo": [1, 2, 3, 4, 5],
        "bar": [6, 7, 8, 9, 10],
        "ham": ["a", "b", "c", "d", "e"],
    }
)
Dataframe schema:
> df.schema
OrderedDict([('foo', Int64), ('bar', Int64), ('ham', String)])
Arrow schema:
> df.to_arrow().schema
foo: int64
bar: int64
ham: large_string
`.to_arrow()` casting `string` to `large_string` is causing a schema mismatch when the parquet writer writes. Not sure why the type is `large_string` when casting to Arrow. The `pyarrow.large_string` doc says: "This data type may not be supported by all Arrow implementations. Unless you need to represent data larger than 2GB, you should prefer string()."
@kevinjqliu I resolved it upstream in delta-rs with the `large_dtypes` parameter.
Thanks @ion-elgreco I'll take a look at Iceberg's schema handling
@kevinjqliu actually I may even be able to let go of this parameter in delta-rs if I just always convert to the lower (non-large) types for the schema check :p
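For reference, a sketch of how that parameter can be passed through from Polars (my example; it assumes a deltalake version whose `write_deltalake` still accepts `large_dtypes`, and the path and column name are made up):

```python
import polars as pl

df = pl.DataFrame({"ham": ["a", "b", "c"]})

# delta_write_options is forwarded to deltalake.write_deltalake, so the flag
# reaches delta-rs and the arrow large types no longer trip the schema check.
df.write_delta(
    "path/to/table",
    mode="append",
    delta_write_options={"large_dtypes": True},
)
```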
There seem to be 2 issues to me:

1. Handle the `large_string` type gracefully (`string` -> `large_string`)
2. Looks like in PyIceberg, we're casting `large_string` into `string` type (link); I'll open an issue for that.

I have no idea why Polars defaults to `large_string` when converting to Arrow (link).
Polars isn't changing from `string` to `large_string` when it converts to arrow. It doesn't use `string`; it only uses `large_string`, so for brevity it simply names its own dtype `String` even though it is backed by arrow's `large_string`.
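A quick way to see this (a sketch; it assumes a Polars version from around this thread, where the conversion always produces large offsets):

```python
import polars as pl
import pyarrow as pa

s = pl.Series("ham", ["a", "b", "c"])
print(s.dtype)            # String -- Polars' own name for the dtype
print(s.to_arrow().type)  # large_string -- the arrow type backing it

# If a consumer needs the 32-bit-offset arrow string, cast on the arrow side:
tbl = pl.DataFrame({"ham": ["a", "b", "c"]}).to_arrow()
small = tbl.cast(pa.schema([pa.field("ham", pa.string())]))
print(small.schema)       # ham: string
```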
Checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Outputs a `DeltaError`.

In this case, if you look at the delta table, it has two parquet files. In the first parquet file the `this` field is of type `large_string`, whereas in the second the `this` field is of type `string`.

Issue description
There is an invalid schema generated when creating a new delta table. This has to do with delta lake not distinguishing between the arrow datatypes `Utf8` and `LargeUtf8`. I believe this is caused by lines 3307-3314 of frame.py; see pull request #7616. There, the code relies on an existing table to fix the schema to be consistent with a delta table schema. To remedy this we can cast the `data.schema` object to a deltalake schema object and back; I think if we replace the code in frame.py referenced above with such a round trip, then the problem will be resolved for any table that is created.
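A rough sketch of that round trip (my illustration, not the snippet from the original report; it assumes the deltalake package exposes `Schema.from_pyarrow` and `Schema.to_pyarrow`, as recent versions do):

```python
import pyarrow as pa
from deltalake import Schema


def normalize_arrow_schema(arrow_schema: pa.Schema) -> pa.Schema:
    # Round-trip through the Delta Lake schema so Utf8 and LargeUtf8 both
    # collapse to the protocol's plain "string" type before writing.
    return Schema.from_pyarrow(arrow_schema).to_pyarrow()


# e.g. before handing the arrow data to the delta writer:
# data = data.cast(normalize_arrow_schema(data.schema))
```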
Expected behavior
New delta table created with valid deltalake schema.
Installed versions