Bonnevie opened 1 month ago
Here are the casts we do internally:

```python
import pyarrow as pa

from pymongoarrow.api import find_arrow_all

try:
    import polars as pl
except ImportError:
    pl = None


def _cast_away_extension_type(field: pa.Field) -> pa.Field:
    # Recursively replace any ExtensionType (including inside structs and
    # lists) with its underlying storage type.
    if isinstance(field.type, pa.ExtensionType):
        field_without_extension = pa.field(field.name, field.type.storage_type)
    elif isinstance(field.type, pa.StructType):
        field_without_extension = pa.field(
            field.name,
            pa.struct([_cast_away_extension_type(nested_field) for nested_field in field.type]),
        )
    elif isinstance(field.type, pa.ListType):
        field_without_extension = pa.field(
            field.name, pa.list_(_cast_away_extension_type(field.type.value_field))
        )
    else:
        field_without_extension = field
    return field_without_extension


def _arrow_to_polars(arrow_table: pa.Table):
    """Helper function that converts an Arrow Table to a Polars DataFrame.

    Note: Polars lacks ExtensionTypes. We cast them to their base Arrow types.
    """
    if pl is None:
        msg = "polars is not installed. Try pip install polars."
        raise ValueError(msg)
    schema_without_extensions = pa.schema(
        [_cast_away_extension_type(field) for field in arrow_table.schema]
    )
    arrow_table_without_extensions = arrow_table.cast(schema_without_extensions)
    return pl.from_arrow(arrow_table_without_extensions)


def find_polars_all(collection, query, *, schema=None, **kwargs):
    return _arrow_to_polars(find_arrow_all(collection, query, schema=schema, **kwargs))
```
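
As a quick illustration of what `_cast_away_extension_type` does, here is a self-contained sketch; `DemoUuidType` is a made-up extension type used only for this demo, not anything from PyMongoArrow:

```python
import pyarrow as pa

# Hypothetical extension type, defined only to exercise the cast helper.
class DemoUuidType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.binary(16), "demo.uuid")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

field = pa.field("id", DemoUuidType())
print(_cast_away_extension_type(field).type)  # -> fixed_size_binary[16]
```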
We could offer a feature to customize the BSON type conversion so that a cast wouldn't be needed.
Hi @ShaneHarvey, amazing, thanks for the snippet!
Just tried to check what I get for an ObjectId extension type: if I look at `field.type.storage_type` I get `pa.binary(12)`, as far as I can tell. But if I try to use a schema with `{"_id": pa.binary(12)}`, I get the following error:

```
Unsupported data type in schema for field "_id" of type "fixed_size_binary[12]"
```

Is that expected? Should I not be able to specify a schema that casts an ObjectId field directly to its storage type?
This might be what you were getting at with your final comment.
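
For reference, this is roughly what I tried (`collection` is a placeholder for my collection handle):

```python
import pyarrow as pa
from pymongoarrow.api import Schema, find_arrow_all

# Try to map _id straight to the ObjectId storage type via the schema.
schema = Schema({"_id": pa.binary(12)})
table = find_arrow_all(collection, {}, schema=schema)
# -> Unsupported data type in schema for field "_id" of type
#    "fixed_size_binary[12]"
```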
That is expected, as we have not implemented direct support for `pa.binary()` yet (https://jira.mongodb.org/browse/ARROW-214). Looking into it more, I believe it should be more straightforward to implement support for fixed size binary, so I opened a new ticket here: https://jira.mongodb.org/browse/ARROW-251
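
In the meantime, one possible workaround (a sketch; `collection` is a placeholder) is to keep the default schema and cast the returned column down to its storage type after retrieval, since Arrow allows casting an extension array to its storage type:

```python
from pymongoarrow.api import find_arrow_all

table = find_arrow_all(collection, {})
idx = table.schema.get_field_index("_id")
storage = table.schema.field(idx).type.storage_type  # fixed_size_binary[12]
table = table.set_column(idx, "_id", table.column("_id").cast(storage))
```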
I have a use-case where I want to extract data as an Arrow table, save it as Parquet, and then later load it with Polars. My problem is that I cannot figure out how to avoid extension types, in particular for ObjectId. The status quo is that the Parquet file gets stored with the extension type, which Polars cannot read. PyMongoArrow functions like `find_polars_all` somehow manage this casting, but can it be achieved with `find_arrow_all`? Is it possible to specify a datatype for ObjectId fields in the pymongoarrow schema, so that it gets recorded in the Arrow table and Parquet file as a binary data type that Polars can read out of the box?
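
For completeness, a minimal end-to-end sketch of the Parquet round trip, reusing the `_cast_away_extension_type` helper from the snippet above (`collection` and the file path are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq
import polars as pl
from pymongoarrow.api import find_arrow_all

table = find_arrow_all(collection, {})

# Replace every ExtensionType (e.g. ObjectId) with its storage type so the
# Parquet file holds plain fixed_size_binary columns that Polars can read.
plain_schema = pa.schema(
    [_cast_away_extension_type(field) for field in table.schema]
)
pq.write_table(table.cast(plain_schema), "dump.parquet")

df = pl.read_parquet("dump.parquet")
```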