mongodb-labs / mongo-arrow

MongoDB integrations for Apache Arrow. Export MongoDB documents to numpy arrays, parquet files, and pandas dataframes in one line of code.
https://mongo-arrow.readthedocs.io
Apache License 2.0

Deactivating extension types or making ObjectId mapped type polars-compliant? #236

Open · Bonnevie opened 1 month ago

Bonnevie commented 1 month ago

I have a use case where I want to extract data as an Arrow table, save it as parquet, and later load it with polars. My problem is that I cannot figure out how to avoid extension types, in particular for ObjectId. As things stand, the parquet file is written with the extension type, which polars cannot read. pymongoarrow functions like find_polars_all manage this cast somehow, but can the same be achieved with find_arrow_all?

Is it possible to specify a data type for ObjectId fields in the pymongoarrow schema, so that the field is recorded in the Arrow table and parquet file as a binary type that polars can read out of the box?
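For concreteness, a rough sketch of the round-trip I'm after (coll is a placeholder for a pymongo collection):

import polars as pl
import pyarrow.parquet as pq
from pymongoarrow.api import find_arrow_all

# Extract as an Arrow table; _id comes back with the ObjectId extension type.
table = find_arrow_all(coll, {})

# The extension type metadata ends up in the parquet file...
pq.write_table(table, "export.parquet")

# ...and polars cannot read it back because of the extension type.
df = pl.read_parquet("export.parquet")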

ShaneHarvey commented 1 month ago

Here are the casts we do internally:

import pyarrow as pa

try:
    import polars as pl
except ImportError:
    pl = None

def _cast_away_extension_type(field: pa.Field) -> pa.Field:
    """Return a copy of the field with extension types replaced by their storage types."""
    if isinstance(field.type, pa.ExtensionType):
        field_without_extension = pa.field(field.name, field.type.storage_type)
    elif isinstance(field.type, pa.StructType):
        # Recurse into each nested struct field.
        field_without_extension = pa.field(
            field.name,
            pa.struct([_cast_away_extension_type(nested_field) for nested_field in field.type]),
        )
    elif isinstance(field.type, pa.ListType):
        # Recurse into the list's value field.
        field_without_extension = pa.field(
            field.name, pa.list_(_cast_away_extension_type(field.type.value_field))
        )
    else:
        field_without_extension = field

    return field_without_extension

def _arrow_to_polars(arrow_table: pa.Table):
    """Helper function that converts an Arrow Table to a Polars DataFrame.

    Note: Polars lacks ExtensionTypes. We cast them to their base Arrow types.
    """
    if pl is None:
        msg = "polars is not installed. Try pip install polars."
        raise ValueError(msg)

    schema_without_extensions = pa.schema(
        [_cast_away_extension_type(field) for field in arrow_table.schema]
    )
    arrow_table_without_extensions = arrow_table.cast(schema_without_extensions)

    return pl.from_arrow(arrow_table_without_extensions)

def find_polars_all(collection, query, *, schema=None, **kwargs):
    return _arrow_to_polars(find_arrow_all(collection, query, schema=schema, **kwargs))

https://github.com/mongodb-labs/mongo-arrow/blob/eeeb3bff9e458ab3679fcf2f827375b60128053a/bindings/python/pymongoarrow/api.py#L297
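In the meantime you could apply the same cast yourself before writing parquet, something like this (a sketch reusing the internal helper above; coll is a placeholder pymongo collection, and note these helpers are private and may change):

import pyarrow as pa
import pyarrow.parquet as pq
from pymongoarrow.api import find_arrow_all

table = find_arrow_all(coll, {})

# Replace extension types (e.g. ObjectId -> fixed_size_binary[12]) with their
# storage types so the parquet file contains only plain types polars can read.
plain_schema = pa.schema([_cast_away_extension_type(field) for field in table.schema])
pq.write_table(table.cast(plain_schema), "export.parquet")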

We could offer a feature to customize the BSON type conversion so that a cast wouldn't be needed.

Bonnevie commented 1 month ago

Hi @ShaneHarvey, amazing, thanks for the snippet! I just checked what field.type.storage_type returns for an ObjectId extension type, and as far as I can tell it is pa.binary(12). But if I try to pass a schema with {"_id": pa.binary(12)}, I get the following error:

Unsupported data type in schema for field "_id" of type "fixed_size_binary[12]"
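For reference, roughly what I ran (a sketch; coll is a placeholder collection):

import pyarrow as pa
from pymongoarrow.api import Schema, find_arrow_all

# The ObjectId extension type's storage type is fixed-size binary:
table = find_arrow_all(coll, {})
print(table.schema.field("_id").type.storage_type)  # fixed_size_binary[12]

# But requesting that storage type directly in the schema is rejected:
find_arrow_all(coll, {}, schema=Schema({"_id": pa.binary(12)}))  # raises the error above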

Is that expected? Should I not be able to specify a schema that casts an ObjectId field directly to its storage type?

This might be what you were getting at with your final comment.

ShaneHarvey commented 1 month ago

That is expected, as we have not implemented direct support for pa.binary() yet (https://jira.mongodb.org/browse/ARROW-214). Looking into it more, I believe it should be more straightforward to implement support for fixed-size binary, so I opened a new ticket here: https://jira.mongodb.org/browse/ARROW-251