mongodb-labs / mongo-arrow

MongoDB integrations for Apache Arrow. Export MongoDB documents to numpy array, parquet files, and pandas dataframes in one line of code.
https://mongo-arrow.readthedocs.io
Apache License 2.0

ARROW-238 Add Support for nested ObjectIDs in polars conversion #220

Closed sibbiii closed 3 months ago

sibbiii commented 5 months ago

Hi,

_arrow_to_polars currently does not support casting extension types in nested fields. This prevents ObjectIDs from being read when they are in nested fields.

I could not get the conversion to work with the original code, but I found a way: arrow_table_without_extensions = arrow_table.cast(schema_without_extensions) casts the schema of the whole table in one go.

schema_without_extensions is built recursively from the old schema. Support for lists still needs to be added; it should not be that hard, and I may try tomorrow.

I am not an expert in Apache Arrow; my world is pandas and Polars. I have written some unit tests locally to test the code, but I am not confident that I have not overlooked something, so please review carefully.

219

caseyclements commented 5 months ago

Thank you for your submission. It looks good to me. We are waiting on Polars to support ExtensionTypes, but in the meantime, I don't see why we wouldn't add this. I cannot recall why we commented out the list and struct cases before. Please give us a few days to review.

Here is the link to the mongo-arrow task: https://jira.mongodb.org/browse/ARROW-202. It contains links to the Polars issues.

caseyclements commented 4 months ago

Hi @sibbiii. I'm sorry for the delay; I've been very busy. Would you please add a couple of tests for this new functionality?

lazargugleta commented 4 months ago

Hey @caseyclements, I extended the existing test for _arrow_to_polars with lists and structs. Feel free to let me know if you need anything else.