mongodb-labs / mongo-arrow

MongoDB integrations for Apache Arrow. Export MongoDB documents to numpy array, parquet files, and pandas dataframes in one line of code.
https://mongo-arrow.readthedocs.io
Apache License 2.0

[Feature request] `coerce_number_to_str`-like optional flag while reading data to handle known datatype inconsistencies #238

Open DataEnggNerd opened 1 month ago

DataEnggNerd commented 1 month ago

When fetching data with find_polars_all, find_pandas_all, or find_arrow_all from pymongoarrow.api, the schema is inferred from the first document. If the same key has a different datatype in a later document, that value is read as null.

MongoDB documents

[
    {
        "name": "test",
        "code": "1"
    },
    {
        "name": "test",
        "code": 1
    }
]

Current implementation

from pymongoarrow.api import find_polars_all

query_result_df = find_polars_all(
    collection=client,
    query=query,
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id                             ┆ name ┆ code │
# │ ---                             ┆ ---  ┆ ---  │
# │ binary                          ┆ str  ┆ str  │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ null │
# └─────────────────────────────────┴──────┴──────┘

For such known discrepancies, where the first document has a pyarrow.string() value and subsequent documents have pyarrow.int*() values, the field could be read as pyarrow.string() throughout by adding an optional parameter coerce_number_to_str to all find_* APIs.

Expected implementation

from pymongoarrow.api import find_polars_all

query_result_df = find_polars_all(
    collection=client,
    query=query,
    coerce_number_to_str=True,  # proposed flag, not available in pymongoarrow today
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id                             ┆ name ┆ code │
# │ ---                             ┆ ---  ┆ ---  │
# │ binary                          ┆ str  ┆ str  │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# └─────────────────────────────────┴──────┴──────┘

Reference - coerce_numbers_to_str in https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field
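
To make the requested behaviour concrete, here is a minimal plain-Python sketch of what the flag would be expected to do to the two sample documents. The helper name `coerce_numbers_to_str` and its `fields` argument are hypothetical and only illustrate the semantics; this is not pymongoarrow code, and a real implementation would most likely live in the library's type handling.

```python
from numbers import Number


def coerce_numbers_to_str(docs, fields):
    """Return copies of docs with numeric values in `fields` converted to str.

    Hypothetical helper for illustration only; not part of pymongoarrow.
    """
    coerced = []
    for doc in docs:
        new_doc = dict(doc)  # shallow copy so the input documents stay untouched
        for name in fields:
            value = new_doc.get(name)
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(value, Number) and not isinstance(value, bool):
                new_doc[name] = str(value)
        coerced.append(new_doc)
    return coerced


docs = [
    {"name": "test", "code": "1"},
    {"name": "test", "code": 1},
]
normalized = coerce_numbers_to_str(docs, ["code"])
```

After this step every `code` value is a string, so a schema inferred from the first document also fits the second one.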

aclark4life commented 1 month ago

Thank you! Tracking in JIRA https://jira.mongodb.org/browse/ARROW-252

DataEnggNerd commented 2 weeks ago

@aclark4life I have seen the comment on the attached Jira ticket. Shall we discuss the proposed change here?

aclark4life commented 2 weeks ago

Yes! Are you able to send a PR with the proposed changes?

DataEnggNerd commented 1 week ago

@aclark4life I would like to discuss the design before getting into the implementation. In Jira I saw a suggestion to add a new data type, which I am fine with. With that approach, though, I would expect to pass a schema only for the affected field. Also, how would a schema be passed for nested keys?

Any help is appreciated.

aclark4life commented 1 week ago

No problem! Does this help at all? https://mongo-arrow.readthedocs.io/en/1.3.0/schemas.html#nested-data-with-schema I believe we're in agreement that we could support adding a new field type StrToIntField or IntToStrField as @ShaneHarvey suggested.
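
Until such a field type exists, one possible stopgap (not proposed anywhere in this thread) is to convert the values on the server with MongoDB's `$toString` operator inside an `$addFields` stage, and to read the result through one of the `aggregate_*` functions instead of `find_*`. The sketch below only builds the pipeline. The helper `to_string_stage` and the nested key `details.code` are hypothetical names chosen for illustration.

```python
def to_string_stage(paths):
    """Build an $addFields stage that rewrites each path as a string.

    Hypothetical helper; dotted paths address nested keys.
    """
    return {"$addFields": {path: {"$toString": f"${path}"} for path in paths}}


pipeline = [
    {"$match": {"name": "test"}},
    to_string_stage(["code", "details.code"]),
]

# Assumed usage, given an existing pymongo collection object named `collection`:
# from pymongoarrow.api import aggregate_arrow_all
# table = aggregate_arrow_all(collection, pipeline)
```

Because the conversion happens before the documents reach pymongoarrow, the inferred type for `code` should be a string for every row. Values of BSON types that `$toString` cannot convert may cause the stage to fail, so this fits only fields known to hold strings or numbers.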