mongodb-labs / mongo-arrow

MongoDB integrations for Apache Arrow. Export MongoDB documents to numpy array, parquet files, and pandas dataframes in one line of code.
https://mongo-arrow.readthedocs.io
Apache License 2.0

MongoDB's Decimal128 seems to be returned as fixed_size_binary[16] #203

Open K-to-the-D opened 7 months ago

K-to-the-D commented 7 months ago

Hi,

when I use pymongoarrow.api.aggregate_arrow_all(), Decimal128 values seem to be returned as FixedSizeBinary once context.finish() is called. Looking at the code, my assumption is that this stems from lib.pyx, where return pyarrow_wrap_array(out).cast(Decimal128Type_()) on line 784 does not cast the fixed_size_binary back to Decimal128.

Versions: pymongo==4.6.2, pymongoarrow==1.3.0, pyarrow==15.0.1

blink1073 commented 7 months ago

Hi @K-to-the-D, can you please share some example code?

I set a debug point in this test and the resulting data types were:

pyarrow.Table
Int64: int32
float: double
int: int32
datetime: timestamp[ms]
ObjectId: extension<pymongoarrow.objectid<ObjectIdType>>
Decimal128: extension<pymongoarrow.decimal128<Decimal128Type>>
str: string
bool: bool
Binary: extension<pymongoarrow.binary<BinaryType>>
Code: extension<pymongoarrow.code<CodeType>>

K-to-the-D commented 7 months ago

Hi @blink1073,

thanks for the response. You are right, it works for top-level Decimal128. Unfortunately, I have to deal with Objects that contain nested Decimal128 fields.

Example code:

from pymongo import MongoClient
from bson.decimal128 import Decimal128
from pymongoarrow.api import aggregate_arrow_all

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client["my_dummy_database"]
collection = db["my_dummy_collection"]

# Insert object with Decimal128
collection.insert_one(
    {
        "name": "Product",
        "price": {
            "net": Decimal128("29.99"),
            "gross": Decimal128("35.99"),
        },
    }
)

pipeline = [
    {"$match": {"price.gross": {"$lt": Decimal128("50.00")}}},
]

# Execute aggregation and retrieve PyArrow Table
arrow_table = aggregate_arrow_all(collection, pipeline)

# Display the result and type
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"types:\t{arrow_table['price'].type}")
print(f"values:\t{arrow_table['price'][0]}")

Output:

types:  struct<net: fixed_size_binary[16], gross: fixed_size_binary[16]>
values: [('net', b'\xb7\x0b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00<0'), ('gross', b'\x0f\x0e\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00<0')]

blink1073 commented 7 months ago

Ah, understood, this will be fixed by https://jira.mongodb.org/browse/ARROW-179.
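
As a possible stopgap, the 16-byte values printed above can be decoded by hand, since they appear to be the raw little-endian BID encoding of each Decimal128. The following is only a minimal sketch using the standard library, not part of pymongoarrow: the helper name decode_decimal128_bid is hypothetical, and it covers only ordinary finite values (NaN, Infinity, and the rarely used alternate encoding are rejected). With pymongo installed, bson.decimal128.Decimal128.from_bid(raw).to_decimal() should give the same result with full coverage.

```python
from decimal import Decimal

EXPONENT_BIAS = 6176  # exponent bias of the IEEE 754-2008 decimal128 format


def decode_decimal128_bid(raw: bytes) -> Decimal:
    """Decode 16 little-endian BID bytes into a decimal.Decimal.

    Hypothetical helper for illustration; handles ordinary finite values only.
    """
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")

    low = int.from_bytes(raw[:8], "little")   # lower 64 bits of the value
    high = int.from_bytes(raw[8:], "little")  # upper 64 bits of the value

    sign = high >> 63  # 0 = positive, 1 = negative

    # When the two bits after the sign are both set, the value is NaN,
    # Infinity, or uses the alternate layout; this sketch does not cover those.
    if (high >> 61) & 0b11 == 0b11:
        raise NotImplementedError("special decimal128 encodings are not handled")

    exponent = ((high >> 49) & 0x3FFF) - EXPONENT_BIAS
    coefficient = ((high & ((1 << 49) - 1)) << 64) | low

    # Building from a (sign, digits, exponent) tuple avoids context rounding.
    digits = tuple(int(d) for d in str(coefficient))
    return Decimal((sign, digits, exponent))


# The two byte strings from the output in this thread.
net = b"\xb7\x0b" + b"\x00" * 12 + b"<0"
gross = b"\x0f\x0e" + b"\x00" * 12 + b"<0"

print(decode_decimal128_bid(net))    # 29.99
print(decode_decimal128_bid(gross))  # 35.99
```

The decoded values match the documents inserted in the example, which suggests the data itself survives the round trip and only the Arrow type annotation on nested fields is lost.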