mongodb-labs / mongo-arrow

MongoDB integrations for Apache Arrow. Export MongoDB documents to numpy array, parquet files, and pandas dataframes in one line of code.
https://mongo-arrow.readthedocs.io
Apache License 2.0

loading an Int64 with a schema that says Int32 raises an OverflowError #218

Open sibbiii opened 5 months ago

sibbiii commented 5 months ago

Hi,

This might look like a stupid bug report at first glance, but let me explain:

Assume a master service reads data from MongoDB that is written by other services. By design, one of the cool things about MongoDB is that it can work schemaless (I know you can enforce a schema).

Several services write data to MongoDB: collection.insert_one({'data_to_test': 42})

and some master service reads this data: pymongoarrow.api.aggregate_arrow_all(collection, [], schema=pymongoarrow.api.Schema({'data_to_test': pyarrow.int32()}))

This works absolutely fine. Even if some service writes a string (or ObjectId, or datetime, or ...) to this field: collection.insert_one({'data_to_test': 'a string'})

the master service just receives a 'null' and all is fine.
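
For illustration, here is a minimal end-to-end sketch of the behavior described so far (assuming a local MongoDB instance; the database and collection names are placeholders):

import pyarrow
import pymongo
import pymongoarrow.api

client = pymongo.MongoClient()  # assumes a local MongoDB instance
collection = client.test_db.test_collection  # placeholder names

collection.insert_one({'data_to_test': 42})          # fits into int32
collection.insert_one({'data_to_test': 'a string'})  # type mismatch

table = pymongoarrow.api.aggregate_arrow_all(
    collection, [], schema=pymongoarrow.api.Schema({'data_to_test': pyarrow.int32()}))
print(table['data_to_test'])  # 42 for the int, null for the string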

But then one day the master service breaks completely because some service wrote an Int64 to this field: collection.insert_one({'data_to_test': 1_000_000_000_000})

Now the master service does not get a null. Instead, pymongoarrow.api.aggregate_arrow_all raises OverflowError: value too large to convert to int32_t.

I have now written acceptance tests for all possible combinations of data in MongoDB read with every schema type, e.g. an int in the database read with a schema that says string. All combinations work fine (setting the value to null on a type mismatch is fine). The only combination that breaks everything is:

collection.insert_one({'data_to_test': 1_000_000_000_000})
pymongoarrow.api.aggregate_arrow_all(collection, [], schema=pymongoarrow.api.Schema({'data_to_test': pyarrow.int32()}))

I consider this a "bug" because I cannot read any Int32 data if there is a single Int64 value in the collection. My workaround for now is to always read as Int64 and then downcast (sketched below), but is this really how it should be?

P.S.: It's not a showstopper once you know that reading as Int32 is a no-go if the schema is not enforced, but it's kind of surprising that reading raises while all other combinations work fine.
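
For reference, a minimal sketch of the read-as-Int64-then-downcast workaround (the helper name and the choice to null out out-of-range values before casting are mine, not part of pymongoarrow):

import pyarrow as pa
import pyarrow.compute as pc
import pymongoarrow.api

def read_field_as_int32(collection, field):  # hypothetical helper
    # Read with a wide int64 schema so large values cannot overflow.
    table = pymongoarrow.api.aggregate_arrow_all(
        collection, [], schema=pymongoarrow.api.Schema({field: pa.int64()}))
    col = table[field]
    # Null out values that do not fit into int32, then downcast safely;
    # existing nulls (type mismatches) stay null.
    in_range = pc.and_kleene(
        pc.greater_equal(col, -2**31), pc.less_equal(col, 2**31 - 1))
    masked = pc.if_else(in_range, col, pa.scalar(None, pa.int64()))
    return table.set_column(
        table.schema.get_field_index(field), field, pc.cast(masked, pa.int32()))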

keanamo commented 5 months ago

Hi @sibbiii, I've created a ticket to track this request: https://jira.mongodb.org/browse/PYTHON-4519