mongodb-labs / mongo-arrow

MongoDB integrations for Apache Arrow. Export MongoDB documents to numpy array, parquet files, and pandas dataframes in one line of code.
https://mongo-arrow.readthedocs.io
Apache License 2.0

aggregate_arrow_all does not return column of fields with "null" values only #225

Open K-to-the-D opened 3 months ago

K-to-the-D commented 3 months ago

Hi, when using pymongoarrow.api.aggregate_arrow_all() it seems to omit columns that would contain only null values.

Field "email" with None only

data = [
    {"name": "Charlie", "email": None},
    {"name": "Eve", "email": None},
]
PyMongoArrow result:
 [{'_id': ObjectId('66a36acc11ce1209ca0bfcf8'), 'name': 'Charlie'}, {'_id': ObjectId('66a36acc11ce1209ca0bfcf9'), 'name': 'Eve'}]
PyMongo result:
 [{'_id': ObjectId('66a36acc11ce1209ca0bfcf8'), 'name': 'Charlie', 'email': None}, {'_id': ObjectId('66a36acc11ce1209ca0bfcf9'), 'name': 'Eve', 'email': None}]

PyMongoArrow result contains the 'name' field but is missing the 'email' field.

Field "email" with None and empty string

data = [
    {"name": "Charlie", "email": None},
    {"name": "Eve", "email": ""},
]
PyMongoArrow result:
 [{'_id': ObjectId('66a3689f75fbe1b2bef04931'), 'name': 'Charlie', 'email': None}, {'_id': ObjectId('66a3689f75fbe1b2bef04932'), 'name': 'Eve', 'email': ''}]
PyMongo result:
 [{'_id': ObjectId('66a3689f75fbe1b2bef04931'), 'name': 'Charlie', 'email': None}, {'_id': ObjectId('66a3689f75fbe1b2bef04932'), 'name': 'Eve', 'email': ''}]

PyMongoArrow result contains 'name' and 'email' fields.

Code used for this example:

from pymongo import MongoClient
from pymongoarrow.api import aggregate_arrow_all

data = [
    {"name": "Charlie", "email": None},
    {"name": "Eve", "email": None},
]

# Insert data
client = MongoClient("mongodb://localhost:27017/")
db = client["my_dummy_database"]
collection = db["my_dummy_collection"]
collection.insert_many(data)

# Retrieve results
pipeline = [{"$match": {"email": {"$exists": True}}}]
result_arrow = aggregate_arrow_all(collection, pipeline)
result_regular = collection.aggregate(pipeline)

print("PyMongoArrow result:\n", result_arrow.to_pylist())
print("PyMongo result:\n", list(result_regular))

caseyclements commented 3 months ago

Thanks for reporting this bug, @K-to-the-D! This has to do with the auto schema, and it should hopefully be straightforward to fix given Arrow's null type.