mongodb-labs / mongo-arrow

MongoDB integrations for Apache Arrow. Export MongoDB documents to NumPy arrays, Parquet files, and pandas DataFrames in one line of code.
https://mongo-arrow.readthedocs.io
Apache License 2.0

Dataframe is all NaT and None after loading #127

Open Sondos-Omar opened 1 year ago

Sondos-Omar commented 1 year ago

I was trying mongo-arrow to load a dataset from MongoDB. It loads only the selected columns, which saves space, but the resulting dataframe contains only NaT and None values. Is this a common issue, and how can I fix it? Thanks in advance.

df = collection.find_pandas_all(
    {"prop.Start": {"$gte": start_date, "$lte": end_date}},
    schema=Schema({"prop.Start": datetime, "prop.Name": str, "_id.objectId": str}),
)

ShaneHarvey commented 1 year ago

Hi @Sondos-Omar, could you please provide some sample input and output of the current behavior vs the behavior you would expect?

Sondos-Omar commented 1 year ago

Hi @ShaneHarvey, thank you for your reply. Here is a sample of the output. Loading this data using collection.find and then appending the results to a pandas dataframe in batches works and loads the expected data (the dates, names, and ids), but that approach takes much more space and time. I double-checked the schema in MongoDB Compass: prop is an object, prop.Start is a datetime, and the rest are strings. [screenshot of the output dataframe]

juliusgeo commented 1 year ago

Hi! Thank you for raising this issue. Unfortunately, this happens because Schemas do not support the "dot" notation used in MongoDB projections. At this time, the best workaround is to flatten the data with an aggregation pipeline before ingesting it into PyMongoArrow. I have opened a ticket for the implementation of a real fix, and a ticket here for us to update our documentation with the correct workaround.
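
The mismatch can be illustrated with plain Python dictionaries (a minimal sketch, not PyMongoArrow internals): a nested document has no top-level key literally named `"prop.Start"`, so a schema keyed with dot notation finds nothing, and every row comes back as NaT/None.

```python
from datetime import datetime

# A document shaped like the ones in this thread: "Start" lives inside
# the "prop" sub-document, not under a literal top-level key "prop.Start".
doc = {"prop": {"Start": datetime(2023, 1, 1), "Name": "foo"}}

print(doc.get("prop.Start"))  # None: no such top-level key exists
print(doc["prop"]["Start"])   # 2023-01-01 00:00:00, the actual nested value
```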

juliusgeo commented 1 year ago

An example aggregation pipeline you could use:

db.collection.aggregate([
  {
    "$project": {
      "Name": "$prop.Name",
      "Start": "$prop.Start"
    }
  },
])

Which would yield something that looks like this:

[
  {
    "Name": "foo",
    "Start": <datetime>,
    "_id": ObjectId("5a934e000102030405000000")
  }
]
juliusgeo commented 1 year ago

@Sondos-Omar here is a more detailed example:

from datetime import datetime

from bson import datetime_ms
from pymongo import MongoClient
from pymongoarrow.api import Schema, find_pandas_all
from pymongoarrow.types import ObjectIdType

coll = MongoClient(username="user", password="password").db.coll
coll.drop()
start_date = datetime_ms._millis_to_datetime(0, coll.codec_options)
end_date = datetime_ms._millis_to_datetime(10, coll.codec_options)
coll.insert_many([
    {"prop": {"Name": "foo", "Start": start_date}},
    {"prop": {"Name": "foo", "Start": end_date}},
])

# The code below is likely all you will need; the code above just sets up
# the database so it contains the right kind of data.
df = find_pandas_all(
    coll,
    {"prop.Start": {"$gte": start_date, "$lte": end_date}},
    projection={
        "Name": "$prop.Name",
        "Start": "$prop.Start",
    },
    schema=Schema({"_id": ObjectIdType(), "Start": datetime, "Name": str}),
)
print(df)
>>>
                                             _id                   Start Name
0  b'c\xe6\xd1\xf3\xd3\xea\xac\xf5\x04\xd6\x95c' 1970-01-01 00:00:00.000  foo
1  b'c\xe6\xd1\xf3\xd3\xea\xac\xf5\x04\xd6\x95d' 1970-01-01 00:00:00.010  foo
juliusgeo commented 1 year ago

Hi @Sondos-Omar! We have updated our documentation to show more examples of using nested data in this PR: https://github.com/mongodb-labs/mongo-arrow/pull/130