Open Sondos-Omar opened 1 year ago
Hi @Sondos-Omar, could you please provide some sample input and output of the current behavior vs the behavior you would expect?
Hi @ShaneHarvey , thank you for your reply, here is a sample of the output. Loading this data using collections.find followed by appending the dataset to pandas in batches works and loads the expected data (the dates, names and ids). But this takes so much space and time. Double checked the schema on mongodb compass, the prop is object and the prop.Start is datetime and the rest are strings.
Hi! Thank you for raising this issue. This is unfortunately because Schemas do not support the "dot" notation used in MongoDB projections. Unfortunately, at this time the best workaround seems to be to flatten the data before ingesting into PyMongoArrow by using an aggregate pipeline. I opened a ticket for the implementation of a real solution to this issue, and a ticket here for us to update our documentation with the correct workaround.
An example aggregation pipeline you can use would be:
db.collection.aggregate([
{
"$project": {
"Name": "$value.prop.Name",
"Start": "$value.prop.Start"
}
},
])
Which would yield something that looks like this:
[
{
"Name": "foo",
"Start": <datetime>,
"_id": ObjectId("5a934e000102030405000000")
}
]
@Sondos-Omar here is a more detailed example:
from pymongo import MongoClient
from pymongoarrow.api import Schema, find_pandas_all
from pymongoarrow.types import (
ObjectIdType,
)
from datetime import datetime
from bson import datetime_ms
coll = MongoClient(username="user", password="password").db.coll
coll.drop()
start_date, end_date = datetime_ms._millis_to_datetime(0, coll.codec_options),datetime_ms._millis_to_datetime(10, coll.codec_options)
coll.insert_many([{
"prop": {
"Name": "foo",
"Start": start_date,
}
}, {
"prop": {
"Name": "foo",
"Start": end_date,
}
},])
# The code below is likely all that you will need, the code above is just setting up the Database so it contains the right kind of data.
df=find_pandas_all(coll,
{"prop.Start": {'$gte':start_date,'$lte':end_date,}},
projection={
"Name": "$prop.Name",
"Start": "$prop.Start"
},
schema=Schema({"_id": ObjectIdType(), "Start": datetime, "Name": str}))
print(df)
>>>
_id Start Name
0 b'c\xe6\xd1\xf3\xd3\xea\xac\xf5\x04\xd6\x95c' 1970-01-01 00:00:00.000 foo
1 b'c\xe6\xd1\xf3\xd3\xea\xac\xf5\x04\xd6\x95d' 1970-01-01 00:00:00.010 foo
Hi! @Sondos-Omar We have updated our documentation in this PR to show more examples for using nested data: https://github.com/mongodb-labs/mongo-arrow/pull/130
I was trying mongo arrow to load a dataset from mongodb, it is loading the selected columns only that's saving space, but the dataframe is all Nat and Nones only. Is this a common issue and how to fix that? Thanks in advance
df=collection.find_pandas_all( { "prop.Start": {'$gte':start_date,'$lte':end_date}} , schema=Schema({ 'prop.Start': datetime, 'prop.Name':str, '_id.objectId':str }))