mongodb-labs / python-bsonjs

A fast BSON to MongoDB Extended JSON converter for Python - This Repository is NOT a supported MongoDB product
Apache License 2.0

insert_many not working #14

Closed dingding72 closed 4 years ago

dingding72 commented 5 years ago

I was experimenting with bsonjs. insert_one works fine, but when I tried insert_many I got the following error:

  File "C:\ProgramData\Anaconda2\lib\site-packages\pymongo\common.py", line 453, in validate_is_document_type
    "collections.MutableMapping" % (option,))
TypeError: document must be an instance of dict, bson.son.SON, bson.raw_bson.RawBSONDocument, or a type that inherits from collections.MutableMapping

I had cast it with "rawBS1 = RawBSONDocument(bson_bytes)" on the line just before, and that worked fine with insert_one.

ShaneHarvey commented 5 years ago

I think the problem you're running into is that you're passing a single RawBSONDocument to insert_many instead of a list of documents. We give a helpful error when the documents argument is a single dict, SON, or OrderedDict:

>>> client.test.test.insert_many({})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pymongo/collection.py", line 739, in insert_many
    raise TypeError("documents must be a non-empty list")
TypeError: documents must be a non-empty list
>>> client.test.test.insert_many(SON())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pymongo/collection.py", line 739, in insert_many
    raise TypeError("documents must be a non-empty list")
TypeError: documents must be a non-empty list
>>> client.test.test.insert_many(OrderedDict())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pymongo/collection.py", line 739, in insert_many
    raise TypeError("documents must be a non-empty list")
TypeError: documents must be a non-empty list

However, when passing a single RawBSONDocument to insert_many, we get this unhelpful error:

>>> client.test.test.insert_many(RawBSONDocument(bson.BSON.encode({'_id':2})))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pymongo/collection.py", line 753, in insert_many
    blk.ops = [doc for doc in gen()]
  File "pymongo/collection.py", line 744, in gen
    common.validate_is_document_type("document", document)
  File "pymongo/common.py", line 453, in validate_is_document_type
    "collections.MutableMapping" % (option,))
TypeError: document must be an instance of dict, bson.son.SON, bson.raw_bson.RawBSONDocument, or a type that inherits from collections.MutableMapping

I opened https://jira.mongodb.org/browse/PYTHON-1690 to fix the exception raised in this case, but I don't think there's any other bug here. insert_many works when given a list of RawBSONDocuments:

>>> import bson
>>> from bson.raw_bson import RawBSONDocument
>>> docs = [{'_id':1}, RawBSONDocument(bson.BSON.encode({'_id':2}))]
>>> docs
[{'_id': 1}, RawBSONDocument('\x0e\x00\x00\x00\x10_id\x00\x02\x00\x00\x00\x00', codec_options=CodecOptions(document_class=<class 'bson.raw_bson.RawBSONDocument'>, tz_aware=False, uuid_representation=PYTHON_LEGACY, unicode_decode_error_handler='strict', tzinfo=None))]
>>> client.test.test.insert_many(docs)
<pymongo.results.InsertManyResult object at 0x106c32758>
>>> list(client.test.test.find())
[{u'_id': 1}, {u'_id': 2}]
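
As a caller-side guard against the mistake described above, here is a minimal sketch of a wrapper that turns a single document into a one-element list before it reaches insert_many. The name as_document_list is hypothetical and not part of PyMongo. The sketch is written for Python 3 and assumes that every document type involved (dict, SON, OrderedDict, RawBSONDocument) is a Mapping.

```python
from collections.abc import Mapping


def as_document_list(documents):
    """Return `documents` as a non-empty list of documents.

    Hypothetical helper for illustration; not part of PyMongo.
    """
    # A single document is itself a Mapping, so wrap it in a list
    # rather than letting the caller iterate over the document.
    if isinstance(documents, Mapping):
        return [documents]
    # Materialize any other iterable (list, tuple, generator, ...).
    docs = list(documents)
    if not docs:
        raise TypeError("documents must be a non-empty list")
    return docs


print(as_document_list({'_id': 1}))
print(as_document_list([{'_id': 1}, {'_id': 2}]))
```

With a helper like this, a call such as client.test.test.insert_many(as_document_list(rawBS1)) should behave the same whether rawBS1 is one RawBSONDocument or a list of them.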

dingding72 commented 5 years ago

Hi, Shane, thank you very much! It works beautifully now! I am going to load all my data over the next few days, all as RawBSONDocuments, and hopefully I will see a big performance improvement. Thanks!

dingding72 commented 5 years ago

Hi, Shane:

I have a way to convert the resulting MongoDB cursor back to a pandas DataFrame. The speed is OK: about 1 second for 100K rows (documents) with 20+ columns, while the query itself over 20+ million documents took < 0.2 seconds in MongoDB. I didn't use bsonjs's dumps, though. What would you suggest as the best practice or fastest approach for converting the cursor to a DataFrame?

Thanks!

ShaneHarvey commented 5 years ago

To convert the cursor to a dataframe it may be faster to use BSON-NumPy (https://bson-numpy.readthedocs.io/en/latest/) to convert the cursor to a NumPy array. Then you could use the array instead of a dataframe or convert the array into a dataframe.

Please let me know how either of these options works!