mongodb-labs / python-bsonjs

A fast BSON to MongoDB Extended JSON converter for Python - This Repository is NOT a supported MongoDB product
Apache License 2.0

Using bsonjs for GridFS #12

Closed CarstVaartjes closed 2 years ago

CarstVaartjes commented 7 years ago

Hi,

First of all, this looks like an extremely interesting library that could really improve some of the performance issues in pymongo with larger documents! This is maybe more of a question than an issue, but it might help other people too.

Normally, when I put something into GridFS, I use bson.json_util.dumps to do a conversion: dict -> JSON string -> BSON binary. From what I read in the pymongo docs, bsonjs is a stand-in for it (with fewer options but much better performance). However, as far as I can see, bsonjs doesn't do exactly the same as json_util; it only covers the JSON string -> BSON binary step. So I need to construct the JSON string myself first, for which I can use bson.json_util or another library like ujson:

%timeit a = bson.json_util.dumps(nested_dict)
1 loop, best of 3: 255 ms per loop

%timeit a = ujson.dumps(nested_dict)
10 loops, best of 3: 24.4 ms per loop

So ujson makes this step about 10x faster, which is great. But I cannot always use ujson (because of ObjectIds), and bsonjs itself trips over string encodings that I cannot control through ujson.
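To illustrate, the pipeline I have in mind looks roughly like this (just a sketch; it assumes I pre-convert ObjectIds to plain strings so ujson can handle them, and that bsonjs.loads/bsonjs.dumps convert between a JSON string and raw BSON bytes):

import ujson
import bsonjs

# Nested dict with ObjectIds already converted to plain strings,
# because ujson cannot serialize ObjectId instances itself.
nested_dict = {'name': 'example', 'ref_id': '5a1b2c3d4e5f6a7b8c9d0e1f'}

# dict -> JSON string (fast with ujson)
json_str = ujson.dumps(nested_dict)

# JSON string -> raw BSON bytes via bsonjs
raw_bson = bsonjs.loads(json_str)

# ...and back to a JSON string when reading
json_str_again = bsonjs.dumps(raw_bson)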

Question 1: Unless I have a quick way to generate the JSON string, there is no real advantage to using bsonjs, right? So I need to work within these constraints?

Question 2: Is there any way to also write raw documents to GridFS, like you documented for document_class? Or does that happen automatically if the database handle is in raw BSON mode?

Question 3: How does it work for updates on normal (non-GridFS) collections? Do I need to encode {'_id': ObjectId('123456789')}, {'$set': {'foo': 'bar'}} as two raw BSON documents?

Thanks!

ShaneHarvey commented 7 years ago

Hi @CarstVaartjes, thanks for your interest in using python-bsonjs! I have two questions about your use-case before I give you some advice on using it.

So normally if I put something into gridfs, i use bson.json_util.dumps to do a conversion: dict -> json string -> bson binary.

  1. So you're using GridFS to store arbitrarily large BSON documents to work around the 16MB document size limit?

  2. If you want to encode a Python dict to BSON, I don't think you want to go from dict -> JSON -> BSON. PyMongo lets you encode a dict, or any mapping type, directly to BSON using bson.BSON.encode:

    >>> from bson import BSON
    >>> raw_bson = BSON.encode({'my': 'dict'})
    >>> raw_bson
    b'\x12\x00\x00\x00\x02my\x00\x05\x00\x00\x00dict\x00\x00'
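If you also need to hand raw BSON bytes back to PyMongo for inserts, something along these lines should work (a rough sketch using bson.raw_bson.RawBSONDocument; the collection name here is just an example):

    >>> from pymongo import MongoClient
    >>> from bson.raw_bson import RawBSONDocument
    >>> coll = MongoClient().test.example
    >>> # Wrap the raw bytes so PyMongo inserts them without re-encoding a dict.
    >>> coll.insert_one(RawBSONDocument(raw_bson))
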
CarstVaartjes commented 7 years ago

Thanks for your answer!

1: Yes, basically we have large, complex nested dicts that run over 16MB. We move the bulk of each dict into GridFS to work around the limit and keep the header information in a normal collection to do lookups/filters etc.
2A: For the collection part, we just use basic pymongo (inserting the dicts). It seems we could use python-bsonjs there to speed things up, but next to insert_one and find_one we also use update_one, update_many and delete_one, and I'm not sure how I can use those with python-bsonjs.

2B: For the GridFS part, we used bson.json_util loads/dumps but are now switching to ujson with a manual conversion of ObjectIds to strings (roughly like the sketch below) to make sure we don't run into issues. This saves a lot of time (still not super fast, but quite a bit faster than before). I'm not sure whether the BSON conversion is actually a significant performance factor here, since GridFS just stores a plain string.
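
The manual ObjectId replacement is roughly this kind of helper (a simplified sketch, not our production code):

from bson import ObjectId

def stringify_object_ids(value):
    # Recursively replace ObjectId instances with their string form so the
    # structure becomes serializable by ujson.
    if isinstance(value, ObjectId):
        return str(value)
    if isinstance(value, dict):
        return {k: stringify_object_ids(v) for k, v in value.items()}
    if isinstance(value, list):
        return [stringify_object_ids(v) for v in value]
    return value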

Thank you so much for your answer!

ShaneHarvey commented 7 years ago

2B: For the GridFS part, we used bson.json_util loads/dumps but are now switching to ujson with a manual conversion of ObjectIds to strings to make sure we don't run into issues. This saves a lot of time (still not super fast, but quite a bit faster than before). I'm not sure whether the BSON conversion is actually a significant performance factor here, since GridFS just stores a plain string.

So it sounds like your data is represented in memory as a Python dict and you're converting that into JSON strings to store in GridFS. Is this roughly the process?

from bson import json_util

# Load JSON document from GridFS
json_str = gridfs_lookup_doc()
large_dict = json_util.loads(json_str)
# Update large dict...

# Store JSON document into GridFS
json_str = json_util.dumps(large_dict)
gridfs_insert_doc(json_str)
CarstVaartjes commented 6 years ago

Hi, I just saw that I never answered. In the end we converted to a JSON string with ujson plus a manual replacement of ObjectIds (we know where to find them), and that turned out to be really fast. However, I also just saw this in pymongo 3.6: http://api.mongodb.com/python/current/api/pymongo/collection.html?highlight=find_raw#pymongo.collection.Collection.find_raw_batches

I can use the raw batches to escape the overhead of the pymongo cursor (no kidding, around 50% of the time spent in pymongo goes to the cursor itself), but that also piqued my interest in this project again. I see you stopped updating it; is it still alive? If not, do you know of alternatives?

behackett commented 6 years ago

It's still alive. We just haven't had time to work on it compared to other priorities. We at least want to update it to the latest version of libbson, to support the final version of the extended JSON spec:

https://github.com/mongodb/specifications/blob/master/source/extended-json.rst

The raw batches methods were added for use in the bson-numpy project, to avoid needing to decode BSON to Python dict before building an array.
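
The pattern looks roughly like this (sketched from memory of the bson-numpy README, so treat the exact call and dtype as assumptions):

import bsonnumpy
import numpy as np
from pymongo import MongoClient

collection = MongoClient().db.collection
# Describe the fields to extract; the raw BSON batches are decoded straight
# into the ndarray without building intermediate Python dicts.
dtype = np.dtype([('x', np.int64), ('y', np.float64)])
ndarray = bsonnumpy.sequence_to_ndarray(
    collection.find_raw_batches(), dtype, collection.count())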

CarstVaartjes commented 6 years ago

Thanks! That would be really interesting. Next to the BSON translation, the cursor itself is a major bottleneck in pymongo (which is a bit weird, as I would expect a generator to be faster than a list operation).

ShaneHarvey commented 6 years ago

I can use the raw batches to escape the overhead of the pymongo cursor (no kidding, around 50% of the time spent in pymongo goes to the cursor itself)

Next to the BSON translation, the cursor itself is a major bottleneck in pymongo

Can you expand on this a bit more? Are you saying that the Cursor class is spending a lot of time doing something other than network I/O and BSON decoding? That would be surprising.

CarstVaartjes commented 6 years ago

Hi,

(I edited this with a nicer example.)

It's not as bad as it used to be with older pymongo versions, but it's still significant. This is Python 2.7, MongoDB 3.4 (non-sharded, non-replicated) and pymongo 3.6.0. My example code:

from bson import decode_all

def normal_example(db_table, qc=None, qf=None, skip=0, limit=0):
    # Standard find(): the cursor decodes every document into a dict one by one.
    if not qf:
        qf = {}
    cursor = db_table.find(filter=qf, projection=qc, skip=skip, limit=limit, batch_size=999999999)
    return list(cursor)

def raw_example(db_table, qf=None, qc=None, skip=0, limit=0):
    # find_raw_batches(): fetch raw BSON batches and decode each batch in one call.
    if not qf:
        qf = {}
    cursor = db_table.find_raw_batches(filter=qf, projection=qc, skip=skip, limit=limit, batch_size=999999999)
    output_list = []
    while True:
        try:
            output_list.extend(decode_all(cursor.next()))
        except StopIteration:
            break
    return output_list

print(db_table.count())
%timeit normal_example(db_table, qc=qc, skip=0, limit=20000)
%timeit raw_example(db_table, qc=qc, skip=0, limit=20000)

For a small table with large documents (between 1 MB and 10 MB each), fetching a single key/value from deep inside the nested dicts (the qc argument), this gives:

66284
1 loop, best of 3: 3.66 s per loop
1 loop, best of 3: 3.1 s per loop

Using prun on the 'normal' loop:

100280 function calls (100278 primitive calls) in 6.336 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   16    6.166    0.385    6.166    0.385 {method 'recv' of '_socket.socket' objects}
    1    0.093    0.093    0.093    0.093 {bson._cbson.decode_all}
 20001    0.033    0.000    6.318    0.000 cursor.py:1172(next)
 20000    0.012    0.000    0.012    0.000 database.py:402(_fix_outgoing)
    1    0.010    0.010    6.336    6.336 <string>:1(<module>)
    1    0.008    0.008    6.326    6.326 <ipython-input-13-d31c1c9e48e4>:3(normal_example)
 20004    0.005    0.000    0.005    0.000 collection.py:305(database)
 20026    0.004    0.000    0.004    0.000 {len}
 20000    0.003    0.000    0.003    0.000 {method 'popleft' of 'collections.deque' objects}
    1    0.001    0.001    6.261    6.261 cursor.py:897(__send_message)
    2    0.000    0.000    6.166    3.083 network.py:166(_receive_data_on_socket)
    1    0.000    0.000    0.000    0.000 message.py:953(unpack)
    2    0.000    0.000    6.261    3.131 cursor.py:1059(_refresh)
    1    0.000    0.000    0.000    0.000 cursor.py:112(__init__)

From what we have seen before, cursor.py adds significant overhead compared to find_raw_batches plus decode_all into a list (see the third entry in the prun output!). This happens especially when we read larger tables with smaller documents: the BSON decoding becomes less of a burden, but the relative performance impact of the cursor can become high. We have seen anywhere between 0% (no difference) and 50% slower. It's only relevant for larger collections, though.

behackett commented 6 years ago

Hi. You might want to try again with PyMongo 3.7.0 (just released last week). We made some changes to the networking code that may result in a large performance increase for you.

CarstVaartjes commented 6 years ago

Thanks @behackett!! Is this about find_raw_batches or about the general find() cursor bottleneck?

ShaneHarvey commented 6 years ago

The issue Bernie mentioned is https://jira.mongodb.org/browse/PYTHON-1513. The fix improves PyMongo's performance when reading large messages off of sockets, which benefits both find and find_raw_batches. It was primarily aimed at Python 3, but Python 2 performance should be better as well. Would you be able to run the benchmark again and post the results comparing PyMongo 3.6.1 and 3.7.0?

ShaneHarvey commented 2 years ago

There hasn't been any recent activity so I'm closing this. Thanks for reaching out! Please feel free to reopen this if we've missed something.