pavlis opened this issue 1 year ago
Just realized `read_ensemble_data` has the same problem in spades, since it currently does a loop over a cursor. That causes a transaction for every member read, which is really unnecessary. The new algorithm will take a cursor, load a list of docs with the cursor, and then read the members with the atomic reader. With that revision there will be no calls to `find_one` in the replacement for `read_ensemble_data`.
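The revised flow described above might be sketched like this. It is a minimal, database-free sketch: `atomic_reader` is a stand-in for whatever per-member reader is actually used, and any iterable of dicts stands in for the pymongo cursor.

```python
def read_ensemble(cursor, atomic_reader):
    """Materialize the cursor in one pass, then build each member with
    the atomic reader -- no per-member find_one round trips."""
    doclist = list(cursor)                      # single traversal of the cursor
    return [atomic_reader(doc) for doc in doclist]

# usage: a plain list stands in for a pymongo cursor
members = read_ensemble([{"_id": 1}, {"_id": 2}], lambda doc: doc["_id"])
```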
BTW - I am planning to deprecate `read_ensemble_data` and we will have only one `read_data` method. Future calls to `read_ensemble_data` will work but generate a warning message.
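A deprecation shim of the kind described could look like this. This is a hedged sketch: the method signature and the delegation to `read_data` are assumptions for illustration, not the actual MsPASS code.

```python
import warnings

class Database:
    def read_data(self, arg0, **kwargs):
        # stand-in for the real unified reader
        return arg0

    def read_ensemble_data(self, arg0, **kwargs):
        # deprecated entry point: warn, then delegate to read_data
        warnings.warn(
            "read_ensemble_data is deprecated; use read_data instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.read_data(arg0, **kwargs)
```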
I think the changes here make sense. The only caveat is that the new code explicitly checks for type `dict` or `ObjectId`, but I am not sure how it behaves for other dict-like objects (e.g., `Metadata`). The old implementation can work on any type as long as it has the `[]` operator defined. Maybe that's why it was written that way.
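The distinction can be illustrated with a small sketch (the `Metadata` class here is a stand-in, not the actual MsPASS one): duck typing on the `[]` operator accepts any mapping-like object, while an explicit `isinstance(..., dict)` check rejects dict-like classes that do not subclass `dict`.

```python
class Metadata:
    """Stand-in for a dict-like class that implements [] but is not a dict."""
    def __init__(self, data):
        self._data = dict(data)
    def __getitem__(self, key):
        return self._data[key]

def get_id_duck(doc):
    # duck typing: works for anything with a [] operator
    return doc["_id"]

def get_id_strict(doc):
    # explicit type check: rejects dict-like objects such as Metadata
    if not isinstance(doc, dict):
        raise TypeError("expected a dict")
    return doc["_id"]

assert get_id_duck(Metadata({"_id": 42})) == 42   # duck typing accepts Metadata
assert get_id_strict({"_id": 42}) == 42           # plain dict passes the strict check
```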
Do we need that functionality? It complicates a lot of things and seems unnecessary, since in pymongo a "document" is a python dictionary. I once thought otherwise, but I was wrong. pymongo translates whatever MongoDB actually uses to a dict. Run any version of the following with the right collection name and you can see that is always true:

```python
doc = db.collection.find_one()
print(type(doc))  # <class 'dict'> (assuming the collection is nonempty)
```
as the documentation says as well. We have no examples where we needed to support anything other than a cursor, which is not purely pythonic anyway. A cursor is iterable and yields the equivalent of a list of documents (dict containers), but it isn't actually a list at all.
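That iterable-but-not-a-list behavior can be mimicked with a plain generator. This is a sketch only; a real pymongo `Cursor` is a much richer object, but the point about iteration is the same.

```python
def fake_cursor(docs):
    # mimics a pymongo cursor: iterable, yields dicts, but is not a list
    for doc in docs:
        yield doc

cursor = fake_cursor([{"_id": 1}, {"_id": 2}])
assert not isinstance(cursor, list)               # a cursor is not a list...
assert list(cursor) == [{"_id": 1}, {"_id": 2}]   # ...but one pass yields the docs
```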
The current version of our `read_data` method has a built-in inefficiency that I am pretty certain is unnecessary. Here is the block of code that handles arg0 of the method:
For reference, this is lines 481-489 of the current database.py file.
I am in the process of cleaning up Database and I have rewritten the above lines as follows:
This revision is superior on multiple levels, but the most important one is that the old logic ALWAYS caused this function to call the `find_one` collection method of pymongo. That was an unnecessary inefficiency because most times I've used `read_data` it was in a loop like this:

That is, in this case arg0 would be a python dictionary. The original code did something that worked but likely cost a lot of time: it extracted the object id from the dictionary and then called `find_one` to produce a copy of the data passed through the (original) `object_id` argument.

This is a case in point for why I need to do this surgery on Database. @wangyinz please tell me if I'm missing something here that I don't recall from the earlier development of the `read_data` method. If you concur, leave a comment, close this issue, and the new implementation I'm beating on will have this change included.
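The fix described amounts to dispatching on the type of arg0 and only querying when a bare id is passed. Below is a minimal, database-free sketch; the helper name `resolve_doc` and the injected `find_one` callable are illustrative assumptions, not the actual Database code.

```python
def resolve_doc(arg0, find_one):
    """If arg0 is already a document (dict), use it directly; only call
    find_one when handed a bare object id."""
    if isinstance(arg0, dict):
        return arg0                      # no database round trip
    return find_one({"_id": arg0})       # id-only path still queries

# usage: a counting stub shows the dict path never touches the database
calls = []
def fake_find_one(query):
    calls.append(query)
    return {"_id": query["_id"], "data": "loaded"}

doc = {"_id": 7, "data": "already here"}
assert resolve_doc(doc, fake_find_one) is doc and calls == []
assert resolve_doc(7, fake_find_one)["data"] == "loaded" and len(calls) == 1
```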