mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

Inefficiency in current version of Database.read_data #445

Open pavlis opened 1 year ago

pavlis commented 1 year ago

The current version of our read_data method has a built-in inefficiency that I am pretty certain is unnecessary. Here is the block of code that handles arg0 of the method:

        # find the corresponding document according to object id
        col = self[wf_collection]
        try:
            oid = object_id["_id"]
        except:
            oid = object_id
        object_doc = col.find_one({"_id": oid})
        if not object_doc:
            return None

For reference this is lines 481-489 of the current database.py file.

I am in the process of cleaning up Database and I have rewritten the above lines as follows:

        col = self[wf_collection]
        # This logic allows arg0 (doc_or_id) to be either a document
        # (dict) or an ObjectId.
        # Requires at module level:  from bson.objectid import ObjectId
        if isinstance(doc_or_id, dict):
            object_doc = doc_or_id
        elif isinstance(doc_or_id, ObjectId):
            oid = doc_or_id
            object_doc = col.find_one({"_id": oid})
            if not object_doc:
                return None
        else:
            message = "Database.read_data:  arg0 has unsupported type={}\n".format(
                str(type(doc_or_id))
            )
            raise TypeError(message)

This revision is superior on multiple levels, but the most important is that the old logic ALWAYS caused this function to call pymongo's find_one collection method. That was an unnecessary inefficiency because most of the time I've used read_data it was in a loop like this:

cursor = db.wf_miniseed.find(query)
for doc in cursor:
    d = db.read_data(doc)

That is, in this case arg0 would be a python dictionary. The original code did something that worked but likely cost a lot of time: it extracted the object id from the dictionary and then called find_one to fetch a copy of the very document that had been passed through the (original) object_id argument.

Case in point why I need to do this surgery on Database. @wangyinz please tell me if I'm missing something here that I don't recall from the earlier development of the read_data method. If you concur, leave a comment and close this issue, and the new implementation I'm beating on will have this change included.

pavlis commented 1 year ago

Just realized read_ensemble_data has the same problem in spades, since it currently does a loop over a cursor. That causes a transaction for every member read, which is really unnecessary. The new algorithm will take a cursor, load a list of docs from the cursor, and then read the members with the atomic reader. With that revision there will be no calls to find_one in the replacement for read_ensemble_data, as sketched below.
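A minimal sketch of that algorithm, assuming the revised read_data above that accepts a document directly (the plain list of results here stands in for constructing the actual ensemble object):

cursor = db.wf_miniseed.find(query)
# materialize the cursor into a list of documents; the only MongoDB
# transaction is the find() that created the cursor
doclist = [doc for doc in cursor]
# read each member with the atomic reader - no find_one calls at all
members = [db.read_data(doc) for doc in doclist]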

BTW - I am planning to deprecate read_ensemble_data so we will have only one read_data method. Future calls to read_ensemble_data will still work but will generate a warning message.
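A sketch of what that backward-compatible shim might look like (the method body and warning text are illustrative, not the final API):

import warnings

def read_ensemble_data(self, cursor, **kwargs):
    # Deprecated entry point retained for backward compatibility;
    # it now just delegates to the unified read_data method.
    warnings.warn(
        "read_ensemble_data is deprecated; use read_data instead",
        DeprecationWarning,
    )
    return self.read_data(cursor, **kwargs)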

wangyinz commented 1 year ago

I think the changes here make sense. The only caveat is that the new code explicitly checks for type dict or ObjectId, but I am not sure how it behaves for other dict-like objects (e.g., Metadata). The old implementation could work on any type as long as it had the [ ] operator defined. Maybe that's why it was written that way.
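One way to keep that duck-typed flexibility while still validating arg0 might be to test against the abstract Mapping type rather than dict. This is only a sketch with a hypothetical helper name; whether Metadata actually registers as a collections.abc.Mapping is an assumption here, and if it does not, a try/except on the [ ] operator would be closer to the old behavior:

from collections.abc import Mapping
from bson.objectid import ObjectId

def _resolve_doc(col, doc_or_id):
    # Accept any Mapping (dict, or a dict-like class such as Metadata
    # if it implements the Mapping protocol) or a raw ObjectId.
    if isinstance(doc_or_id, Mapping):
        return dict(doc_or_id)  # normalize to a plain dict
    elif isinstance(doc_or_id, ObjectId):
        return col.find_one({"_id": doc_or_id})
    else:
        raise TypeError(
            "unsupported type for arg0: {}".format(str(type(doc_or_id)))
        )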

pavlis commented 1 year ago

Do we need that functionality? It complicates a lot of things and seems unnecessary, since in pymongo a "document" is a python dictionary. I once thought otherwise, but I was wrong: pymongo translates whatever MongoDB actually stores into a dict. Run any version of the following with the right collection name and you can see that is always true:

doc = db.collection.find_one()
print(type(doc))

as the documentation says as well. We have no examples where we needed to support anything other than a cursor, which is not purely pythonic. A cursor is iterable and yields the equivalent of a list of documents (dict containers), but it isn't actually a list.
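For example (assuming db and query as in the loop example above):

cursor = db.wf_miniseed.find(query)
print(type(cursor))      # <class 'pymongo.cursor.Cursor'> - not a list
doclist = list(cursor)   # materializing the cursor does yield a true list
print(type(doclist[0]))  # <class 'dict'>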