Feature: V1.707 Updates to PdbxLoader and RepoLoadWorkflow to skip brute force document purge by default

piehld commented 1 year ago

This PR introduces a new "forcePurge" flag which defaults to False, to cause the loader to skip the brute force document purge step, which introduces a significant slow-down likely due to the use of a regex filter for searching through and deleting documents. Also, in any case, if loadType=="replace", there is a second more efficient document deletion step that's already taking place in MongoDbUtil.deleteList() method called by __loadDocuments().

So, up to now, if loadType=="replace", the loader was spending time trying to delete pre-existing documents twice, the first of which was very slow (__purgeDocuments()), and the second of which (MongoDbUtil.deleteList()) is very fast.

Apparently, John added the brute force purge step some time ago, but the reason for it is unclear. This is why I opted to keep the functionality in the code but add a flag to control whether to run it or not.

From my testing, I didn't observe any differences between using the purge method vs. skipping it. Also, impressively, the load speed when using this new approach (with loadType=="replace" and skipping the original purge step) turns out to be just as fast as running it with loadType=="full"! This means we should be able to just set loadType=="replace" in the weekly-update workflow, without having to worry about some magic threshold of the percent of structures already loaded.

N.B.: I postdated the HISTORY and comments since I don't plan to merge this until at least that date.

aliciaaevans commented 1 year ago

Actual changes look good, and I think it makes sense to leave the force option available until it's determined there are no issues.

MongoDB documentation says regex prefix expression on an indexed field can actually be fast, but only if it's case sensitive. https://www.mongodb.com/docs/manual/reference/operator/query/regex/#index-use So if for some reason the purge is needed, is there a way to make sure it uses the correct case for the entry IDs? I would have thought the case in the database would be consistent. That might be a good middle ground option.

Speculation about why the purge is there in the first place:

Is it possible certain types of errors were preventing the *_identifiers.entry_id field (or other replace attribute) from getting populated?
What about the COLLECTION_DOCUMENT_REPLACE_ATTRIBUTE_NAMES list, is it guaranteed to be maintained properly?

piehld commented 1 year ago

@aliciaaevans Thanks for reviewing and nice find on the regex usage! I would normally think that the rcsb_id field will always be capitalized, but the fact that John explicitly added the # case-insensitive comment and other comments there makes it seem intentional, so I wonder if there is a case where it wouldn't be capitalized. I know that this purge method can be applied to any database and collection—not just the core databases—so perhaps lowercase IDs can occur in them?

These are John's comments. Clearly he did some testing and ultimately decided on the case-insensitive method. But I agree that making it case-sensitive would definitely be a good middle-ground option

# selectD = {"rcsb_id": "/%s/" % cardId} # this filter did not work
selectD = {"rcsb_id": {"$regex": "^%s" % cardId.upper(), "$options": "i"}}  # case-insensitive

Speculation about why the purge is there in the first place:

Is it possible certain types of errors were preventing the *_identifiers.entry_id field (or other replace attribute) from getting populated?

What about the COLLECTION_DOCUMENT_REPLACE_ATTRIBUTE_NAMES list, is it guaranteed to be maintained properly?

Those are both good points and I think one or both of them are likely the reasons. For the first point, I'm not sure if a document would be able to load without them, but that is dependent on the schema, which—as is also the case for your second point—would require careful maintenance of the schema. So, perhaps the potential for error/oversight in the schema is what motivated him to add this brute force method...

rcsb / py-rcsb_db

Feature: V1.707 Updates to PdbxLoader and RepoLoadWorkflow to skip brute force document purge by default #59