tnc-ca-geo / animl-api

Backend for https://animl.camera
Other
4 stars 0 forks source link

Audit and improve DB indexing #119

Closed nathanielrindlaub closed 2 months ago

nathanielrindlaub commented 1 year ago

We're getting email alerts from the prod DB saying:

Query Targeting: Scanned Objects / Returned has gone above 1000   The ratio of documents scanned to returned exceeded 1000.0 on  animl-cluster-shard-00-01.84ieh.mongodb.net, which typically suggests that un-indexed queries are being run. To help identify which query is problematic, we recommend navigating to the Query Profiler tool within Atlas. Read more about the Profiler here. | Query Targeting: Scanned Objects / Returned has gone above 1000 |   | The ratio of documents scanned to returned exceeded 1000.0 on  animl-cluster-shard-00-01.84ieh.mongodb.net, which typically suggests that un-indexed queries are being run. To help identify which query is problematic, we recommend navigating to the Query Profiler tool within Atlas.

When I take a look at the MongoDB Atlas Query Profiler and db operation in question, it looks like it's the getLabels aggregation that occurs every time a projects query is requested:

{
  "type": "command",
  "ns": "animl-prod.images",
  "command": {
    "aggregate": "images",
    "pipeline": [
      {
        "$match": {
          "projectId": "sci_biosecurity"
        }
      },
      {
        "$unwind": "$objects"
      },
      {
        "$unwind": "$objects.labels"
      },
      {
        "$match": {
          "objects.labels.validation.validated": {
            "$not": {
              "$eq": false
            }
          }
        }
      },
      {
        "$group": {
          "_id": null,
          "uniqueCategories": {
            "$addToSet": "$objects.labels.category"
          }
        }
      }
    ],
    "cursor": {},
}
...

That approach to aggregating labels is pretty expensive to begin with (and might be worth re-thinking entirely), but when I look at the existing indexes in prod via MongoDB compass, it looks like we're not doing ourselves any favors by only indexing on the _id field. We should definitely be indexing by projectId as well - which we are in the dev DB and is defined in the Mongoose Image schema. What's odd is that it looks like I added those indexes to the schema over a year ago, so I'm not sure why they didn't get re-indexed on prod?

Additionally, mongo-cursor-pagination recommends that we index every field that might be passed to its paginatedField prop. For Image.queryByFilter(), it looks like the paginated field options for Image querying are basically any column in the image table in the front end that can be sorted, so currently dateTimeOriginal, dateAdded, and cameraId. We've also been using mongo-cursor-pagination elsewhere (for querying Batches and BatchErrors, for example), so it would be worth auditing those and seeing what fields if any should get indexed on their respective collections.

nathanielrindlaub commented 1 year ago

Good rules of thumb to follow when planning index strategies here: https://www.mongodb.com/docs/manual/applications/indexes/

nathanielrindlaub commented 1 year ago

After reading up on this and the looking at the indexes that MongoDB was recommending through it's "Performance Advisor" tab in its UI, I came to the conclusion that I would not have anticipated the indexes it recommended (not even close), so perhaps it's best to create indexes via the UI rather than in code and just follow what the Performance Advisor recommends.

For now let's go with that and see if we run into trouble.

nathanielrindlaub commented 8 months ago

Reopening this because we're still getting those query warnings even with new Project Label querying improvements, and the index size has ballooned. We need a real strategy for making sure queries are scalable.

jue-henry commented 7 months ago

While investigating indexes and possible optimizations a few potential bugs were found. Adding them here for visibility:

  1. Timeout caused when both reviewed and not reviewed flags are unselected:

"$match": { "$or": [ { "objects.locked": false }, { "objects": { "$size": 0 } }, { "objects": { "$not": { "$elemMatch": { "labels": { "$elemMatch": { "$or": [ { "validation": null }, { "validation.validated": true } ] } } } } } } ] } }, { "$match": { "objects.0": { "$exists": true }, "objects.locked": { "$ne": false }, "objects.labels": { "$elemMatch": { "$or": [ { "validation": null }, { "validation.validated": true } ] } } } }, { "$match": {} },

Not exactly a use case that should be supported, since no images should be returned. Should either be handled on frontend to prevent this option from being selected or input check should be done on the backend to prevent this potential timeout. Addressing this on the frontend as a part of this PR.

  1. When viewing the home page, we query for images in the project sorted by dateTimeOriginal, and it appears we are also sorting via image._id. This is not explicitly done by us, so we need to investigate if there is a potential optimization by removing this secondary sort field. According to this article, date holds precedence to string, so actual sort order isn't affected. Will have to look into if this affects performance. Currently we have an index supporting projectId, dateTimeOriginal, _id, this potentially can be simplified to remove _id.

{ "type": "command", "isFromUserConnection": true, "ns": "animl-prod.images", "collectionType": "normal", "command": { "aggregate": "images", "pipeline": [ { "$match": { "projectId": "island_spotted_skunks" } }, { "$match": {} }, { "$sort": { "dateTimeOriginal": -1, "_id": -1 } }, { "$limit": 51 } ],

  1. Small visual bug Screenshot 2024-03-18 at 3 37 22 PM

Steps to reproduce: have a successful image count request, and then cause image count to timeout. ex: unchecking both reviewed and not reviewed

@nathanielrindlaub

jue-henry commented 7 months ago

Did some cleanup of indexes based on common query requests as of 03/20/24 - 22:15:43.807UTC.

Will continue to monitor the performance advisor and for alerts. Opened #161 in order to reduce index size and to more easily query for reviewed/not-reviewed images. Will also update indexes to use this field once the back-end has been updated with this field. In the meantime, using objects.locked in order to represent reviewed/not-reviewed images. Need better understanding of multi-key indexes: https://www.mongodb.com/docs/manual/core/indexes/index-types/index-multikey/. Using objects and objects.locked together should not be allowed, but is automatically done via mongodb Atlas.

nathanielrindlaub commented 2 months ago

@jue-henry, do we want to close this out in favor of some of the other related DB-optimization issues we have? Or is it worth keeping it open? Up to you.

jue-henry commented 2 months ago

@nathanielrindlaub I think we can close this for now as we have other issues open and indexes are not a huge problem from what I can tell