opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means or linear regression, to help developers build ML-related features within OpenSearch.

[FEATURE] Mitigation of Orphaned Interactions in Conversational Memory #1267

Open HenryL27 opened 1 year ago

HenryL27 commented 1 year ago

Background We created a Conversational Memory CRUD API per #1150 to deal with long-term storage of conversations for conversational applications. Conversational conversation conversations conversational :). Anyway, in this API we identify two resources: ConversationMeta and Interaction. A ConversationMeta contains some information about the Conversation as a whole, whereas an Interaction is the individual (call-and-response) unit of a Conversation - a Conversation is made up of Interactions. We store ConversationMetas and Interactions in two separate indices.

Is your feature request related to a problem? When deleting a conversation, some interactions may be left over from the bulk delete - and with the ConversationMeta object gone, the only way to get rid of them is for the admin to delete them manually, one by one (or maybe with some kind of fancy delete-by-query, as sketched below).
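For reference, a minimal sketch of what that manual cleanup could look like from plugin code using delete-by-query. The interactions index name and the conversation_id field name here are assumptions for illustration, not necessarily what the plugin actually uses.

import org.opensearch.client.Client;
import org.opensearch.index.query.QueryBuilders;
import org.opensearch.index.reindex.BulkByScrollResponse;
import org.opensearch.index.reindex.DeleteByQueryAction;
import org.opensearch.index.reindex.DeleteByQueryRequest;

// Remove every interaction that still points at a given (already-deleted) conversation id.
// ".plugins-ml-conversation-interactions" and "conversation_id" are assumed names.
public final class OrphanCleanup {
    public static long deleteInteractionsFor(Client client, String conversationId) {
        DeleteByQueryRequest request = new DeleteByQueryRequest(".plugins-ml-conversation-interactions")
            .setQuery(QueryBuilders.termQuery("conversation_id", conversationId))
            .setRefresh(true);
        BulkByScrollResponse response = client.execute(DeleteByQueryAction.INSTANCE, request).actionGet();
        return response.getDeleted(); // number of interactions actually removed
    }
}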

What solution would you like? Two additions:

  1. Add a retry mechanism to the bulk delete, so that failed deletions have to fail a couple of times before we give up (a sketch follows this list).
  2. Implement garbage collection to occasionally detect and delete orphaned interactions.
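A rough sketch of the retry in addition 1, assuming we already have the list of interaction ids to delete and simply re-issue whatever the bulk response reports as failed. The index name and retry count are placeholders, not values taken from the plugin.

import java.util.ArrayList;
import java.util.List;
import org.opensearch.action.bulk.BulkItemResponse;
import org.opensearch.action.bulk.BulkRequest;
import org.opensearch.action.bulk.BulkResponse;
import org.opensearch.action.delete.DeleteRequest;
import org.opensearch.client.Client;

// Retry failed deletions a few times before giving up; whatever is still left
// afterwards is exactly the set of orphans the garbage collector would pick up.
public final class RetryingBulkDelete {
    private static final int MAX_RETRIES = 3;
    private static final String INTERACTIONS_INDEX = ".plugins-ml-conversation-interactions"; // assumed name

    public static List<String> deleteWithRetry(Client client, List<String> interactionIds) {
        List<String> remaining = new ArrayList<>(interactionIds);
        for (int attempt = 0; attempt <= MAX_RETRIES && !remaining.isEmpty(); attempt++) {
            BulkRequest bulk = new BulkRequest();
            for (String id : remaining) {
                bulk.add(new DeleteRequest(INTERACTIONS_INDEX, id));
            }
            BulkResponse response = client.bulk(bulk).actionGet();
            List<String> failed = new ArrayList<>();
            for (BulkItemResponse item : response) {
                if (item.isFailed()) {
                    failed.add(item.getId());
                }
            }
            remaining = failed;
        }
        return remaining; // ids we gave up on
    }
}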

What alternatives have you considered? Garbage collection without retry is also sufficient to eliminate orphans, although we'd rather not create more work for the collector than necessary.

Do you have any additional context? PR 1196

HenryL27 commented 1 year ago

Another option is to re-implement the storage layer as a single index, where the "interaction" type is a nested field of "conversation". Plus: delete is as close to atomic as I think you can get. Minus: nested queries tend to have poor performance (not sure it would be that much worse than what already exists, though - I'd want to test it), and conversations would have a limited size (not sure that's an issue, since the limit is by default something like 1000, which is a lot of interactions). Also, CreateInteraction would take place as an UpdateRequest (sketched after the schema below), so the Conversation object becomes mutable again, opening the door to various potential concurrency failures. It also makes a future _search endpoint more complicated.

The index schema would become something like:

.plugins-ml-conversations
{
  "_meta": {
    "schema_version": 2
  },
  "properties": {
    "name": {"type": "keyword"},
    "create_time": {"type": "date", "format": "whatever the format is"},
    "user": {"type": "keyword"},
    "interactions": {
      "type": "nested",
      "properties": {
        "create_time": {"type": "date", "format": "whatever the format is"},
        "input": {"type": "text"},
        "prompt_template": {"type": "text"},
        "response": {"type": "text"},
        "origin": {"type": "keyword"},
        "additional_info": {"type": "text"}
      }
    }
  }
}
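Under that layout, CreateInteraction would look roughly like the scripted update below, which is where the concurrency concern comes from: two concurrent appends to the same conversation document can hit version conflicts. Field and index names follow the sketch above, and retryOnConflict(2) is just an illustrative value.

import java.util.HashMap;
import java.util.Map;
import org.opensearch.action.update.UpdateRequest;
import org.opensearch.action.update.UpdateResponse;
import org.opensearch.client.Client;
import org.opensearch.script.Script;
import org.opensearch.script.ScriptType;

// Append one interaction to the nested "interactions" array of a conversation document.
public final class CreateInteractionAsUpdate {
    public static UpdateResponse appendInteraction(Client client, String conversationId, Map<String, Object> interaction) {
        Map<String, Object> params = new HashMap<>();
        params.put("interaction", interaction);
        Script script = new Script(
            ScriptType.INLINE,
            "painless",
            "if (ctx._source.interactions == null) { ctx._source.interactions = []; } ctx._source.interactions.add(params.interaction)",
            params
        );
        UpdateRequest request = new UpdateRequest(".plugins-ml-conversations", conversationId)
            .script(script)
            .retryOnConflict(2); // concurrent appends to the same conversation can conflict
        return client.update(request).actionGet();
    }
}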
HenryL27 commented 1 year ago

The garbage collector has two possible implementations, at least that I can think of:

  1. Something similar to MLSyncUpCron.java: a job that runs at fixed intervals using the threadpool and lifecycle methods. It would run an aggregation query over all of the conversation ids held by interactions, and then a query to get all the conversation ids in the conversation index. It would then determine which interaction-side conversation ids aren't in the set of conversation-index ids and delete the interactions that reference them (a sketch follows below).
  2. An independent lifecycle component (or some such thing). Whenever a conversation fails to be deleted, tell the component (or maybe tell it which interactions didn't get deleted). Then, if the number of failed deletions is above some threshold, try to perform those deletions again (only removing them from the deletion queue if the deletion succeeds).

Option 1 is probably a little more stable, since it's stateless - whenever garbage collection occurs, the deletion computation is done based on the current state of the database. That said, that computation could get super expensive at large scale.

Option 2 is computationally cheaper, since it keeps track of what to delete as deletions occur - we don't need to do the whole set-comparison operation. However, those ids are held in memory, which is both a constant passive drain on memory resources (although theoretically only up to the deletion threshold) and means that if the cluster goes down, the ids go with it (I guess it would perform option 1's set-comparison on startup to avoid correctness issues). It could also be attacked by creating large numbers of failed deletions, triggering garbage collection many times in a row - if the collection (deletion) itself is expensive, this could be a problem. Last, since OpenSearch runs on a cluster, I'd need to figure out some of the coordination details - we want to record which items to delete in exactly one place.
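For concreteness, here is a minimal sketch of option 1's set-comparison pass. The index and field names are assumptions, and the id collection is capped at a fixed size; a real implementation would paginate (e.g. a composite aggregation plus search_after) rather than assume everything fits in one response.

import java.util.HashSet;
import java.util.Set;
import org.opensearch.action.search.SearchRequest;
import org.opensearch.action.search.SearchResponse;
import org.opensearch.client.Client;
import org.opensearch.index.query.QueryBuilders;
import org.opensearch.index.reindex.DeleteByQueryAction;
import org.opensearch.index.reindex.DeleteByQueryRequest;
import org.opensearch.search.SearchHit;
import org.opensearch.search.aggregations.AggregationBuilders;
import org.opensearch.search.aggregations.bucket.terms.Terms;
import org.opensearch.search.builder.SearchSourceBuilder;

// One garbage-collection pass: find conversation ids referenced by interactions that no
// longer exist in the conversation-meta index, then delete the interactions holding them.
public final class OrphanCollector {
    private static final String META_INDEX = ".plugins-ml-conversation-meta"; // assumed name
    private static final String INTERACTIONS_INDEX = ".plugins-ml-conversation-interactions"; // assumed name
    private static final int MAX_IDS = 10_000; // real code would paginate instead

    public static void collect(Client client) {
        Set<String> referenced = referencedConversationIds(client);
        referenced.removeAll(existingConversationIds(client)); // whatever is left is orphaned
        if (referenced.isEmpty()) {
            return;
        }
        DeleteByQueryRequest delete = new DeleteByQueryRequest(INTERACTIONS_INDEX)
            .setQuery(QueryBuilders.termsQuery("conversation_id", referenced));
        client.execute(DeleteByQueryAction.INSTANCE, delete).actionGet();
    }

    // Conversation ids that interactions point at, via a terms aggregation.
    private static Set<String> referencedConversationIds(Client client) {
        SearchSourceBuilder source = new SearchSourceBuilder()
            .size(0)
            .aggregation(AggregationBuilders.terms("ids").field("conversation_id").size(MAX_IDS));
        SearchResponse response = client.search(new SearchRequest(INTERACTIONS_INDEX).source(source)).actionGet();
        Terms terms = response.getAggregations().get("ids");
        Set<String> ids = new HashSet<>();
        terms.getBuckets().forEach(b -> ids.add(b.getKeyAsString()));
        return ids;
    }

    // Conversation ids that actually exist, taken from the document ids of the meta index.
    private static Set<String> existingConversationIds(Client client) {
        SearchSourceBuilder source = new SearchSourceBuilder()
            .query(QueryBuilders.matchAllQuery())
            .fetchSource(false)
            .size(MAX_IDS);
        SearchResponse response = client.search(new SearchRequest(META_INDEX).source(source)).actionGet();
        Set<String> ids = new HashSet<>();
        for (SearchHit hit : response.getHits().getHits()) {
            ids.add(hit.getId());
        }
        return ids;
    }
}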

HenryL27 commented 1 year ago

@dhrubo-os @jngz-es @ylwu-amzn thoughts? Do you have a preference for which of these options I should go for? Is there another one I haven't even considered?

HenryL27 commented 12 months ago

Having done some more research, it looks like the nested index is not really an option - it's not possible to get only the inner documents. (Yes, I could parse through them myself, but I want to add a search API, and I don't want to have to 1/ reimplement all the search functionality on my own, 2/ use Lucene directly to bypass OpenSearch's wrapper and query the inner documents, or 3/ temporarily load all the interactions into an index, query over it, and then delete the index.)
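For context, the closest OpenSearch gets is inner_hits on a nested query, but the hits (and therefore pagination and sorting) are still the parent conversation documents, with the matching interactions only attached to each parent. A rough sketch against the single-index mapping proposed above:

import org.apache.lucene.search.join.ScoreMode;
import org.opensearch.action.search.SearchRequest;
import org.opensearch.action.search.SearchResponse;
import org.opensearch.client.Client;
import org.opensearch.index.query.InnerHitBuilder;
import org.opensearch.index.query.NestedQueryBuilder;
import org.opensearch.index.query.QueryBuilders;
import org.opensearch.search.builder.SearchSourceBuilder;

// Search over interactions stored as a nested field: the hits are conversation documents;
// the matching interactions only come back attached to each hit via inner_hits.
public final class NestedInteractionSearch {
    public static SearchResponse searchInteractions(Client client, String text) {
        NestedQueryBuilder query = QueryBuilders
            .nestedQuery("interactions", QueryBuilders.matchQuery("interactions.response", text), ScoreMode.Max)
            .innerHit(new InnerHitBuilder());
        SearchSourceBuilder source = new SearchSourceBuilder().query(query);
        return client.search(new SearchRequest(".plugins-ml-conversations").source(source)).actionGet();
    }
}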

Now, I'm using the bulk API to delete items, and the bulk API says that any subset of its instructions can fail. But for deletions, all that happens is that Lucene marks the document as deleted and doesn't actually remove it yet. So I wonder whether the 'delete' instructions in the bulk API will ever fail in practice - I've not seen it happen. If not, then this whole point is moot.
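One way to answer that empirically would be to log the per-item results of each bulk delete. Deleting a document that no longer exists comes back as a NOT_FOUND result rather than a failure, so isFailed() should only trip for real errors (rejected executions, unavailable shards, version conflicts, and the like). A small sketch:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.opensearch.action.bulk.BulkItemResponse;
import org.opensearch.action.bulk.BulkResponse;
import org.opensearch.action.delete.DeleteResponse;

// Log what actually happened to each delete in a bulk response.
public final class BulkDeleteInspector {
    private static final Logger LOG = LogManager.getLogger(BulkDeleteInspector.class);

    public static void logResults(BulkResponse response) {
        for (BulkItemResponse item : response) {
            if (item.isFailed()) {
                LOG.warn("delete failed for {}: {}", item.getId(), item.getFailureMessage());
            } else {
                DeleteResponse delete = item.getResponse();
                LOG.info("delete of {} -> {}", item.getId(), delete.getResult()); // DELETED or NOT_FOUND
            }
        }
    }
}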