HenryL27 opened 1 year ago
Another option is to re-implement the storage layer as a single index, where the "interaction" type is a nested field of "conversation".

Plus: delete is about as close to atomic as I think you can get.

Minus: nested queries tend to have poor performance (not sure it would be that much worse than what already exists, though; I'd want to test it), and conversations would have a limited size (probably not an issue, since the default limit is something like 1000, which is a lot of interactions). Also, CreateInteraction would happen as an update request, so the Conversation object becomes mutable again, opening the door to various potential concurrency failures. It also makes a future _search endpoint more complicated.
The index schema for `.plugins-ml-conversations` would become something like:

```json
{
  "_meta": {
    "schema_version": 2
  },
  "properties": {
    "name": {"type": "keyword"},
    "create_time": {"type": "date", "format": "whatever the format is"},
    "user": {"type": "keyword"},
    "interactions": {
      "type": "nested",
      "properties": {
        "create_time": {"type": "date", "format": "whatever the format is"},
        "input": {"type": "text"},
        "prompt_template": {"type": "text"},
        "response": {"type": "text"},
        "origin": {"type": "keyword"},
        "additional_info": {"type": "text"}
      }
    }
  }
}
```
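For concreteness, CreateInteraction against this layout would be something like an update request with a script that appends to the nested field. This is only a sketch: the request shape is the standard scripted-update body, but the field names come from the proposed mapping above, not from any existing implementation.

```python
# Sketch: the _update request body a CreateInteraction would send to append
# one interaction to the nested "interactions" field of a conversation doc.

def build_append_interaction_request(interaction: dict) -> dict:
    """Build an _update body that appends one nested interaction."""
    return {
        "script": {
            "source": "ctx._source.interactions.add(params.interaction)",
            "lang": "painless",
            "params": {"interaction": interaction},
        },
        # Because the conversation doc is now mutable, some optimistic
        # concurrency control (if_seq_no / if_primary_term) would be needed
        # on the actual request to avoid lost updates.
    }

body = build_append_interaction_request({
    "create_time": "2023-08-01T00:00:00Z",
    "input": "hi",
    "response": "hello",
})
```

This is exactly the mutability concern above: two concurrent appends to the same conversation doc can race, which the two-index design avoids by making interactions immutable documents.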
The garbage collector can have 2 possible implementations, at least that I can think of:

1. A job in the style of MLSyncUpCron.java that runs at fixed intervals using the threadpool and lifecycle methods. It performs an aggregation query over all of the conversation ids held by interactions, then a query to get all the conversation ids in the conversation index, then determines which interaction-held conversation ids aren't in the set of conversation-index ids and deletes the interactions that have them.
2. Keep track of which deletions fail as they occur, and garbage-collect just those.

Option 1 is probably a little more stable, since it's stateless - whenever garbage collection occurs, the deletion computation is done based on the current state of the database. That said, that computation could get super expensive at large scale.
Option 2 is computationally cheaper, since it keeps track of what to delete as deletions occur - we don't need to do the whole set-comparison operation. However, these ids are stored ephemerally, which both imposes a permanent, passive drain on memory resources - although theoretically that's only up to the deletion threshold - and means that if the cluster goes down, the ids go with it (I guess you'd have it perform option 1's set comparison on startup to avoid correctness issues). This could potentially also be attacked by creating large numbers of failed deletions, triggering the garbage collection many times repeatedly - if the collection (deletion) itself is expensive, this could be a problem. Lastly, since OpenSearch runs on a cluster, I'd need to figure out some of the coordination details - we want to record which items to delete in exactly one place.
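Option 1's core computation is just a set difference. A minimal sketch, where the two input sets stand in for the results of (a) a terms aggregation on the interactions' conversation ids and (b) listing the ids in the conversation index:

```python
# Sketch of option 1's orphan detection. In the real job the inputs would
# come from two queries; here they are plain in-memory sets.

def find_orphaned_conversation_ids(
    ids_in_interactions: set,
    ids_in_conversation_index: set,
) -> set:
    """Conversation ids that interactions reference but that no longer exist."""
    return ids_in_interactions - ids_in_conversation_index

orphans = find_orphaned_conversation_ids(
    {"conv-1", "conv-2", "conv-3"},  # ids aggregated from interactions
    {"conv-1", "conv-3"},            # ids still present in the conv index
)
# orphans == {"conv-2"}; its leftover interactions would then be deleted
```

The expensive part at scale isn't this difference, it's materializing the first set: the aggregation has to visit every interaction's conversation id, which is why the stateless option gets costly as the index grows.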
@dhrubo-os @jngz-es @ylwu-amzn thoughts? Do you have a preference for which of these options I should go with? Is there another one I haven't even considered?
Having done some more research, it looks like the nested index is not really an option - it's not possible to get only the inner documents. (Yes, I could parse through them myself, but I want to add a search API, and I don't want to have to 1/ reimplement all the search functionality on my own, 2/ use Lucene directly to bypass OpenSearch's wrapper and query the inner documents, or 3/ temporarily load all the interactions into an index, query over it, and then delete the index.)
Now, I'm using the bulk API to delete items, and the bulk API says that any subset of its instructions can fail. But for deletions, all that happens is that Lucene marks the item as deleted rather than physically removing it. So I wonder whether the 'delete' instructions to the bulk API will ever fail in practice? I've never seen it happen. If not, this whole point is moot.
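If delete failures do turn out to be possible, they're at least visible in the bulk response, which reports per-item status codes. A sketch of collecting the ids of failed deletes from a response body (the sample response below is made up, but follows the bulk response shape of `errors` plus per-item `status`/`error` fields):

```python
# Sketch: scanning a bulk API response for delete actions that failed.
# Each item in "items" is keyed by its action type; failed items carry a
# 4xx/5xx status and an "error" object.

def failed_delete_ids(bulk_response: dict) -> list:
    """Return the _id of every delete action that did not succeed."""
    failed = []
    for item in bulk_response.get("items", []):
        delete_result = item.get("delete")
        if delete_result and delete_result.get("status", 0) >= 400:
            failed.append(delete_result["_id"])
    return failed

sample = {
    "errors": True,
    "items": [
        {"delete": {"_id": "int-1", "status": 200}},
        {"delete": {"_id": "int-2", "status": 429,
                    "error": {"type": "es_rejected_execution_exception"}}},
    ],
}
# failed_delete_ids(sample) == ["int-2"]
```

Something like this would also be the natural place for option 2 to record its ids: whatever comes back from this scan is exactly the set the collector needs to retry.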
Background: We created a Conversational Memory CRUD API per #1150 to deal with long-term storage of conversations for conversational applications. Conversational conversation conversations conversational :). Anyway, in this API we identify two resources: ConversationMeta and Interaction. A ConversationMeta contains some information about the Conversation as a whole, whereas an Interaction is the individual (call-and-response) unit of a Conversation - a Conversation is made up of Interactions. We store ConversationMetas and Interactions in two separate indices.
Is your feature request related to a problem? When deleting a conversation, some interactions may be left over from the bulk delete - and with the ConversationMeta object gone, the only way to get rid of them is for the admin to delete them manually, one by one (or maybe with some kind of fancy delete-by-query).
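For the manual-cleanup case, the delete-by-query route would look roughly like this. A sketch only: `conversation_id` as the field name on interaction documents is an assumption about the interactions index mapping.

```python
# Sketch: a _delete_by_query body that removes all leftover interactions
# belonging to one deleted conversation. The field name "conversation_id"
# is assumed; the real interactions mapping may differ.

def orphan_cleanup_query(conversation_id: str) -> dict:
    """Body for POST <interactions-index>/_delete_by_query."""
    return {
        "query": {
            "term": {"conversation_id": conversation_id}
        }
    }

body = orphan_cleanup_query("conv-2")
```

This is essentially what an admin would run by hand today; the garbage collector proposed above just automates finding which conversation ids to plug into it.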
What solution would you like? Two additions:
What alternatives have you considered? Garbage collection without retry would also be sufficient to eliminate orphans, although we don't want to create more work for the collector than necessary.
Do you have any additional context? PR 1196