shatteredsilicon / ssm-submodules

GNU Affero General Public License v3.0
1 stars 2 forks source link

QAN Database Keeps Query Classes Long After No Examples of them Remain #171

Closed gordan-bobic closed 1 year ago

gordan-bobic commented 1 year ago

It looks like query_classes table keeps query classes forever - even after no corresponding entry in query_examples table remains. Under those conditions, as far as I can tell, queries that have no examples will never actually show up in QAN, so there could be millions of query_class rows that serve no purpose.

Objective: 1) Verify that we don't use this orphaned query_class data anywhere 2) Add this to cleanup process so that the data is purged.

Watch out for the edge case of purges through node deletion. It looks like instance_id comes from query_examples. If we delete a node, we probably delete query_examples based on this, but it looks like we don't also clean up the orphaned queries from query classes (because the same query class could come from multiple nodes). This means that query_classes cleanup will have to be based on something like:

select query_classes.query_class_id qcic, query_examples.query_class_id qcie from query_classes left join query_examples on query_examples.query_class_id = query_examples.query_class_id having qcie is null;

Get the list of qcic, then purge them by primary key.

Or maybe there is a way to "cheat", by deleting query_classes where last_seen is older than the data retention period (QUERY_RETENTION environment variable, if I remember correctly), plus a day just to make sure.

Either way, this should be added to the data pruning process, I don't think there is any usefulness to this stale query_classes data because I don't see how it could show up in QAN, and I cannot think of any other use for it.

oblitorum commented 1 year ago

Now that you memtioned the QUERY_RETENTION env (it's QUERIES_RETENTION actually), I found that the auto-remove qan data feature is actually not working, do we still need this feature?

gordan-bobic commented 1 year ago

Yes, pretty sure we need that feature to work.

oblitorum commented 1 year ago

Or maybe there is a way to "cheat", by deleting query_classes where last_seen is older than the data retention period (QUERY_RETENTION environment variable, if I remember correctly), plus a day just to make sure.

OK, I think I'm gonna add this to the "auto-purge qan data feature", and for node deletion edge case, we just leave it to the auto-purge process, so we don't need 2 separate processes to do this, although the query_classes table data will not be purged immediately after the node deletion in this case, is that OK?

gordan-bobic commented 1 year ago

Sounds good. It doesn't have to be deleted immediately, regular scheduled daily purge cycle is absolutely fine.