Run distributed query/carve based on custom tags

zhuoyuan-liu commented 2 weeks ago

We just explored osctrl-admin and found that we can add a custom tag to each node/device. However, after adding a custom tag, we cannot run a distributed query based on it. It would be a great help if we could also run queries based on these tags.

I would like to contribute to this feature, but I would like to know more details about the implementation.

According to the architecture definition, osctrl-admin should only talk to osctrl-api, not to the database directly. However, I found that osctrl-admin interacts with the DB directly in many cases. I am completely fine with implementing it that way, but I want to confirm that new changes are allowed to do the same.

From the source code, I can see that distributed queries currently select their target nodes by four types: env, platform, UUID and localname. I guess the easiest solution is to add an extra field so that we can pass the custom tags as well. What do you think?
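To make the "extra field" idea concrete, here is a minimal sketch in Go. It is not the actual osctrl code: the `Node` and `Targets` types, the `Matches` and `SelectUUIDs` functions, and the union semantics (a node is selected if any target type matches) are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical, simplified view of an enrolled node.
type Node struct {
	UUID        string
	Environment string
	Platform    string
	Localname   string
	Tags        []string
}

// Targets holds the four existing target types plus the proposed Tags field.
type Targets struct {
	Environments []string
	Platforms    []string
	UUIDs        []string
	Localnames   []string
	Tags         []string // proposed addition
}

// contains reports whether v is present in list.
func contains(list []string, v string) bool {
	for _, item := range list {
		if item == v {
			return true
		}
	}
	return false
}

// Matches reports whether the node is selected by at least one target.
// Union semantics are an assumption of this sketch.
func (t Targets) Matches(n Node) bool {
	if contains(t.Environments, n.Environment) ||
		contains(t.Platforms, n.Platform) ||
		contains(t.UUIDs, n.UUID) ||
		contains(t.Localnames, n.Localname) {
		return true
	}
	for _, tag := range n.Tags {
		if contains(t.Tags, tag) {
			return true
		}
	}
	return false
}

// SelectUUIDs returns the sorted UUIDs of all nodes matched by the targets.
func SelectUUIDs(nodes []Node, t Targets) []string {
	out := make([]string, 0, len(nodes))
	for _, n := range nodes {
		if t.Matches(n) {
			out = append(out, n.UUID)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	nodes := []Node{
		{UUID: "a", Platform: "ubuntu", Tags: []string{"finance"}},
		{UUID: "b", Platform: "darwin"},
		{UUID: "c", Platform: "windows", Tags: []string{"finance", "laptop"}},
	}
	fmt.Println(SelectUUIDs(nodes, Targets{Tags: []string{"finance"}}))
}
```

If the existing target types are meant to be combined with tags as an intersection instead (for example, "ubuntu nodes tagged finance"), `Matches` would need to change accordingly.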

javuto commented 2 weeks ago

This is something that I have planned to implement since I added tags, not only for distributed queries but for file carves as well (they are technically a type of distributed query). See https://github.com/jmpsec/osctrl/issues/76 and https://github.com/jmpsec/osctrl/issues/77. I see two possible implementations:

1. Add a new field for tags to the existing implementation. This is faster to implement, but it adds to the potential performance issues in the backend.
2. Completely reimplement how distributed queries work. This takes longer, but it removes the potential backend performance issues.

zhuoyuan-liu commented 1 week ago

Hi @javuto, I have the following idea using Redis:

1. When creating a distributed query, find all target nodes based on the tags.
2. For each target node, add the query ID to a Redis set keyed by the node UUID. Redis allows fast lookups of all active tasks for a client, using SMEMBERS to retrieve the set.
3. When a node finishes a query and sends results back, mark the query as completed by removing it from that node's active task set with SREM (set remove). This ensures that the next time the client asks for queries, only unfinished queries are returned.

I think this is enough for us, but if you want to actively track how many nodes have not finished, we can create another Redis set per query that holds its unfinished nodes, or just query the logs returned by the nodes.
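The flow above can be sketched roughly as follows. To keep the example runnable without a Redis server, `SetStore` is an in-memory stand-in that only mimics the semantics of SADD, SREM, SMEMBERS and SCARD; it is not a Redis client. The key formats (`node:<uuid>:queries`, `query:<name>:nodes`) and the `Dispatch`, `Pending`, `Complete` and `Remaining` helpers are hypothetical names chosen for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// SetStore is an in-memory stand-in for Redis sets (assumption: a real
// implementation would issue SADD/SREM/SMEMBERS/SCARD against Redis).
type SetStore map[string]map[string]struct{}

// SAdd adds member to the set stored at key, creating the set if needed.
func (s SetStore) SAdd(key, member string) {
	if s[key] == nil {
		s[key] = map[string]struct{}{}
	}
	s[key][member] = struct{}{}
}

// SRem removes member from the set at key and drops the key once the set
// is empty, as Redis does.
func (s SetStore) SRem(key, member string) {
	delete(s[key], member)
	if len(s[key]) == 0 {
		delete(s, key)
	}
}

// SMembers returns the members of the set at key. Redis returns them
// unordered; they are sorted here only to keep the output deterministic.
func (s SetStore) SMembers(key string) []string {
	out := make([]string, 0, len(s[key]))
	for m := range s[key] {
		out = append(out, m)
	}
	sort.Strings(out)
	return out
}

// SCard returns the number of members in the set at key.
func (s SetStore) SCard(key string) int { return len(s[key]) }

// Hypothetical key formats for the two kinds of sets.
func nodeKey(uuid string) string   { return "node:" + uuid + ":queries" }
func queryKey(query string) string { return "query:" + query + ":nodes" }

// Dispatch registers a query for every target node (steps 1 and 2) and
// also fills the optional per-query set of unfinished nodes.
func Dispatch(s SetStore, query string, uuids []string) {
	for _, uuid := range uuids {
		s.SAdd(nodeKey(uuid), query)
		s.SAdd(queryKey(query), uuid)
	}
}

// Pending returns the queries a node still has to run.
func Pending(s SetStore, uuid string) []string { return s.SMembers(nodeKey(uuid)) }

// Complete marks a query as finished for one node (step 3).
func Complete(s SetStore, query, uuid string) {
	s.SRem(nodeKey(uuid), query)
	s.SRem(queryKey(query), uuid)
}

// Remaining returns how many nodes have not yet finished the query.
func Remaining(s SetStore, query string) int { return s.SCard(queryKey(query)) }

func main() {
	s := SetStore{}
	Dispatch(s, "q1", []string{"n1", "n2"})
	Dispatch(s, "q2", []string{"n1"})
	Complete(s, "q1", "n1")
	fmt.Println(Pending(s, "n1"), Remaining(s, "q1"))
}
```

With a real Redis backend, the two writes in `Dispatch` and `Complete` would probably be sent in a pipeline or transaction so that both sets stay consistent.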

Benefits:

- Avoid massive database reads and writes. Previously, each read request had to go through the DB to find the list of distributed queries for the target node, and each write request had to update the counter in the DB for each distributed query.
- Reduce latency for distributed read requests, since the DB query becomes a Redis set lookup.

jmpsec / osctrl

Run distributed query/carve based on custom tags #529