[FEATURE] Support for subqueries or chaining of queries

dtaivpp commented 1 year ago

Is your feature request related to a problem? In threat hunting its often the case that you need to "join" on the same table for queries. For example: take a flat index filled with processes and information such as process ID and related details.

Case 1: In an environment it maybe normal for Outlook, OneNote, and some arbitrary.exe to have processes. When they are spawned in a chain like the following:

Outlook L OneNote L arbitrary.exe

That could be a malicious attacker starting some sub-process. There needs to be a mechanism for querying like this.

Case 2: When event pipes are spun up for inter-process communication they are named with the spawning process ID. When attackers create these pipes the often use random names. One way of checking if a pipe is validating if the process ID for the pipe is a valid process ID.

Event Pipe: event.1234.xyz

Process, 4321: L Second process, 1234: L Third process, 8282 that spun up the pipe

In the above you would see the pipe spun up by the 3rd process and would want to validate the ID of the pipe exists somewhere in the above chain of processes.

What solution would you like? There should be some straightforward way to handle join queries such as this

What alternatives have you considered? One compelling alternative would be to use a graph implementation such as Yang-DB. This is a less than optimal solution for most threat hunters. The reason is that most threat hunters are less than familiar with graph databases and how to query them. Additionally, they want their skills to be transferable and many of the other systems they would use to do this same task support joining in some manner.

MaxKsyunz commented 1 year ago

Sounds like common table expressions would solve this. [ref]

dtaivpp commented 1 year ago

@MaxKsyunz I think you are right there. Seems like this would be a perfect candidate since the WITH expression is the basis for recursive queries. I'd be a bit nervous to suggest implementing these as that feels like something that without care could take down a cluster but I am sure there must be some safety rails we can implement. Maybe backpressure could ensure these wouldn't run awry?

acedef commented 1 year ago

This is great! Joining/subsearching on the same "table" would be super useful - especially since a lot of useful relationships aren't known at the time of ingest. The named pipe use case is a good example of this.

acarbonetto commented 1 year ago

I was thinking this was a problem between tables. Doing table/index joins is hard for OpenSearch, because it isn't optimized to do cross-query searches. Using an alternate index source to map these joins would be extremely helpful for the plugin to process. One option is the use of materialized views (as proposed https://github.com/opensearch-project/sql/issues/1080), or a secondary storage like Spark. These data sources could accommodate relational joins data better than OpenSearch. Alternatively, a graph database would work very well to map relationships between indexes, and as a bonus - graph databases are well-known to help solve threat detection on graph-like systems (using a system like https://tinkerpop.apache.org/docs/current/reference/)

However, secondary storage doesn't always scale well like OpenSearch (e.g. tinkergraph) and requires that the user map their data on ingest.

Treating this as a single-index problem, were one might want to compare/join a single index against itself can be solved (in some cases) by a query re-write. Alternatively, mapping the data in a Nested object or Join object could potentially also solve the sub-query instance (https://opensearch.org/docs/latest/field-types/join/). I'm wondering if Join objects could satisfy the need for this use case.

acarbonetto commented 1 year ago

I put together a quick proposal for JOIN with USING to handle the same-table parent-child relation query. This solves the issue and utilizes OpenSearch specific functionality, so we won't be overloading the OS-SQL plugin.

Two caveats:

users will have to setup their mappings properly with parent-child relations, and
OS-SQL will need to set the routing shard itself (since this isn't configured by the OS system)

The syntax calls would look something like this (using a game-of-thrones dataset with houses and their members being the parent-child relations):

OS-SQL query:

SELECT m.name
FROM got as m
JOIN got as h USING h.member_of_house.house
WHERE h.housename = "Targaryen"

Mapping setup would look like this in the database:

{
  "mappings": {
    "properties": {
      "member_of_house": { 
        "type": "join",
        "relations": {
          "house": "member" 
        }
      },
      ...

The house data would be setup thusly:

{"index":{"_id":"1"}}
{"words":"Fire And Blood","housename":"Targaryen","sigil":"Dragon","seat":"Dragonstone", "member_of_house":"house"}

And the house members thusly:

{"index":{"_id":"4"}}
{"name":{"firstname":"Daenerys","lastname":"Targaryen","ofHerName":1},"nickname":"Daenerys \"Stormborn\"","gender":"F","parents":{"father":"Aerys","mother":"Rhaella"},"titles":[{"title":"motherOfDragons"},{"title":"queenOfTheAndals"},{"title":"breakerOfChains"},{"title":"Khaleesi"}],"member_of_house":{"name":"member", "parent":"1"}}

The pushdown to OpenSearch would look like:

{
    "query": {
        "has_parent": {
            "parent_type": "house",
            "query": {
                "match": {
                    "house": "Targaryen"
                }
            }
        }
    },
    "_source": {
        "includes": [
            "name"
        ],
        "excludes": []
    }
}

opensearch-project / sql

[FEATURE] Support for subqueries or chaining of queries #1441