matanolabs / matano

Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS
https://matano.dev
Apache License 2.0

Implement deduplication for threat intel enrichment ingestion #39

Closed by Samrose-Ahmed 1 year ago

Samrose-Ahmed commented 1 year ago

Overview

Many threat intel sources are not static: existing entries are modified or updated over time. If we poll for data based on time, an entry that was updated since the last poll is ingested again, which introduces duplicates.
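To make the problem concrete, here is a minimal sketch in Python. The record shape (`id`, `score`, `modified`) and the two polls are made up for illustration and do not come from any real feed. It compares append-only ingestion with ingestion keyed on an identifier.

```python
# Hypothetical feed records; the field names are illustrative only.
poll_1 = [{"id": "ioc-1", "score": 50, "modified": "2022-10-01"}]
poll_2 = [
    # Same indicator as in poll_1, but updated upstream since then.
    {"id": "ioc-1", "score": 90, "modified": "2022-10-02"},
    {"id": "ioc-2", "score": 10, "modified": "2022-10-02"},
]

# Append-only ingestion keeps both versions of ioc-1.
appended = poll_1 + poll_2

# Keyed ingestion keeps one row per id; later polls overwrite earlier ones.
deduplicated = {}
for record in poll_1 + poll_2:
    deduplicated[record["id"]] = record

print(len(appended), len(deduplicated))
```

The keyed version is what an upsert into the enrichment table would achieve.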

Goal

Add the ability to deduplicate data ingested from enrichment sources.

Notes

This can be implemented with MERGE INTO on Iceberg tables, which Athena engine version 3 supports:

MERGE INTO enrichment_table main USING enrichment_table_temp new
    -- match existing rows on the primary key
    ON (main.event.id = new.event.id)
    WHEN MATCHED
        -- overwrite all top level cols with the newly ingested values
        THEN UPDATE SET event = new.event, threat = new.threat
    WHEN NOT MATCHED
        -- insert all top level cols for rows not seen before
        THEN INSERT (event, threat) VALUES (new.event, new.threat);
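Since the statement above depends only on the table names, the primary key, and the list of top level columns, it could be generated per enrichment table. The following is a minimal sketch under that assumption. `build_merge_sql` is a hypothetical helper, not an existing Matano function, and it interpolates identifiers directly, so they would need to come from trusted configuration.

```python
def build_merge_sql(table, temp_table, key, columns):
    """Render a MERGE INTO statement that upserts temp_table into table.

    Hypothetical helper for illustration; identifiers are not escaped.
    """
    if not columns:
        raise ValueError("at least one column is required")
    set_clause = ", ".join(f"{col} = new.{col}" for col in columns)
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"new.{col}" for col in columns)
    return (
        f"MERGE INTO {table} main USING {temp_table} new\n"
        f"    ON (main.{key} = new.{key})\n"
        f"    WHEN MATCHED\n"
        f"        THEN UPDATE SET {set_clause}\n"
        f"    WHEN NOT MATCHED\n"
        f"        THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


sql = build_merge_sql(
    "enrichment_table", "enrichment_table_temp", "event.id", ["event", "threat"]
)
print(sql)
```

One caveat: MERGE generally expects at most one source row per matched target row, so the temp table may itself need to be deduplicated on the key before merging.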