lyft / cartography

Cartography is a Python tool that consolidates infrastructure assets and the relationships between them in an intuitive graph view powered by a Neo4j database.
https://lyft.github.io/cartography/
Apache License 2.0

Data retention, multi-tenancy data loss #1015

Open juju4 opened 1 year ago

juju4 commented 1 year ago

Bug Template

Description: When looking at the cleanup jobs*, it seems they remove any data that does not match the current UPDATE_TAG, which means that no history is preserved, only the last execution for a given source. Is that correct?

Ideally, a custom retention period would be possible, keeping data for X days and purging it only if it has not been updated within that period. The Neo4j merge should ensure that data is not duplicated and should preserve the first-seen time. If the above is correct, it also conflicts with collecting multiple instances of the same source, for example multiple Azure tenants. My ongoing testing seems to point in this direction: even with two tenant collections, I only get data for one tenant at a given time.

(*) https://github.com/lyft/cartography/blob/master/cartography/data/jobs/cleanup/azure_import_virtual_machines_cleanup.json https://github.com/lyft/cartography/blob/master/cartography/data/jobs/cleanup/crowdstrike_import_cleanup.json
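
For reference, the statements in those cleanup files follow roughly this pattern (Azure VM cleanup shown; $UPDATE_TAG is the tag of the current sync run):

MATCH (n:AzureVirtualMachine)-[:RESOURCE]->(:AzureSubscription{id: $AZURE_SUBSCRIPTION_ID}) WHERE n.lastupdated <> $UPDATE_TAG WITH n LIMIT $LIMIT_SIZE DETACH DELETE (n)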

To Reproduce:

  1. Set up cartography to collect one data type for two instances, for example two Azure tenants or another cloud provider.
  2. Check whether the database contains data for both instances at the same time, e.g. with a query like the one below.
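
A rough illustration of such a check, using the labels that appear in cartography's Azure cleanup jobs:

MATCH (vm:AzureVirtualMachine)-[:RESOURCE]->(s:AzureSubscription) RETURN s.id AS subscription, count(vm) AS vm_count, max(vm.lastupdated) AS last_update_tag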


achantavy commented 1 year ago

When looking at the cleanup jobs*, it seems they remove any data that does not match the current UPDATE_TAG, which means that no history is preserved, only the last execution for a given source. Is that correct?

Correct, this is by design. The idea is that the graph should always show its belief of the current state of the world.

Custom retention will be a very large rewrite and unfortunately I don't think we can commit to adding that in this project at this time.

To partially address the historical data use case, we do have drift detection (https://lyft.github.io/cartography/usage/drift-detect.html, shown in https://eng.lyft.com/iam-whatever-you-say-iam-febce59d1e3b [search for "tracking IAM changes over time"]), and we've used other tools to get data out of the Neo4j database at various times for reporting (something like this: https://eng.lyft.com/powering-security-reports-with-cartography-and-flyte-fd02a4a96b2f, or this: https://blog.marcolancini.it/2020/blog-tracking-moving-clouds-with-cartography/).

Yet another workaround for historical data would involve doing daily backups of the data (but that solves an entirely different problem).
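
As a minimal sketch of that export/reporting approach (assuming the official neo4j Python driver and the Azure labels used elsewhere in this issue; connection details and the output path are placeholders), something like this run on a schedule keeps point-in-time snapshots outside the graph:

import csv
import time

from neo4j import GraphDatabase  # official Neo4j Python driver

# Placeholder connection details; adjust for your deployment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Snapshot Azure VMs per subscription (labels as in the cleanup jobs quoted above).
QUERY = (
    "MATCH (vm:AzureVirtualMachine)-[:RESOURCE]->(s:AzureSubscription) "
    "RETURN s.id AS subscription_id, vm.id AS vm_id, vm.lastupdated AS lastupdated"
)

snapshot_file = f"vm_snapshot_{int(time.time())}.csv"  # hypothetical output path
with driver.session() as session, open(snapshot_file, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["subscription_id", "vm_id", "lastupdated"])
    for record in session.run(QUERY):
        writer.writerow([record["subscription_id"], record["vm_id"], record["lastupdated"]])
driver.close()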

achantavy commented 1 year ago

When it comes to multitenancy, I would make sure that all tenants are synced with the same update tag. You may need to explore building a custom sync script to get that working, see https://lyft.github.io/cartography/dev/developer-guide.html?highlight=custom%20sync#implementing-custom-sync-commands.
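
To illustrate the idea (this is not cartography's actual API; sync_tenant is a hypothetical stand-in for whatever per-tenant sync you build from the developer guide), the key point is to compute one update tag and reuse it for every tenant in the same run:

import time

def sync_tenant(neo4j_session, tenant_id, update_tag):
    # Hypothetical helper: run the Azure (or other) ingestion for one tenant,
    # passing update_tag through so every node/relationship written for this
    # tenant gets the same lastupdated value.
    ...

def sync_all_tenants(neo4j_session, tenant_ids):
    # One shared tag for the whole run: the cleanup phase then only removes
    # nodes that no tenant touched in this run, instead of wiping the data of
    # tenants synced earlier with an older tag.
    update_tag = int(time.time())
    for tenant_id in tenant_ids:
        sync_tenant(neo4j_session, tenant_id, update_tag)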

juju4 commented 1 year ago

Normally the last-updated / last-seen information indicates whether something is current or not, and it may be desirable to keep some amount of past information. Typically, security tools keep asset inventory for 7 to 45 days before removing entries, because investigation does not always happen the same day.

On my side, I addressed it with a simple workaround that does not clear anything less than 7 days old; example for Azure:

$ cat cartography/data/jobs/cleanup/azure_import_virtual_machines_cleanup.json 
{
    "statements": [
        {
            "query": "WITH datetime()-duration('P7D') AS threshold MATCH (n:AzureDataDisk)-[:ATTACHED_TO]->(:AzureVirtualMachine)-[:RESOURCE]->(:AzureSubscription{id: $AZURE_SUBSCRIPTION_ID}) WHERE n.lastupdated < threshold WITH n LIMIT $LIMIT_SIZE DETACH DELETE (n)",
            "iterative": true,
            "iterationsize": 100
        },
        {
            "query": "WITH datetime()-duration('P7D') AS threshold MATCH (:AzureDataDisk)-[r:ATTACHED_TO]->(:AzureVirtualMachine)-[:RESOURCE]->(:AzureSubscription{id: $AZURE_SUBSCRIPTION_ID}) WHERE r.lastupdated < threshold WITH r LIMIT $LIMIT_SIZE DELETE (r)",
            "iterative": true,
            "iterationsize": 100
        },
        {
            "query": "WITH datetime()-duration('P7D') AS threshold MATCH (n:AzureVirtualMachine)-[:RESOURCE]->(:AzureSubscription{id: $AZURE_SUBSCRIPTION_ID}) WHERE n.lastupdated < threshold WITH n LIMIT $LIMIT_SIZE DETACH DELETE (n)",
            "iterative": true,
            "iterationsize": 100
        },
        {
            "query": "WITH datetime()-duration('P7D') AS threshold MATCH (:AzureVirtualMachine)-[r:RESOURCE]->(:AzureSubscription{id: $AZURE_SUBSCRIPTION_ID}) WHERE r.lastupdated < threshold WITH r LIMIT $LIMIT_SIZE DELETE (r)",
            "iterative": true,
            "iterationsize": 100
        }
    ],
    "name": "cleanup Azure Compute related resources"
}

One day would be the minimum retention when collecting data from multiple tenants/instances; otherwise the latest run removes the others' data.