Researching Anonymization Approaches for Observability Data

schwesig commented 7 months ago

Title: Researching Anonymization Approaches for Observability Data

Motivation:

In our observability cluster, which encompasses metrics, logs, and traces, the need for data anonymization has arisen. This is driven by the new and diverse range of users and researchers accessing this data. While we have already implemented fine-grained access control to manage who can see what, the challenge extends to how the data is presented: sensitive information must be appropriately anonymized.

Objectives:

Identify which data within our observability stack requires anonymization to safeguard user privacy and comply with data protection regulations.
Explore methodologies for effectively anonymizing identifiable information within metrics, logs, and traces.

Key Questions:

What Needs Anonymizing?: Determine the types of data that need anonymization. This includes understanding whether user names (e.g., in RHOAI namespaces), IP addresses, or other identifiable information are public and need masking.
How to Anonymize?: Investigate possible techniques for anonymizing data, such as masking certain log string areas or traces. Considerations include whether to replace identifiable information with placeholders (e.g., "X"), delete it, or apply different methods based on user roles.
Tools and Capabilities: Assess if our current tools like OpenShift, Prometheus, Grafana, etc., offer built-in anonymization features or if we need external tools or scripts to achieve our goals.
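
To make the "How to Anonymize?" question concrete, here is a minimal sketch of placeholder masking on log lines with a role-based switch. Everything named here is an assumption for illustration: the `rhoai-<username>` namespace convention, the role names, and the `mask_line` helper are hypothetical and not taken from our stack. A real implementation would need a reviewed catalog of identifiers.

```python
import re

# Illustrative patterns only; they are not a complete catalog of identifiers.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Assumed (hypothetical) convention: namespaces look like "rhoai-<username>".
NAMESPACE_USER_RE = re.compile(r"\b(rhoai-)([A-Za-z0-9._-]+)")


def mask_line(line: str, role: str = "researcher") -> str:
    """Return the log line with identifiable parts replaced by placeholders.

    The role names are hypothetical: "admin" sees the raw line,
    every other role sees the masked version.
    """
    if role == "admin":
        return line
    # Replace every IPv4-looking token with a fixed placeholder.
    line = IPV4_RE.sub("X.X.X.X", line)
    # Keep the namespace prefix, mask the user-specific suffix.
    line = NAMESPACE_USER_RE.sub(r"\1XXXX", line)
    return line


if __name__ == "__main__":
    raw = "GET /api from 10.0.12.7 ns=rhoai-jdoe"
    print(mask_line(raw))            # masked view
    print(mask_line(raw, "admin"))   # raw view
```

Placeholder masking like this is simple but destroys the ability to correlate events belonging to the same user, which is one of the trade-offs the research should weigh against deletion or role-specific views.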

Tasks:

Data Identification: Catalog the specific pieces of information within our observability data that could potentially reveal user identities.
Methodology Research: Research and document various anonymization techniques that could be applied to our observability data.
Tool Assessment: Evaluate our current observability tools for existing anonymization features and identify any gaps that external solutions could fill.
Recommendations: Based on the research, recommend a strategy for anonymizing data that balances accessibility for authorized users with privacy and compliance requirements.

schwesig commented 7 months ago

collect links & ideas

different approach: metrics vs data science. Instead of doing data science directly on the metrics data, create a separate (already anonymized) data set database.

real time data needed?
acceptable delay
different user roles for different data sources (global admins go to metrics and logs, "data science SRE" users go to the data sets database)
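
One possible way to build such a separate data set is a batch export that replaces sensitive label values with keyed-hash tokens. Note that this is pseudonymization rather than full anonymization: the same input always maps to the same token, so series can still be correlated, and anyone holding the key can recompute the mapping. The label names, the `anon-` prefix, and both helper functions below are hypothetical.

```python
import hashlib
import hmac

# Assumed label names; the real set would come from the data identification task.
SENSITIVE_LABELS = {"namespace", "user", "instance"}


def pseudonymize(value: str, key: bytes) -> str:
    """Map a value to a stable token using HMAC-SHA256 with a secret key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; a shorter token raises the collision risk.
    return "anon-" + digest[:12]


def export_sample(labels: dict, key: bytes) -> dict:
    """Return a copy of the labels with sensitive values replaced by tokens."""
    return {
        name: pseudonymize(value, key) if name in SENSITIVE_LABELS else value
        for name, value in labels.items()
    }


if __name__ == "__main__":
    secret = b"example-key-rotate-me"  # placeholder, not a real secret
    sample = {"namespace": "rhoai-jdoe", "job": "kubelet", "instance": "10.0.12.7:9100"}
    print(export_sample(sample, secret))
```

Because the export runs as a batch, this fits the "acceptable delay" question above: the data set database only needs to be as fresh as the export interval.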

schwesig commented 1 month ago

ice box

nerc-project / operations