hitsz-ids / duetector

duetector🔍: Data Usage Extensible Detector for data usage observability.
https://dataucon.idslab.io/
Apache License 2.0

EPIC: Towards Measurable Data Usage #97

Open wunder957 opened 10 months ago

wunder957 commented 10 months ago

Problems

This project was initially directed towards unplugged detection of data usage behaviour through eBPF technology, and I'm glad we now have an initial framework for it. But to make the probe results available to other applications (e.g. the Data Usage Controller of the DataUcon project), we need to expose the recorded results in a machine-readable format.

We also need to finish standardising the storage side of things. For large numbers of events, a traditional SQL database is not a good choice.

We don't yet have a good production example to represent our capabilities.

Status quo and Future

Relationship with storage back-end

OpenTelemetry is sought after by related projects as an open source standard for observability. Our project is far from observability in terms of observables, goals, and functions. However, it is similar to OpenTelemetry-related projects in terms of technical implementation, so we should be able to benefit from the development of OpenTelemetry and related backends.

As the project has evolved, we have completed the integration with OpenTelemetry: https://github.com/hitsz-ids/duetector/pull/82. Next, we will make OpenTelemetry our primary supported backend, and SQL databases MAY NOT be actively maintained.

We are currently using Jaeger as the first backend.

Cloud-native support

We will natively support monitoring of containers in the cloud, starting with Docker and Kubernetes (k8s).

How to expose data

We will first build a querier for the Jaeger backend to restore the tracer data from the backend, and then implement an analytics engine that analyses the tracer data to derive a picture of how a process is using the data. We will refer to this process as the measurement of data usage.
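To make the querier-plus-analytics idea more concrete, here is a minimal sketch of the analysis step only. It assumes the querier has already fetched traces in the JSON shape that Jaeger's query service returns (a `data` list of traces, each with `spans` and `processes`). The function names, the `file_open` operation name, and the `path` tag are hypothetical and are not taken from duetector; the sample data is hand-written, not real probe output.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def span_tags(span: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten Jaeger's list of {key, value} tag objects into a plain dict."""
    return {tag["key"]: tag["value"] for tag in span.get("tags", [])}


def measure_usage(traces: Iterable[Dict[str, Any]], path_tag: str = "path") -> Counter:
    """Count (service, operation, path) triples across all spans.

    `path_tag` is a hypothetical tag name; a real tracer may record the
    accessed file under a different key.
    """
    usage: Counter = Counter()
    for trace in traces:
        processes = trace.get("processes", {})
        for span in trace.get("spans", []):
            process = processes.get(span.get("processID"), {})
            service = process.get("serviceName", "unknown")
            path = span_tags(span).get(path_tag, "<none>")
            usage[(service, span["operationName"], path)] += 1
    return usage


# Hand-written sample shaped like a Jaeger query response (assumption).
sample_response = {
    "data": [
        {
            "traceID": "t1",
            "processes": {"p1": {"serviceName": "duetector"}},
            "spans": [
                {"operationName": "file_open", "processID": "p1",
                 "tags": [{"key": "path", "type": "string", "value": "/data/mnist.npz"}]},
                {"operationName": "file_open", "processID": "p1",
                 "tags": [{"key": "path", "type": "string", "value": "/data/mnist.npz"}]},
                {"operationName": "file_open", "processID": "p1",
                 "tags": [{"key": "path", "type": "string", "value": "/etc/hosts"}]},
            ],
        }
    ]
}

usage = measure_usage(sample_response["data"])
for key, count in usage.most_common():
    print(key, count)
```

Keeping the aggregation independent of how traces are fetched would let the same analysis run against another OpenTelemetry-compatible backend later, as long as the querier normalises its output to this shape.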

Production example

We previously accepted a machine learning case for MNIST that included analysis and associated probing points for data usage behaviours: https://github.com/hitsz-ids/duetector/pull/84. I think we could start with this case to demonstrate our data usage measurement capabilities.

Other maintenance

Rather than splitting the project into a querier and a detector (at least not in the near future), we'll build two different images based on the same Python package (duetector). We already have a separate CLI entry point, so I'm sure this won't be difficult.

In addition, we need to update the README and the design document so that they assume OpenTelemetry as the backend.

Roadmap

This EPIC will be released as version 1.0.0. Before that, the features described above will be integrated gradually in 0.x.y releases.

Regarding data use measurability, I am working on some related blogs (in Chinese).

Wh1isper commented 8 months ago

Due to personal reasons, I (aka @wunder957) will be leaving the project for a while. There is no one actively maintaining the project at the moment. If you are interested in getting involved, feel free to contact me or any member of hitsz-ids.