stefanDeveloper / heiDGAF

heiDGAF - a machine learning based DNS inspector to detect DGAs in the wild!
https://heidgaf.readthedocs.io
European Union Public License 1.2

Maybe move timestamp extraction to Inspector #10

Closed by lamr02n 1 month ago

lamr02n commented 2 months ago

Currently, the extraction of the begin_timestamp and end_timestamp is done by the Batch Sender. Since we updated the way we extract the timestamps, we could move this step to a later stage (for example the Inspector, in which the timestamps are needed). This could reduce message size because we would not need to send them as metadata.

stefanDeveloper commented 1 month ago

Well, technically, no. There's a flaw in the reasoning here. When the application starts, the sliding window should start, and it should stop when we reach 1000 lines or when the timeout expires. Technically, we could take the timestamp of the first message as the beginning; however, it would not represent the window.
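The window semantics described here can be sketched roughly as follows. Note that `Batch`, `MAX_LINES`, and `TIMEOUT_S` are illustrative names, not the project's actual API; the point is only that `begin_timestamp` marks when the window opens, not when the first message arrives:

```python
import time

MAX_LINES = 1000   # close the window after this many lines ...
TIMEOUT_S = 5.0    # ... or after this many seconds, whichever comes first


class Batch:
    """Hypothetical sliding-window batch (a sketch, not heiDGAF's code)."""

    def __init__(self):
        # begin_timestamp is the moment the window OPENS. Taking the
        # first message's own timestamp instead would misrepresent the
        # window whenever that message arrives late.
        self.begin_timestamp = time.time()
        self.end_timestamp = None
        self.lines = []

    def add(self, line):
        self.lines.append(line)

    def is_full(self):
        return (len(self.lines) >= MAX_LINES
                or time.time() - self.begin_timestamp >= TIMEOUT_S)

    def close(self):
        # end_timestamp is the moment the window closes, not the last
        # message's timestamp.
        self.end_timestamp = time.time()
        return self.begin_timestamp, self.end_timestamp
```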

lamr02n commented 1 month ago

> There's a flaw in the reasoning here. When the application starts, the sliding window should start, and it should stop when we reach 1000 lines or when the timeout expires. Technically, we could take the timestamp of the first message as the beginning; however, it would not represent the window.

So, we could revert to the first implementation? Before we used the timestamps from the data, we stored our timestamps based on when messages entered or left the pipeline. I still see the problem that we're not using the timestamp of the window in which the log lines were recorded, but the timestamps from the execution of our algorithm. The timestamps in the log lines might be from another time (e.g., they could originate from a file).

stefanDeveloper commented 1 month ago

> So, we could revert to the first implementation?

We could elaborate on this in the next meeting.

> I still see the problem that we're not using the timestamp of the window in which the log lines were recorded, but the timestamps from the execution of our algorithm.

When we find something, we want to output all information to the user, especially the timestamps. Maybe in the future, we will add databases for profiling/monitoring, where timestamps come into play.

> The timestamps in the log lines might be from another time (e.g., they could originate from a file).

Yes, we could run into the problem of reading data from an old file. I guess we should therefore only consider newly added lines instead of reading the whole file.
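Considering only newly added lines could work like `tail -f`: skip everything already in the file and yield only what is appended afterwards, so stale timestamps from old records are never ingested. This is a minimal sketch under the assumption of a plain-file input; the pipeline may well read from another source:

```python
import time


def follow(path, poll_interval=0.5):
    """Yield only lines appended to `path` after this call (a sketch).

    Pre-existing lines (and their possibly stale timestamps) are skipped
    by seeking to the end of the file before reading.
    """
    f = open(path, "r")
    f.seek(0, 2)  # whence=2: jump to end of file, skipping old content

    def _gen():
        while True:
            line = f.readline()
            if not line:
                # No new data yet; wait briefly and try again.
                time.sleep(poll_interval)
                continue
            yield line.rstrip("\n")

    return _gen()
```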

lamr02n commented 1 month ago

The proposed change creates a problem for the data in later stages. Currently, begin_timestamp and end_timestamp are extracted before filtering, which ensures the timestamps cover the full window, including records that are later discarded. If we moved the extraction to a later point (essentially after filtering), we might throw away relevant data points and narrow the window. We will leave the implementation as it is, since propagating the two timestamps through the pipeline adds virtually no overhead.
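The effect of extracting before versus after filtering can be illustrated with a toy example (field names and the filter rule are made up for illustration): when the filtered-out record carries the latest timestamp, extracting after filtering shrinks the window.

```python
# Hypothetical records; "ts" and "domain" are illustrative field names.
records = [
    {"ts": 100, "domain": "a.example.com"},
    {"ts": 200, "domain": "b.internal"},  # will be filtered out below
]

# Extraction BEFORE filtering (current behaviour): the true window.
begin_before = min(r["ts"] for r in records)  # 100
end_before = max(r["ts"] for r in records)    # 200

# A made-up filter step that drops internal domains.
filtered = [r for r in records if not r["domain"].endswith(".internal")]

# Extraction AFTER filtering: the dropped record carried the latest
# timestamp, so the window end is now wrong.
end_after = max(r["ts"] for r in filtered)    # 100, not 200
```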