Hello @RooieRakkert, I'll answer your points one by one.
Yes, it contains the template IDs of parsed HDFS logs.
Each line represents the log sequence (list of log keys) associated with a specific block ID. So you need to split your logs per block ID, since the blocks are independent of each other.
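Roughly, the grouping looks like this (a minimal sketch; the CSV filename and the `Content`/`EventId` column names are assumptions about the parser output, so adjust them to your own schema):

```python
import re
from collections import defaultdict

import pandas as pd

# Assumed parser output: one row per log line, with the raw message in
# "Content" and the matched template ID in "EventId".
df = pd.read_csv("HDFS.log_structured.csv")

sequences = defaultdict(list)
for content, event_id in zip(df["Content"], df["EventId"]):
    # HDFS block IDs look like "blk_" followed by a (possibly negative) number.
    for block_id in set(re.findall(r"blk_-?\d+", content)):
        sequences[block_id].append(event_id)

# sequences[block_id] is now an independent log-key sequence for that block.
```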
I think you're talking about the DeepLog paper. The workflow diagram hasn't been implemented in this repo; in the paper it is constructed using the outputs of the LSTM.
You train a separate LSTM for every unique log key, using the associated parameter values as input, and predict the next parameter values.
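As a rough sketch of that idea (not the implementation from the paper; the layer sizes and the 3-dimensional parameter vector are arbitrary placeholders):

```python
import torch
import torch.nn as nn

class ParamLSTM(nn.Module):
    """One such model per unique log key: given a window of previous
    parameter value vectors, it predicts the next one."""
    def __init__(self, n_params, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_params, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_params)

    def forward(self, x):           # x: (batch, window, n_params)
        h, _ = self.lstm(x)
        return self.out(h[:, -1])   # predicted next parameter vector

model = ParamLSTM(n_params=3)
loss_fn = nn.MSELoss()              # trained as plain regression on the next vector
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

pred = model(torch.randn(8, 5, 3))  # 8 windows of 5 steps, 3 parameters each
# At detection time, a large prediction error on the real next vector
# flags the corresponding log entry as anomalous.
```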
I haven't built it, so I can't answer your question :)
Hi @amineebenamor, thanks for your reply! Ah, that's foolish of me, I forgot to split my logs at the machine level. Will fix this ASAP. I was indeed talking about the DeepLog paper; I first contacted one of the authors and they referred me to this repo. My mistake for not checking. Thanks anyway 👍
@amineebenamor Thanks for your answer! @RooieRakkert Your questions are mainly about the DeepLog paper, which we have not implemented; the authors should know more about the paper's technical details. Regarding the last question, I think a feasible approach is to embed the one-hot vectors with an embedding layer.
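For instance, something along these lines (just a sketch; the vocabulary size and embedding dimension are placeholders):

```python
import torch
import torch.nn as nn

vocab_size = 50_000   # assumed number of distinct parameter values (e.g. filenames)
embed_dim = 32        # dense dimension replacing the 50,000-dim one-hot vector

embedding = nn.Embedding(vocab_size, embed_dim)

# Each parameter value is mapped to an integer index and then to a dense
# 32-dimensional vector that can be fed to the LSTM input.
idx = torch.tensor([1234])
dense = embedding(idx)   # shape: (1, 32)
```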
@ShilinHe thanks for your reply, that is indeed the exact approach I'm working on. My main concern with the embedding, though, is that some information will be lost (due to the abstract, dimensionality-reduced representation), which might be a problem for the explainability of the proposed solution, especially considering the nature of the problem (system log parsing).
I don't think there is much information loss, because the embedding space is expressive enough to cover the information that the one-hot vectors reveal. I also agree with you that the embedding vectors are not interpretable, but even if you do not use an embedding, with an LSTM you cannot infer a rule that explains the model's prediction. This is actually a tradeoff: deep neural networks can give you higher accuracy at the cost of explainability.
With great interest I've read (nearly) all of the papers released by this research group. I've found them a great resource, as together they give a broad view of the area of automated log parsing.
I've been working on an implementation for automated log parsing. Thus far I've adopted the Drain algorithm to generate log templates and trained an LSTM to detect anomalies within the sequences of log keys. It seems to work great!
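For context, the core of my detector looks roughly like this (simplified; the number of keys, layer sizes and the top-k threshold are just placeholders):

```python
import torch
import torch.nn as nn

class KeyLSTM(nn.Module):
    """Predicts the next log key (template ID) from a window of previous keys."""
    def __init__(self, n_keys, embed_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_keys, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_keys)

    def forward(self, x):            # x: (batch, window) of log-key indices
        h, _ = self.lstm(self.embed(x))
        return self.out(h[:, -1])    # logits over the next log key

model = KeyLSTM(n_keys=29)                       # number of templates is a placeholder
logits = model(torch.randint(0, 29, (4, 10)))    # 4 windows of 10 keys each
observed = torch.tensor([3, 7, 0, 12])           # the keys that actually followed

# A window is flagged as anomalous when the observed key is not among the
# top-k keys predicted by the model.
topk = logits.topk(9, dim=1).indices
anomalous = ~(topk == observed.unsqueeze(1)).any(dim=1)
```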
I still have some questions regarding the project, I hope you could answer these for me:
The datasets (found here: training, test - normal, test - abnormal) contain log keys (or template IDs) of parsed HDFS logs, is that correct?
Is there a specific reason that the log keys within these datasets are separated by newlines? I.e., does every line describe the log keys from a specific time bucket? In my implementation I haven't bucketed any of the data; I'm using a sliding window to generate sequences of log keys over the entire dataset of log keys (a small sketch of this windowing is included after these questions).
Could you elaborate on how the workflow diagram is constructed? Did you use the raw logs, or the parsed log keys plus parameters, to construct the diagram?
Could you elaborate on how the parameter time-series anomaly detection is set up? The way I interpret it, a specific LSTM is trained for every unique log key, is that correct?
What kind of representation did you use to build such a model? Several parameters, such as filenames, have a high variety (most of the names are unique). If these are converted to one-hot encodings, we end up with a very high-dimensional (sparse) vector representation, making it quite computationally expensive.
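For reference, the windowing I mentioned above is essentially this (a tiny sketch; the example keys and window length are arbitrary):

```python
def sliding_windows(keys, window=10):
    """Yield fixed-length input windows and the key that follows each one."""
    for i in range(len(keys) - window):
        yield keys[i:i + window], keys[i + window]

# Applied here to the whole concatenated key stream, not to per-block
# (or per-machine) sequences.
keys = [5, 5, 22, 11, 9, 11, 9, 11, 26, 26, 23, 30]
pairs = list(sliding_windows(keys, window=4))
```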
I hope you can provide me with some answers to my questions. A big thanks and thumbs up for the great research you guys are doing!