logpai / loglizer

A machine learning toolkit for log-based anomaly detection [ISSRE'16]
MIT License

Invariants Mining recall and precision #40

Closed riyu94 closed 5 years ago

riyu94 commented 5 years ago

Hi,

I am using the Invariants Mining (without labels) code to train on my data, and I have modified it to write an anomaly.csv from the output obtained on the training data.

Then I tried using this generated anomaly.csv, instead of the anomaly file provided in the project, to test the data with a train ratio of 0.5 (using Invariants Mining with labels).

I get results (the number of anomalies in train and test), but recall and precision are both reported as 0.00. I tried modifying the HDFS log file, but that did not help.

Please help. The code is indeed helpful, but it needs documentation. Also, can I deploy this algorithm with Apache Spark? It will be too slow otherwise.

amineebenamor commented 5 years ago

Hello, I've already used Invariants Mining and it works for me. Your message doesn't give me enough to tell where the problem is, since you made some changes to the original code. If I can have a look at your code, I can check where the problem comes from.

amineebenamor commented 5 years ago

Did you solve your problem? If so, what did you do? I didn't have time to look into it today; I was going to look at it tomorrow.

riyu94 commented 5 years ago

Yes. I was assuming that the anomaly.csv generated with a train ratio of 1 could be used to evaluate the results when testing the same input log with a train ratio of 0.5. But the list of mined invariants differs between the two cases, so the results will not be the same.

Anyway, no problem. I had another question: what kinds of anomalies is Invariants Mining good for?

  1. If we have a new type of (never seen) log message, will it be able to detect it?
  2. Or if we have a sudden spike in a particular type of message, will it be able to mark it?

amineebenamor commented 5 years ago

The Invariants Mining algorithm extracts invariants from the training set. For example, in normal executions of a system, the number of log messages indicating "Open file" is usually equal to the number of log messages indicating "Close file", because each opened file will eventually be closed. These invariants represent the normal execution flows of the system. The method also gives an easy interpretation for each anomaly, as we can check which invariant is broken. Therefore, it is good for any type of anomaly that breaks the normal execution of the system (by breaking the invariants). To answer your questions:

  1. If a new log message breaks certain invariants, we detect an anomaly that occurred during the system execution. Otherwise, it won't be flagged as an anomaly.
  2. It depends on the type of sudden spike in a particular type of message, but if it breaks an invariant, then it means that it has broken a normal execution of the system and will be considered as an anomaly. As the method is based on a log message count matrix, I think it would be considered as an anomaly in most cases.
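
To make the invariant check above concrete, here is a minimal sketch in plain Python. It is not loglizer's implementation: the event names, the single hand-written invariant, and the `broken_invariants` helper are all hypothetical, and a real miner would learn the invariants from the training count matrix instead of hard-coding them.

```python
# Hypothetical event types (columns of the event count matrix).
EVENTS = ["open_file", "close_file", "write_block"]

# Hypothetical invariants: each is a coefficient vector over EVENTS.
# An invariant holds for a session when the dot product of the
# coefficients and the session's event counts is zero.
INVARIANTS = {
    "open_file == close_file": [1, -1, 0],
}


def broken_invariants(counts, invariants=INVARIANTS):
    """Return the names of the invariants violated by one count vector."""
    broken = []
    for name, coeffs in invariants.items():
        if sum(c * x for c, x in zip(coeffs, counts)) != 0:
            broken.append(name)
    return broken


normal_session = [3, 3, 7]  # every opened file was closed
faulty_session = [3, 2, 7]  # one file was never closed

print(broken_invariants(normal_session))  # []
print(broken_invariants(faulty_session))  # ['open_file == close_file']
```

The returned names are what makes an anomaly easy to interpret: they say which expected relation between event counts failed.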

Let me know if you have more questions, I'm working on Invariants Mining also :)

riyu94 commented 5 years ago

@amineebenamor Helps :) Thank you.

zhujiem commented 5 years ago

@amineebenamor @riyu94 Many thanks for amineebenamor's support! I want to add some comments to amineebenamor's answer.

  1. Like if we have new type of (never seen) log messages, will it be able to detect?

Yes, our current implementation supports the detection of this type of anomaly. In the fit_transform method, we added two arguments, oov=False and min_count=1. When oov=True is set during preprocessing, all events that occur fewer than min_count times are marked as 'oov' (out-of-vocabulary) events. With min_count=1, normal training instances have zero oov events, while a new event at test time makes the oov count at least 1. This matches your problem. Meanwhile, we can also drop rarely occurring events during training by setting, for example, min_count=5.
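
The oov idea can be sketched in plain Python as follows. This is an illustration under assumptions, not loglizer's code: `fit_vocabulary` and `to_count_vector` are hypothetical helpers, and the sketch assumes min_count refers to the number of training sessions that contain an event.

```python
from collections import Counter


def fit_vocabulary(train_sessions, min_count=1):
    """Keep events seen in at least min_count training sessions (assumed meaning)."""
    session_freq = Counter()
    for session in train_sessions:
        session_freq.update(set(session))
    return sorted(e for e, n in session_freq.items() if n >= min_count)


def to_count_vector(session, vocabulary):
    """Counts of known events, plus a final column with the number of
    distinct out-of-vocabulary event types in the session."""
    counts = Counter(session)
    known = [counts[e] for e in vocabulary]
    oov = sum(1 for e in counts if e not in vocabulary)
    return known + [oov]


train = [["E1", "E2"], ["E1", "E2", "E2"]]
vocab = fit_vocabulary(train, min_count=1)

print(to_count_vector(["E1", "E2"], vocab))        # [1, 1, 0]
print(to_count_vector(["E1", "E9", "E2"], vocab))  # [1, 1, 1]
```

With min_count=1 every training session ends up with 0 in the oov column, so a nonzero value at test time signals an event type that was never seen during training.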

  2. Or if we have a sudden spike in a particular type of message, will it be able to mark it?

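If a single event type gets a sudden spike, it will break an invariant that involves that event. But a sudden spike in an event pair, where both events of an invariant rise together, will not, because the relation between their counts still holds.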