logpai / Drain3

A robust streaming log template miner based on the Drain algorithm

Restrictions on matching mode #66

Closed by Boris-2021 2 years ago

Boris-2021 commented 2 years ago

In the SSH.log example, I noticed this clustering result:

<L=5> ID=2 : size=14551 : Invalid user <:*:> from <:IP:>
<L=6> ID=27 : size=30 : Invalid user <:*:> <:*:> from <:IP:>

I think these two template clusters should belong to one type.

One possible reason for the split is that the third word is composed of letters and numbers and is rewritten during masking, so the masked message ends up with more tokens. For example:

Invalid user test9 from 52.80.34.196

Another is that the string in the variable position of the template actually consists of two words. For example:

Invalid user boris zhang from 52.80.34.196

(This log is not in the ssh.log file, but a message like it may appear in other datasets.)

All of these log messages should be grouped into a single log template cluster, but they are not. So I wondered whether I should devise a new matching mode. The starting points of the design:

1. Drop the log message length as the first tree node.
2. Compute text similarity from the text content itself.
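The length-based split described above can be illustrated with a minimal sketch. This is not Drain3 code: `bucket_by_token_count` is a hypothetical helper that only imitates grouping by token count at the first tree level, assuming plain whitespace tokenization and no masking.

```python
from collections import defaultdict


def bucket_by_token_count(messages):
    """Group messages by their whitespace token count.

    Hypothetical helper: it imitates only the first level of a
    length-keyed search tree, not the rest of the Drain algorithm.
    """
    buckets = defaultdict(list)
    for message in messages:
        buckets[len(message.split())].append(message)
    return dict(buckets)


messages = [
    "Invalid user test9 from 52.80.34.196",
    "Invalid user boris zhang from 52.80.34.196",
]

buckets = bucket_by_token_count(messages)
for token_count, group in sorted(buckets.items()):
    print(token_count, group)
```

The two messages fall into different buckets (5 and 6 tokens), so no later similarity check ever compares them, which is consistent with the separate <L=5> and <L=6> clusters in the result above.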

davidohana commented 2 years ago

Hi Boris,

Indeed, Drain uses the word count at the root of the search tree, so a user name with multiple words will generate a new template. I think your suggestions are pretty fundamental changes and should be thoroughly tested before being integrated into Drain3. Other, simpler options you might consider:

(1) Use regex masking, if possible, to pre-mask usernames into a single token before Drain.
(2) Consolidate multiple sequential <*> into one <*>. This could be an opt-in feature we add to Drain3.

David
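Both options can be sketched in plain Python with the standard `re` module. Everything here is an assumption made for illustration, not Drain3 API: the `USER_PATTERN` regex, the `<:USER:>` mask name, and the helpers `premask_username` and `consolidate_wildcards` are hypothetical, and the pattern only covers the "Invalid user ... from ..." message shape discussed in this issue.

```python
import re

# Hypothetical pattern: treat everything between "Invalid user" and the
# trailing "from <address>" as the user name, however many words it has.
USER_PATTERN = re.compile(r"^(Invalid user) .+ (from \S+)$")


def premask_username(message):
    """Option 1: collapse the user name into one token before mining."""
    return USER_PATTERN.sub(r"\1 <:USER:> \2", message)


def consolidate_wildcards(template, wildcard="<*>"):
    """Option 2: merge runs of adjacent wildcard tokens into a single one."""
    token = re.escape(wildcard)
    run = re.compile(rf"{token}(?:\s+{token})+")
    # A callable replacement avoids any escape processing of the wildcard.
    return run.sub(lambda _match: wildcard, template)


masked_one = premask_username("Invalid user test9 from 52.80.34.196")
masked_two = premask_username("Invalid user boris zhang from 52.80.34.196")
merged = consolidate_wildcards("Invalid user <*> <*> from <*>")

print(masked_one)
print(masked_two)
print(merged)
```

With pre-masking, both messages become the same five-token string, so they would reach the same length bucket. Consolidation instead works on template strings after mining: it makes the two templates read the same, but on its own it does not merge the underlying clusters.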

Boris-2021 commented 2 years ago

Thank you for your advice.

Boris