Closed Boris-2021 closed 2 years ago
Hi Boris, Indeed, Drain uses the word count at the root of the search tree, so a user name with multiple words will generate a new template. I think the suggestions you made are pretty fundamental and should be thoroughly tested before being integrated into Drain3. Other, simpler options you might consider: (1) Use regex masking, if possible, to pre-mask user names into a single token before Drain. (2) Consolidate multiple sequential <*> into one <*>. This could be an opt-in feature we add to Drain3.
David
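The two options above can be sketched with the standard `re` module alone. This is a minimal illustration, not Drain3 code: the function names, the `<:USER:>` mask, and the assumption that the user name always sits between "Invalid user " and " from " are hypothetical and only fit the SSH examples in this thread.

```python
import re

# Hypothetical pattern: assumes the user name is everything between
# "Invalid user " and the first " from ", as in the SSH examples here.
USERNAME_RE = re.compile(r"(?<=Invalid user )(.+?)(?= from )")


def pre_mask_username(line: str) -> str:
    """Option 1: replace a (possibly multi-word) user name with one token
    before the line is handed to the log parser."""
    return USERNAME_RE.sub("<:USER:>", line)


def consolidate_wildcards(template: str, wildcard: str = "<:*:>") -> str:
    """Option 2: collapse a run of whitespace-separated wildcards into one.

    The default wildcard matches the templates quoted in this thread;
    pass "<*>" if that is the wildcard string in use.
    """
    w = re.escape(wildcard)
    # A callable replacement avoids backslash processing of the wildcard.
    return re.sub(rf"{w}(?:\s+{w})+", lambda _m: wildcard, template)


if __name__ == "__main__":
    print(pre_mask_username("Invalid user boris zhang from 52.80.34.196"))
    print(consolidate_wildcards("Invalid user <:*:> <:*:> from <:IP:>"))
```

With option 1, one-word and two-word user names reach the parser with the same token count. Option 2 is applied to templates after clustering, so the clusters themselves stay separate unless they are merged in an additional step.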
Thank you for your advice.
Boris
In the SSH.log example, I noticed this clustering result:
<L=5> ID=2 : size=14551 : Invalid user <:*:> from <:IP:>
<L=6> ID=27 : size=30 : Invalid user <:*:> <:*:> from <:IP:>
I think these two template clusters should belong to one type. One possible reason for this clustering is that the third word is composed of letters and numbers and is processed into more than one token during masking, so the log message becomes longer. For example: Invalid user test9 from 52.80.34.196
Another possible reason is that the string that actually fills the variable in the log template is made up of two words. For example: Invalid user boris zhang from 52.80.34.196 (This log is not in the SSH.log file, but similar messages may appear in other datasets.)
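The length difference between the two example messages can be checked with a small sketch. It assumes plain whitespace tokenization and an IP mask applied before counting; `IP_RE` and `token_count` are illustrative names, not Drain3 functions, and the real masking configuration may differ.

```python
import re

# Simplified IPv4 pattern, used only to mimic the <:IP:> mask shown above.
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def token_count(line: str) -> int:
    """Mask the IP address, split on whitespace, and count the tokens."""
    masked = IP_RE.sub("<:IP:>", line)
    return len(masked.split())


if __name__ == "__main__":
    for msg in (
        "Invalid user test9 from 52.80.34.196",
        "Invalid user boris zhang from 52.80.34.196",
    ):
        print(token_count(msg), msg)
```

Under these assumptions the first message has 5 tokens and the second has 6, which matches the `<L=5>` and `<L=6>` groups in the clustering result.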
All of these log messages should be grouped into a single log template cluster, but they are not. So I wondered whether I should devise a new matching scheme. The starting points of the design: 1. Remove log message length as the first tree node. 2. Compute text similarity from the text content itself.
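One possible reading of the second design point is a similarity measure that compares token sequences directly and so tolerates different lengths. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; this is an assumption for illustration, not the measure proposed in this thread and not what Drain3 implements.

```python
from difflib import SequenceMatcher


def content_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Similarity in [0, 1] based on matching token runs, independent of
    whether the two sequences have the same length."""
    return SequenceMatcher(None, tokens_a, tokens_b).ratio()


if __name__ == "__main__":
    one_word = "Invalid user test9 from <:IP:>".split()
    two_words = "Invalid user boris zhang from <:IP:>".split()
    unrelated = "Connection closed by <:IP:> [preauth]".split()

    print(content_similarity(one_word, two_words))
    print(content_similarity(one_word, unrelated))
```

The two "Invalid user" messages share four of their tokens in order and score well above the unrelated message, so a threshold on this score could place them in one cluster. Dropping the length-based root would also remove the cheap first-level partitioning, which is one reason such a change would need thorough testing.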