logpai / Drain3

A robust streaming log template miner based on the Drain algorithm

parallel log ingestions #56

Closed kwokon0ng closed 2 years ago

kwokon0ng commented 2 years ago

Dear Drain3 project,

We are very glad to have found Drain3. We tested it on a log file of about 200k lines, and it finished processing in about 10 seconds on my MacBook.

We would like to use Drain3 to process log files of the same type from hundreds of sources, keeping a single tree and state. Could you advise on the best way to parallelize log ingestion?

I looked at the code, and it seems the log-processing function add_log_message should be run single-threaded.

Best regards

davidohana commented 2 years ago

Hello and thank you,

Please read my answer to a similar question here: https://github.com/IBM/Drain3/issues/16 for possible solutions that do not require code changes in Drain3 itself.

Adding multithreading/multiprocessing support to Drain3 would be a very welcome contribution. However, I am not sure that multithreading would provide much value here because of the Python GIL, so we would need to do a PoC and measure the performance improvement. Multiprocessing with child processes and shared memory might be a better option performance-wise, but it's not trivial to implement either.

A possible direction to start with: since the vast majority of logs should match an existing template, and a new or changed template is pretty rare, it's possible to process almost all logs concurrently. Only when a child Drain3 instance detects that a log requires a change in the parse tree does it skip that log and hand it over to the main Drain3 instance, which processes it, updates the tree, and then instructs the child instances to sync their state.
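The routing idea above can be sketched roughly as follows. This is only an illustration of the control flow, not Drain3 code: `ToyMiner` is a hypothetical stand-in that masks digits instead of building a parse tree, `process` is a made-up helper name, and the loop runs sequentially, whereas a real setup would place each child in its own process and would need an actual synchronization mechanism.

```python
import re
from typing import List, Optional, Tuple


class ToyMiner:
    """Hypothetical stand-in for a template miner (NOT the Drain3 algorithm).

    It masks runs of digits and treats each distinct masked line as a template.
    """

    def __init__(self) -> None:
        self.templates = set()

    @staticmethod
    def _mask(line: str) -> str:
        return re.sub(r"\d+", "<*>", line)

    def match(self, line: str) -> Optional[str]:
        # Read-only lookup: never modifies the state.
        template = self._mask(line)
        return template if template in self.templates else None

    def add_log_message(self, line: str) -> str:
        # Training step: may add a new template to the state.
        template = self._mask(line)
        self.templates.add(template)
        return template


def process(lines: List[str], main: ToyMiner,
            children: List[ToyMiner]) -> Tuple[List[str], int]:
    """Spread lines over children; hand misses to main, then sync children."""
    results, handovers = [], 0
    for i, line in enumerate(lines):
        child = children[i % len(children)]
        template = child.match(line)
        if template is None:
            # The child cannot match this line, so the main instance trains on it.
            template = main.add_log_message(line)
            handovers += 1
            # Every child then receives a copy of the updated state.
            for c in children:
                c.templates = set(main.templates)
        results.append(template)
    return results, handovers


main = ToyMiner()
children = [ToyMiner() for _ in range(3)]
lines = ["user 1 logged in", "user 2 logged in",
         "disk 7 full", "user 33 logged in"]
results, handovers = process(lines, main, children)
print(handovers, results)
```

In this toy run only the first occurrence of each message shape reaches the main instance; whether the same ratio holds for real logs, and whether the sync cost is acceptable, would have to be measured.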

kwokon0ng commented 2 years ago

Thanks David for your response.

"it's possible to process almost all logs in concurrency, and only when one Drain3 instance detects that it requires a change in the parse tree, it will avoid this log, and hand it over to the main"

I think this suggestion makes sense. Does Drain3 have a built-in mechanism to support a main Drain3 instance that ingests template changes from child instances and updates the tree?

kwokon0ng commented 2 years ago

I think I have figured it out. The docs say to use inference mode: https://github.com/IBM/Drain3#training-vs-inference-modes

Thanks

davidohana commented 2 years ago

Correct, the match() function can be used to determine if a log already matches an existing template. However, you will have to implement the synchronization between main and child Drain instances.
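As a rough sketch of the child side, the check could look like the function below. It assumes `miner` behaves like `drain3.TemplateMiner`, whose `match()` returns a cluster object (with `get_template()`) or `None` when nothing matches. The function name `match_or_defer` and the use of a `Queue` for the handover are assumptions made for illustration; the synchronization itself is not shown and would still have to be implemented.

```python
from queue import Queue
from typing import Any, Optional


def match_or_defer(miner: Any, line: str, deferred: Queue) -> Optional[str]:
    """Return the template for `line`, or queue the line for the main instance.

    `miner` is assumed to expose match(line) like drain3.TemplateMiner.
    """
    cluster = miner.match(line)  # inference only; the tree is not modified
    if cluster is None:
        # No existing template fits: leave training to the main instance.
        deferred.put(line)
        return None
    return cluster.get_template()
```

With a real child instance this would be called as `match_or_defer(child_miner, line, queue)`, while the main instance drains the queue through `add_log_message()` and then publishes its updated state to the children.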