markNZed / ARTimeNAB.jl

ARTime detector for the Numenta Anomaly Benchmark
GNU Affero General Public License v3.0

[Q] processing larger batch - initial phase #1

Open earthgecko opened 2 years ago

earthgecko commented 2 years ago

Hi @markNZed

Firstly, congratulations on taking 1st place on the NAB scoreboard, that is quite an achievement :tada: and great contribution.

It is quite amazing that Julia code can run directly in Python, a fine testament to a community effort.

I have a few questions that perhaps you could answer for me.

  1. The implementation here, or more specifically in https://github.com/markNZed/NAB/tree/ARTimeNAB, is aimed at running through the data set iteratively (as per NAB) to score each data point. Is it possible to process the data set in large batches? For example, could one process 90% of the data in one shot for training/learning, without being concerned with anomaly scores (p) in this phase, and then iterate over the last 10% of the data set as per the ARTimeNAB method to determine anomaly scores (p)?

  2. If so, would that be quicker than the iteration method?

I did a test passing a list of values to jl.ARTime rather than a single value, and it returned an object with all the expected data, just as if it had been given one value. I then iterated over the final part of the data and did not get the expected result (an anomaly which is present with the iterative method). So the method I tried does not work, and I am wondering if there is a way to do it that will.
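To make the two-phase idea in question 1 concrete, here is a minimal Python sketch. It does not use ARTime or its API: `RunningZScore` and `warm_up_then_score` are hypothetical names, and the detector is a simple running z-score chosen only as a stand-in for any online detector that learns from each sample it sees. Note that the warm-up phase here still feeds samples one at a time and merely discards the scores; whether ARTime can learn from a whole batch in a single call is exactly the open question.

```python
import math


class RunningZScore:
    """Hypothetical stand-in for an online detector (this is NOT ARTime)."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations (Welford)

    def process(self, value):
        # Score the value against what has been learned so far.
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            score = abs(value - self.mean) / std if std > 0 else 0.0
        else:
            score = 0.0
        # Then learn from the value (online update).
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
        return score


def warm_up_then_score(values, warm_fraction=0.9):
    """Learn on the first part of the data, return scores for the rest only."""
    detector = RunningZScore()
    split = int(len(values) * warm_fraction)
    for v in values[:split]:
        detector.process(v)  # learning phase: the score is discarded
    return [detector.process(v) for v in values[split:]]
```

For example, `warm_up_then_score(series)` on a 100-point series returns 10 scores, one for each of the last 10 points, while the detector state reflects all 100 points by the end.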

markNZed commented 2 years ago

Hi @earthgecko

Thanks! It would probably be better to use the NAB repository instead of https://github.com/markNZed/NAB. I created that fork to make the pull request for Numenta's NAB repo, and it may be out of date.

  1. During development of ARTime I was using a pure Julia version of the benchmark (without many of the NAB features) that was running in batch mode. It would run each timeseries, collect the anomaly output, then score at the end. This also meant the timeseries could be run in parallel (Julia makes that easy).
  2. The ARTime benchmark in the dev environment ran faster than the scoring step of NAB. Because ARTime is doing online learning there is not much overhead for the anomaly indication. I expect it would run much faster if it was not online learning but I never tried that.
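The batch-style workflow described in point 1 (run each timeseries to completion, collect the output, score at the end, with series running in parallel) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Julia dev environment: `run_series` and `run_benchmark` are hypothetical names, the detector is a placeholder running z-score rather than ARTime, and the final "scoring" is a plain threshold rather than NAB's scoring.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def run_series(values):
    """Placeholder detector (NOT ARTime): one online z-score per data point."""
    n, mean, m2, scores = 0, 0.0, 0.0, []
    for v in values:
        # Score against the state learned from earlier points only.
        std = math.sqrt(m2 / (n - 1)) if n >= 2 else 0.0
        scores.append(abs(v - mean) / std if std > 0 else 0.0)
        # Online update of the running mean and variance (Welford).
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
    return scores


def run_benchmark(series_by_name, threshold=3.0):
    """Run every series independently, then apply the scoring step at the end."""
    names = list(series_by_name)
    with ThreadPoolExecutor() as pool:
        # Each series gets its own detector state, so runs do not interact.
        all_scores = list(pool.map(run_series, (series_by_name[k] for k in names)))
    # Scoring happens only after all detector output has been collected.
    return {
        name: [i for i, s in enumerate(scores) if s > threshold]
        for name, scores in zip(names, all_scores)
    }
```

Threads are used here only to keep the sketch short. For CPU-bound pure-Python detectors a process pool would be the more realistic choice, since threads will not give a real speed-up under the GIL.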

This version of the code is stripped down and simplified, the dev environment was a mess :) The main goal of ARTimeNAB was to demonstrate ARTime conforms to the NAB rules, so it processes one sample at a time. It is missing some performance optimizations in the algorithm and does not have an option for batch processing.

I'd be interested to know more about your project. Are you trying to improve the overall runtime of NAB or just ARTime? I believe NAB does support a batch mode; I think some of the other detectors are using it. I did not use it as I guessed it would then require a code review to check that the detector is following the rules. Cheers.

earthgecko commented 2 years ago

Hi @markNZed thanks for the response.

I am not really interested in the NAB side, I am interested purely in ARTime. I would like to test running ARTime in Skyline as an additional algorithm. The testing I have done with ARTime so far is very promising, it is a great algorithm compared to most :) It is very seldom one finds a new algorithm in the domain that lives up to the hype; many are just slight modifications of existing methods and even fewer are suitable for running in real time on streaming data. ARTime appears to be one of the few.

I will have to dive into Julia a bit and tinker. Ideally I would like to port the algorithm to Python and understand it better, but that may be a few lines too far :) That said, if it works and works very well, a "not made here" mentality is limiting. Having a Julia dependency does not bother me too much, seeing as Python and Julia seem to work together seamlessly, and if the results are worth it, which in this case I think they probably will be, that can be tested. Skyline has a SNAB module which allows for testing with real, real time data and runs alongside the normal analysis pipeline, so that one can really assess and score the performance of an algorithm in a production setting with real data, in real time, rather than toy data.

I would be very grateful if you could share any of the dev environment optimisations that you mention, or even just point me in the direction where they may be gained. I shall try and tinker with ARTime.jl and see where I get to :) You cannot rush time series ;)

And once again Mark, a really heartfelt congratulations, from one NAB'er to another. Having NAB'ed myself, I know just how great an achievement this is, considering you are the first to ever beat HTM!!! Just wrangling NAB is a feat in itself :) I do not know how many papers there have been over the years describing new algorithms and specifically referencing NAB (Google Scholar says 430), but next to nil have been added to the scoreboard, let alone topped it! I realise this is not you alone but the product of many, as is true in most cases. However, you were the one that saw the potential, had the vision and took that step on to the top of the Everest of anomaly detection in time series ;) In "our" world that is a phenomenal achievement!

You just beat the 8 X world champion! Knocking Numenta HTM off the podium is like beating:

I just wanted to give some credit where credit is due. Given the fairly common social norm of the "no one is interested in your anomaly detection" paradigm at parties (and home), I am sure that you did not step up onto a podium, get handed a bottle of champagne and spray a crowd of cheering supporters :1st_place_medal:

image