ARIMA performance tracker

Nyrio commented 3 years ago

The performance of ARIMA is currently far from the speed of light. We have focused on adding features so far. This issue is a tracker of performance optimizations that can be done in ARIMA if we resume the work on this model.

ARIMA

[x] P0: remove various unnecessary allocations and memory transfers - #3895
[x] P0: single-kernel Kalman filter - #4006
[ ] P1: hardcoded loops instead of MM/MV
[ ] P1: move the optimizer to C++, and the parameters to device arrays
[ ] P1: float32 support - #1256
[ ] P1: complete code refactor and simplification of call tree (replace with a class?)
[ ] P1: steady-state detection in the Kalman filter
[ ] P2: alternative initialization methods to bypass the expensive Lyapunov equation?
[ ] P2: persistent Kalman matrices R, T: re-write only the variable part
[ ] P2: provide copy/move semantics to Batched::Matrix - #1692 (or remove operator overloads in favor of functional operations!)

Auto-ARIMA

[ ] P0: CUDA STL decomposition - #2235

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

singlecheeze commented 1 year ago

@Nyrio I am running ARIMA across a fleet of GPUs and noticed this post from a few years ago.

Will the P1/P2 items be addressed?

I've noticed a number of times a debug message printed:

[W] [09:52:05.872442] fit: Some batch members had optimizer problems

Does this have something to do with, "move the optimizer to C++, and the parameters to device arrays" above?

Additionally, I've seen a consistent relationship between high single thread CPU utilization of the python process and an active kernel running on a GPU. Is this due to the optimizer running on CPU (Even though it seems to persist for the duration of the batch of series)?

Lastly, I've also noticed that there is very little difference in average batch duration for a set batch of series and length (5000 by 120) across a number of GPU types/architectures. All the gpu's below are "fed" from:

CPU: 48x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GH

Do you have a hypothesis of why this would be?

AutoARIMA

ARIMA

Nyrio commented 1 year ago

Hi @singlecheeze , If you're running simple ARIMA (non-seasonal), the performance is likely limited by the Python-side optimizer or GPU underutilization. For small batch sizes like 20 it is probably the latter because we do not fill the GPU (but I would need a profiling timeline to know for sure). For non-seasonal ARIMA, you need batches of a few hundred series to utilize it. For seasonal ARIMA smaller batches are generally sufficient. If the bottleneck is the CPU optimizer or if the batch size is too small to fill the GPU, it's normal not to see much difference between GPU models.

The message you're seeing indicates that some batch members didn't converge. This is not related to performance or whether the optimizer is implemented in Python or C++.

Regarding the P1/P2 items, this work is currently on hold. If you have a good use case for batched ARIMA, feel free to reach out to tell us more about it.

singlecheeze commented 1 year ago

@Nyrio thank you for your insight and prompt response!

I'll post a profile here in a few minutes, but a quick question, when you say, "fill the GPU", is this in regard to memory, or the dynamic of SM/Warp/Blocks and stuff?

If in regard to Stream Multi-Processors, I have attempted to get metrics on this though it seems quite challenging. The nvidia-smi only gives a utilization metric based on time there is an active kernel (application/process), so it is very hard to get an idea of the amount of utilization of the actual count of SMs on any given card. Is there a better way? (I know there is a very large amount of discussion that could be had on CUDA optimization, I just don't know if there is an easy metric I am missing somewhere...)

Nyrio commented 1 year ago

Ah, sorry, what I meant by "fill the GPU" is to give it enough work to efficiently take advantage of it. To put it simply if you have a GPU with 80 SMs and the main kernel for ARIMA only submits 1 block, 79 SMs are completely inactive.

Getting metrics on this is usually done with the Nsight profiling tools (Nsight Systems to get a timeline, Nsight Compute to analyze each kernel in detail). But this is mostly useful for the developer of the algorithm who knows the implementation. It is of little use if you don't know the implementation details. I hope this helps.

singlecheeze commented 1 year ago

@Nyrio IT HELPS A TON!!! :) I would love to see a doc on more granular detail like this, or even better, some additional API logic that can scale/fully utilize the resources available on the GPU (I have no idea if this is done today?). I've thought of this approach external to the cuML API, but it's hard to measure without more granular "GPU internals" metrics. I'm far from understanding the dynamic between blocks/warps, and how kernels get distributed on the GPU.

I had a sneaking suspicion that I was only invoking ONE SM on each GPU... running Nsight compute now.

Nyrio commented 1 year ago

We try to always deliver the best performance for the user without them having to worry about implementation details. For an algorithm like batched ARIMA which delivers performance at scale, we documented that it is best used with large batches.

While profiling tools are most useful for algorithm developers to improve the performance of the implementation, and less to the end users of the library, you seem interested in trying those tools, so I can recommend starting with NSight Systems to get a nice overview of the application performance. You see the kernels, their execution times, CUDA API calls, launch configurations, etc. But it will really not be of much use without knowing the implementation details.

singlecheeze commented 1 year ago

@Nyrio you are 100% correct, that what the below means to me, I have no idea!

Taking another approach (I'm very grateful for your input): What is the recommended method of invoking the GPU? Should I be submitting more batches (Maybe one batch per SM) if I know my batches are closer to the 20 series size?

There is a certain cost/benefit analysis of batch number of series and how quickly a forecast completes. I understand there are optimizations for a few hundred non-seasonal series like you stated above (This is a great piece of info!), but since this is not multi-GPU, and for my use case the product of the forecast is somewhat time sensitive (within 15 minutes or so is my target), I don't want to submit 300 series to one GPU, when it maybe beneficial from a time perspective to submit one, 100 series batch to 3 GPUs.

Most of my series are about 5000 long. I have arbitrary limited each batch to 120 due to completion time and what looks like a bug I will include below.

Nyrio commented 1 year ago

By 5000 long, do you mean 5000 observations? What I refer to by batch size is the number of series you pass to the algorithm at once. Say you have 10k series of length 5000, for example, that would be calling the algorithm 10 times with 1000 series of size 5000 (here batch_size=1000, n_obs=5000).

Regarding the choice of batch size, it is a matter of experimentation, but here is some advice: up to a certain point, the fitting time does not scale linearly with respect to the batch size. E.g 10 series takes 0.5s, 100 series take 1s. Increase the batch size as long as this is the case. This is because GPUs can process many series in parallel, so you need to find a batch size big enough to "fill the GPU". Typically it will be a few hundreds or even thousands.

If you have a bug, please open a new issue for it. This discussion is interesting but off-topic on the current issue.

singlecheeze commented 1 year ago

Ahhh yes, ok, so my working set is about 11k series, batch_size (maximum, this is configurable)=120, n_obs=5000.

I tried to go above this but hit some issue (In this case submitting, batch_size=520, n_obs=922, will open separate bug as you describe):

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Nyrio commented 1 year ago

I tried to go above this but hit some issue (In this case submitting, batch_size=520, n_obs=922, will open separate bug as you describe)

Yes, please open a bug and tag me. If you can make a sample with fake data to reproduce the issue, that would be helpful.

singlecheeze commented 1 year ago

@Nyrio another question in an effort to drive up the number of series in each batch... I see this note in the docs:

My series are all contiguous but right now I have to filter series by length in order to submit a consistent batch to make sure all the series have the same length. Will ARIMA run if series of different lengths are "padded" at the beginning with NaN? If so, what is this datatype? Numpy NaN? https://numpy.org/doc/stable/reference/constants.html#numpy.NAN

Nyrio commented 1 year ago

Will ARIMA run if series of different lengths are "padded" at the beginning with NaN?

Yes, ARIMA supports padding with missing values at the beginning if you have series of different lengths. I think numpy.NAN should work.

singlecheeze commented 1 year ago

@Nyrio it seems something isn't right with series padded with NaN (At least with cuDF input), see: #4967

Your help is greatly appreciated!

singlecheeze commented 1 year ago

@Nyrio for the large series items we were discussing above, very puzzling: #4968

rapidsai / cuml

ARIMA performance tracker #2912

ARIMA

Auto-ARIMA