sonos / tract

Tiny, no-nonsense, self-contained, Tensorflow and ONNX inference

Instructions on training new cost_models #909

Open VariantXYZ opened 1 year ago

VariantXYZ commented 1 year ago

It would be great to take advantage of the cost_model setup for arbitrary ARM CPUs (like the A57) and generate better cost models for them, but I'm not really sure of the procedure.

Digging in a little, it looks like the cost_model binary gets run on the platform, and the resulting data gets processed by the train script, which generates the file. Seems straightforward, but there are a lot of parameters that I'm not really sure about…

kali commented 1 year ago

Yeah, it feels a bit like a dark art. The Cortex-A57 is an out-of-order chip, so the "_gen" variants will probably work pretty well. There is a simpler test you can do if you have an actual device and model, before we go all the way through data collection and training:

1/ For a model of interest, on the device of interest, get a profile with: tract model.onnx -O dump --info --cost --profile --allow-random-input

2/ Pick the LirMatMulUnary(s) of interest (the cost and profile columns on the left will tell you which one or ones to look at first). Grab the "Mult:" info line. It gives you the m, k and n parameters, and the implementation that was actually chosen.

3/ Run cost_model time [m] [k] [n] on the device. It will try all available implementations and time them. You can check whether some implementations consistently and significantly outperform the heuristic choices that were made... and from there we can see whether it looks worth pushing this further.
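Putting the two device-side commands together (the model path is a placeholder, and the m/k/n values are just one triple of the kind the profile will report):

```
# 1/ profile the model on the device, with per-node cost and timing info
tract model.onnx -O dump --info --cost --profile --allow-random-input

# 3/ time every available kernel implementation for a given m k n triple
cost_model time 64 384 1
```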

VariantXYZ commented 1 year ago

Thanks for the reply.

I spent some time just messing around with it and the information is fantastic.

Most big functions were using their "optimal" kernels, but I did run into an odd size that took up ~10%, with 64 x 384 x 1. It used the 64x1 gen kernel, but cost_model predicted 16x4 would be better by a large margin.

Maybe just overriding this would be good enough. The other large ones seemed to be correct.

VariantXYZ commented 1 year ago

Hm… though I wonder if 64x1 isn’t being tested?

Ah, that was indeed the case. It wasn't part of the default plug ops, so putting it in there worked.

VariantXYZ commented 1 year ago

Ok! After plugging it in, looking over every matmul m/n/k, and checking with cost_model, it mostly lines up, with a few notable issues:

32x9x96 (3% of total time) uses 8x8, but 16x4 is ~3% better (not an issue, it’s very minor)

32x128x1 (5.3% of total time) uses 64x1 but 16x4 is 36% better.

1x256x1 (1.2% of total time) uses 64x1 but 8x8 is 68% better.

So it's fairly accurate, but there's definitely a little performance up for grabs if it's not too difficult to set up a cost model specific to this use-case.

Edit: some more tests revealed a few more small increases, which aren’t too large but they add up to a few percent here or there.

kali commented 1 year ago

32x9x96 (3% of total time) uses 8x8, but 16x4 is ~3% better (not an issue, it’s very minor)

That one honestly baffles me. I've seen it before, so your measurements are probably correct. It does not make sense in my mental framework of how fast these things run... a square kernel should be faster than a skinny one. I must be missing something somewhere, but that's a discussion for another day :)

32x128x1 (5.3% of total time) uses 64x1 but 16x4 is 36% better.

1x256x1 (1.2% of total time) uses 64x1 but 8x8 is 68% better.

Ok. So we are over a couple % of total time. Let's try to see if we can train our current model. Of course, there is no guarantee that the model will perform better... I will give you instructions on how to perform the data extraction and training. I have one concern, which is: if we include the cost model, what will happen in 6 or 12 months when we add new kernels to the mix? I will need a Cortex-A57, or will need you to be around to retrain. Cortex-A57 boards are pretty expensive from what I can see; I'm trying to find out whether one of our teams has a Jetson Nano collecting dust in a drawer somewhere.

In the meantime:

1/ First you need to run data collection. For the devices that I own, it is driven by CI and run with this command line: https://github.com/sonos/tract/blob/main/.travis/cost_model_task_build.sh#L48

cost_model ds --size 10000 data.txt

This will take a couple of days. Make sure your device will not overheat and throttle down to a lower frequency; I often put a small USB fan in front of devices that only have passive heat dissipation when I run these...

2/ Iterate a bit over training. You already found the PyTorch training script. You can look at the runme.sh script that deals with the "plumbing" around the Python script, pulling data from S3 and writing the model to the right place, but you'll be calling the Python script directly. Before that, you will need to insert a few "sanity boundaries" in the script: we train 15 models and pick one that makes some valid decisions, encoded in the script here: https://github.com/sonos/tract/blob/main/linalg/cost_model/train/train.py#L294 So you need to add a Cortex-A57 section there with what you found manually that matters. If you put too many constraints, you may not obtain a working model, so there's a bit of trial and error here. The model is tiny, so training time is not a concern.
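Keeping only the pieces spelled out above (the training script's own arguments are not reproduced here; see runme.sh for those), the overall flow looks roughly like this:

```
# 1/ on the target device: collect the timing dataset (takes a couple of days)
cost_model ds --size 10000 data.txt

# 2/ on a workstation: feed data.txt to linalg/cost_model/train/train.py,
#    after adding a Cortex-A57 "sanity boundaries" section to it
```

For the sanity boundaries, the natural starting point is the manual findings from earlier in this thread, e.g. requiring the chosen model to prefer 16x4 over 64x1 for the 32x128x1 shape.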

VariantXYZ commented 1 year ago

I have one concern, which is: if we include the cost model, what will happen in 6 or 12 months when we add new kernels to the mix?

I think it would be nice if the method to train for a particular CPU + model were documented well enough that it could be done per use-case, and whatever is checked into the repository would be good enough as a general case. To be honest, the gains are noticeable but overall minimal.

kali commented 1 year ago

I agree in principle. But I'm better at finding reasons against documenting than at actually documenting. :) Here, I could argue that the model-based costing is still experimental, that I may need to change the model, so I don't want to multiply the trained models, so out-of-order CPUs are excluded, etc. I will try to find the motivation to at least pull the notes in this thread together as makeshift documentation.

This is getting me curious though. Maybe a common model could outperform the heuristics on all out-of-order Cortex cores. And maybe also on the Apple CPUs (or one model for out-of-order Cortexes, and one for Apple Silicon).

Another approach I was thinking about would be to allow overriding the choices on a per-application basis: knowing beforehand that you distribute a model for a given application on a given device, you would put hints beside the model to make sure tract picks the right implementation. (Maybe this is actually what you are implying in your last comment.)

Finally, on the good news front, one colleague of mine actually had an idle Jetson Nano, so I may be able to handle the A57 cases in the same semi-automated way that I'm dealing with A53 and A55.

VariantXYZ commented 1 year ago

Maybe this is actually what you are implying in your last comment

Yep. I was trying to think of a way to do this automatically but couldn’t think of a clean way, so I fell back on “document it and let a user do it”.

By the way, I am running the cost_model data collection now and it'll be done Saturday… probably. It seems to be locked to one core, though; is there any way to run it multi-core, or any benefit to doing so?

kali commented 1 year ago

Running on a single core is more or less by design. tract's executor is single-core anyway, so this is closer to the "production" setting. I also hope it helps with overheating (but I'm not sure about that). Finally, it leaves three cores for the OS to play with, so anything coming out of cron or the like should not disrupt the measurements too much.

VariantXYZ commented 1 year ago

One thing I didn't catch yesterday: the 64x1 kernels are not included in the impls list by default. I had added them when doing measurements, so I ran the cost_model dataset collection with them included (otherwise the 'time' test wasn't comparing them correctly).

What this would imply is that the operation in question is actually f32_mmv, which is forced to use 64x1 and doesn't really use the cost_model at all AFAICT: https://github.com/sonos/tract/blob/main/linalg/src/arm64.rs#L125

Also, when running a profile on some models as-is, I noticed my number-one time consumer was actually the 64x1 kernel anyway (taking up ~50% of my total time).

So I wonder if this would really help at all, if the model is only used for 'mmm' ops?

VariantXYZ commented 1 year ago

So I wonder if this would really help at all, if the model is only used for 'mmm' ops?

(I just decided to modify lib.rs to skip the mmv call and just go straight to using 'mmm' even for n = 1)

put hints beside the models to make sure tract picks the right implementation. (Maybe this is actually what you are implying in your last comment.)

One other thought: rather than bothering with 'hints' at all, if a particular use-case and platform are known, it should be fine to just provide something like a giant match over known sizes.

Given just the info in this issue, it should be possible to take a model (or set of models), generate a list of all the (m, n, k) triples it uses, pass that list to cost_model for timing and 'optimal' kernel selection, and generate a match statement from the result, with a platform-specific fallback. No need to rely on a model's approximations when your use-case is mostly known.

From tract's end, supporting this would just require exposing some way to define a 'pick' function for kernels... though ideally this would be done in a way that doesn't violate tract's platform-agnostic APIs. I wouldn't mind trying to implement something like this, but what do you think @kali?
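(To make the idea concrete with the shapes measured earlier in this thread: such a match could send 32x128x1 to the 16x4 kernel and 1x256x1 to the 8x8 kernel, with anything unlisted falling back to the current heuristic.)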

kali commented 1 year ago

About the mmm/mmv... mmv is actually a harder problem to optimise, as most CPU and SoC vendors target the mmm use case: square-ish products have a higher multiplication-to-memory-access ratio. So for the mmv case, it's very hard to get high gflops throughput because the cpu will always be waiting on the memory.

But NNs are full of mmvs, so I chose to make them a special case and optimise them as much as I could anyway.

So it is possible that for skinny ops (n=2 or 3), calling the mmv once per column may be better, provided it does not incur an extra data permutation (I'm not sure how this works anymore, I would have to check). Specifically for n=2, the n=4 kernel will "waste" half of its time doing products on two columns nobody cares about.

For the hints, I think listing all the (m, k, n) (and maybe dt) along with the wanted choice is valid...

Trying to think a bit about how to implement this...

I welcome contributions, and I think this is a very valid one. I just want to warn you that I am deep in an epic refactoring of the matrix multiplication code in tract-core (the main idea being to replace the decluttered mir-matmul by a form of EinSum). I don't know yet when this will land; I keep finding new stuff along the way that I prefer to fix first. It should not change the interaction with linalg, though.

VariantXYZ commented 1 year ago

linalg::Ops() already exposes a list of possible implementations for the current architecture (the one we use in cost_model). We could just have tract-core go through the list and pick a pre-decided kernel instead of calling mmm().

This was what I was thinking; I'll think on it a bit more myself. Even passing in an optional list as part of model initialization would be a pretty 'clean' way to do it externally (a different list per platform, configured on the user's end).

So for the mmv case, it's very hard to get high gflops throughput because the cpu will always be waiting on the memory.

The dataset finished generating, but unsurprisingly, even with various constraints, I've somehow made applications ~50% slower (the 'accuracy' of the model is at best around 75%).

Profiling leads me to believe the 64x1 kernel is my number-one bottleneck in a lot of cases (entirely memory bound). In fact, with some models (I've been testing DeepFilterNet2, as I have an interest in it), the 64x1 kernel accounts for more than 50% of my total time. It's also usually the optimal kernel for mmv, as you surmised earlier, so this makes sense.

Out of curiosity, what's the reasoning behind 64x1 in particular as opposed to a smaller block like 16x1?

kali commented 1 year ago

DeepFilterNet2 is very challenging if you have a low-latency use case and need to run frame by frame. Grouping frames together (like pulses of 4 or 8) will improve performance for the convolution parts (the GRU will be more or less unaffected).

Reasoning behind the 64x1 is... mostly amortize the kernel setup overhead time I guess. 64 lanes implies 16 "useful" cycles to run the multipliers; this starts to feel relatively small compared to the rest, even if it can generate a bit of waste.

VariantXYZ commented 1 year ago

Reasoning behind the 64x1 is... mostly amortize the kernel setup overhead time I guess.

Viewing the results I got:

32x128x1 (5.3% of total time) uses 64x1 but 16x4 is 36% better.

1x256x1 (1.2% of total time) uses 64x1 but 8x8 is 68% better.

... I've somehow made applications ~50% slower (the 'accuracy' of the model is at best around 75%).

On the topic of the cost_model, maybe it would be worth having a 'pick' function for mmv as well, perhaps with a different model or hint list altogether. The 64x1 kernel seems to be the 'best' kernel in most cases, so using a trained model that picks the 64x1 kernel less often clearly results in worse performance...

Grouping frames together (like pulses of 4 or 8) will improve performance for the convolution parts (the GRU will be more or less unaffected).

Sorry if this should be obvious, but I'm not really sure what you mean by 'pulse' in this case (I noticed it in DFN2 as well, but it wasn't obvious to me what a pulsed model was). I did happen to notice that retraining the model for a larger frame affected performance almost linearly (jumping from 10ms -> 20ms doubled performance, almost perfectly), but I didn't really change any parameters related to how tract was being set up.

kali commented 1 year ago

These findings are interesting... I never spent much time on the tiny-m, tiny-n case, but maybe it's time :) I think what we are observing here is memory access domination: a kernel tile run performs m*k*n products and k*(m+n) memory accesses. So with global m=1 and n=1, both 8x8 and 64x1 can do the job in a single pass and perform the same number of products, but the 64x1 does 65 memory accesses per k step while the 8x8 does 16...
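(As a sanity check of that reasoning against the 1x256x1 case measured earlier: both kernels perform 256 products, but the 64x1 tile makes about 256 * 65 = 16640 memory accesses versus 256 * 16 = 4096 for the 8x8, roughly 4x the memory traffic, which is at least directionally consistent with the 68% gap observed.)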

Maybe we should add a couple of kernels to the mix at this stage. Maybe a 16x1, a 4x1 and a 1x1 could make sense (or a totally specific treatment for the 1x1).

I'm gonna ping @Rikorose at this point, he may be interested too.

About the pulsing now. I wish it was obvious, but it's not :) From this conversation, I'm assuming you're working on an interactive or real-time sound processing application. In this context, at some point you have to translate your training network into a form suitable for streaming evaluation. The training form operates on the full-length signal, while the streaming form processes frames in chunks that I call pulses in the tract context. Transforming the network can be done from the API at application loading time, or beforehand at "model-cooking" time, on your developer workstation, to get a shorter startup time. In that case, you typically use "--pulse" on the command line, dump the network in tract-extended NNEF, and ship this NNEF file to the device with the application.

In its simplest form you pulse 1 frame at a time, but some applications can tolerate a bit of extra delay. In that case, you may choose to process more than 1 frame at a time; this number is the pulse width. Going from 1 frame to 4 frames means you will call tract's State run method 4 times less often, so you go through the model 4 times less often, and the memory stress induced by moving the big weight tensors around is reduced by 4. You will also unlock the x4 family of kernels. Most of the time the pulse size goes to the n axis of the products.
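(Concretely, if a per-frame product was m x k x 1, a pulse of 4 turns it into m x k x 4 per call, which is what unlocks the 16x4-style kernels discussed earlier.)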

I hope this helps.

VariantXYZ commented 1 year ago

I think what we are observing here is memory access domination

Indeed, a profile of these shows the dominating factor is the loads (more so than for the other kernels).

Maybe we should add a couple of kernels to the mix at this stage. Maybe a 16x1, a 4x1 and a 1x1 could make sense (or a totally specific treatment for the 1x1).

This seems like the sane move, as memory bottlenecks seem to dominate. In this case, even a naive heuristic of just picking kernels based on m would probably be a noticeable improvement.

About the pulsing now.

Got it, appreciate your explanation.

VariantXYZ commented 1 year ago

(Off-topic)

Out of curiosity I did try to pulse the DeepFilterNet2 encoder model with what I could glean from random notes:

tract enc.onnx --onnx-ignore-output-shapes -i 1,1,S,32,f32 -i 1,2,S,96,f32 --nnef-tract-core --nnef-tract-pulse dump --nnef-tar enc.nnef.tar

This seems to dump fine, but throwing --pulse 1 causes it to fail with "No serializer found for Node #2 Conv_18.pulse-pad".