sonos / tract

Tiny, no-nonsense, self-contained, Tensorflow and ONNX inference

DepthWise conv Inner loop f16 support #916

Open VariantXYZ opened 1 year ago

VariantXYZ commented 1 year ago

https://github.com/Rikorose/DeepFilterNet/pull/211#issuecomment-1353637586

Digging a bit into why I was seeing so many f32/f16 conversions despite the A55 supporting fp16 storage and arithmetic, it seems this is just a limitation of Rust's f16 support.

To fully take advantage of FP16, I think avoiding these conversions is necessary… though, I’m not sure what the best solution is…

Maybe just rewriting the inner loop in assembly for f16 when the CPU says it supports f16?

Overriding the operators in the half crate might work too.

kali commented 1 year ago

Yeah, this is right: Rust just does not have f16 support, so we must do it by hand. Assembly is one way, and we already have infrastructure in linalg for it, but intrinsics may work too. It would be a new "thing" in tract, so it needs a bit of thinking, but overall it may be less awful than going all the way through assembly, and we have a couple of other operators where intrinsics could help at a lower cost than assembly.

I would be tempted to have them in linalg: the idea is that this is where the platform-dependent code lives, and linalg compiles and optimizes pretty fast, which makes the dev iteration loop significantly more pleasant.

Do you already have some kind of list of the ops that should be ported to f16?

VariantXYZ commented 1 year ago

Do you already have some kind of list of the ops that should be ported to f16?

DeepFilterNet2 (a very convenient example) is primarily bottlenecked by this function https://github.com/sonos/tract/blob/main/core/src/ops/cnn/conv/depth_wise.rs#L209 though I noticed some others. Every add/mul instruction on F16 requires a conversion to FP32.

I would be tempted to have them in linalg: the idea is that this is where the platform-dependent code lives, and linalg compiles and optimizes pretty fast, which makes the dev iteration loop significantly more pleasant.

To be honest, for something like this, I would suggest handling F16 in C while Rust support is lacking, and just exposing a C API that gets called from Rust... but that probably isn't ideal for tract's general organization. In this case, just handwriting assembly per platform is probably fine.

VariantXYZ commented 1 year ago

@kali ,

I tried looking into doing some of the inner loop in assembly but had a few questions:

            for (ix, v) in iter {
                let k = *kptr.add(ix);
                let i = *iptr.offset(v);
                sum = sum + k * i;
            }

looks like a simple dot product so I figured I could write this in C or assembly and just add it to the build.rs script in linalg, then expose that somehow, but I'm not sure exactly how this would best work. Since linalg already uses 'cc' for the assembly files, I figured this was the place to do it.

The benefit of C, in my opinion, is that as long as LLVM/clang is correctly passed a particular CPU/feature set, it will automatically handle conversions as necessary with the standard _Float16 type if the target doesn't support it. When the target does support it, there should be no unnecessary conversions, and in the worst case it will be as fast as the current Rust code for this loop anyway.
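For illustration, a sketch like this (not tract code, just the shape of it) compiles to straight fp16 arithmetic when clang targets a CPU with fp16 support, and falls back to compiler-generated conversions otherwise:

#include <stddef.h>

/* Sketch only: a plain f16 dot product. With something like
 * -march=armv8.2-a+fp16 (or an equivalent -mcpu), clang keeps the
 * whole loop in half precision; without it, the compiler inserts the
 * f16<->f32 conversions itself, which is no worse than what the Rust
 * loop does today. */
_Float16 f16_dot(size_t n, const _Float16 *a, const _Float16 *b) {
    _Float16 sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}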

Your current setup with mmm involves just using a function selector and type information, so I'm wondering if that's the correct route? There are other operators that would benefit from this as well, so I'm not sure on the best way to go about doing this.

kali commented 1 year ago

As you noticed, beyond being a very relevant addition in itself, this is also a sandbox for other operations that could benefit from having platform-specific and/or optimized code.

So I just want to make a couple of things explicit: tract assumes a "compile once, run everywhere" model, within a given CPU family like "x64" or "arm v8.x". So we ship several variants of the same code, some optimized one way, some optimized another way, and some dumbed down. In assembly this comes relatively naturally. Transposing this to the C world is not super hard: we need to compile variants of the same function, or the same function varying the compiler CPU feature flags, and alter the function name according to the setup. Then at runtime the function selectors can check what's available and make as good a choice as practical depending on the CPU and possibly some hyperparameters of the operation.
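To make that concrete, here is a rough sketch (names made up, nothing of this exists in linalg): the same C source compiled several times by build.rs with different feature flags and a different suffix, so the runtime selector can pick among the resulting symbols.

#include <stddef.h>

/* Rough sketch: compile this file several times with different
 * -march/-mcpu flags and -DVARIANT_SUFFIX=..., so each build emits a
 * differently named symbol for the runtime selector to choose from. */
#ifndef VARIANT_SUFFIX
#define VARIANT_SUFFIX _generic
#endif
#define CAT_(a, b) a##b
#define CAT(a, b) CAT_(a, b)
#define VARIANT(name) CAT(name, VARIANT_SUFFIX)

float VARIANT(sdot)(size_t n, const float *a, const float *b) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}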

To be honest, I don't much like the idea of bringing in C. But I think it's the pragmatic way to go for cases where the Rust intrinsic route is not possible (as far as I can tell, f16 intrinsics are not baked in yet). I don't think we should invest too much in the C route, and I don't think we should go beyond simple implementations (as in, no C intrinsics). If we want to go further and take advantage of manual vectorisation, we have the assembly route and the Rust intrinsic route (not for f16 at this stage).

We want to contain all of this complexity inside linalg. So we may also need to iterate a bit over the right interface to expose to tract-core. At least if the first round of implementations is in C, it should be relatively easy to adjust...

What we want here is a dot product with arbitrary offsets and possibly a choice of types: having an f32 dot product would make sense too (that's the mmm 1.k.1 case). That's what xDOT offers in BLAS, so we must not be too far from the truth.

https://netlib.org/lapack/explore-html/df/d28/group__single__blas__level1_ga37a14d8598319955b711af0d64a6f56e.html
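For reference, the CBLAS form of that routine is roughly:

float cblas_sdot(const int N, const float *X, const int incX,
                 const float *Y, const int incY);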

VariantXYZ commented 1 year ago

Got it, so it looks like writing a quick dot product for fp16/fp32 in C is fine for now.

My next question is hopefully a more straightforward one: Are there any guarantees that the data is contiguous, or the increments are constant?

            for (ix, v) in iter {
                let k = *kptr.add(ix);
                let i = *iptr.offset(v);
                sum = sum + k * i;
            }

Hopefully I'm not misunderstanding, but I'm also a little confused as to why kptr is directly added to while iptr requires an offset. I understand the difference between the functions, but just found it a bit odd that we have a direct pointer offset and an index from the same iterator.

kali commented 1 year ago

Mmm... you're right, it's a bit more complicated than BLAS xDOT, because there is an indirection.

The idea here is that inner_loop is called for each value to compute in the output, with the visitor set up beforehand. Depending on the location of the output value (are we in the central valid zone? on an edge or corner?), valid_offset_ker_in iterates over the valid pairs of kernel and image values.

For instance, take a 2D 3x3 depthwise convolution over an NCHW image. When we are away from boundaries, the iterator will successively give us (0, -W-1), (1, -W), (2, -W+1), (3, -1), (4, 0), (5, 1), (6, W-1), (7, W), (8, W+1). When we are somewhere in the middle of the left border, we will just get (1, -W), (2, -W+1), (4, 0), (5, 1), (7, W), (8, W+1).
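If it helps, the central-zone pairs above are just the flattened kernel index together with dy*W + dx; a tiny illustration (not tract code):

#include <stdio.h>

/* Tiny illustration of the central-zone (kernel index, image offset)
 * pairs for a 3x3 kernel over an HxW channel. */
int main(void) {
    int W = 10; /* arbitrary image width */
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            int kernel_ix = (dy + 1) * 3 + (dx + 1);
            int image_off = dy * W + dx;
            printf("(%d, %d) ", kernel_ix, image_off);
        }
    }
    printf("\n"); /* prints (0, -11) (1, -10) (2, -9) (3, -1) (4, 0) ... */
    return 0;
}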

The actual computation of the in-image offset can be more complicated (if C is the inner axis for instance), but the good news is that it is constant over the image, modulo the edge masking that the visitor handles for us.

The iterator does not make our job easier here; we will want to call the new function with buffers. We can try a strict replacement, passing scanner.zone.values_offsets directly:

void f16_dot_with_bases_and_offsets(float16 *sumptr, size_t count, float16 *kbaseptr, float16 *ibaseptr, size_t *offsets_pairs)

See https://github.com/sonos/tract/blob/58d22b6386961fa3ad6ad97c2b0e4727662daec6/core/src/ops/cnn/patches.rs#L414 for what the iterator does: we can handle the input_center_offset directly on top of ibaseptr.

Alternatively, we could compute the addresses Rust-side, getting a somewhat more generic function at the price of managing extra buffers...

void f16_dot_ptrs(float16 *sumptr, size_t count, float16** kptrptr, float16** iptrptr)

I think we should experiment with the first option first. WDYT ?
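To give an idea of what the first option's body could look like, here is a rough sketch (assuming the (usize, isize) pairs from values_offsets arrive as one interleaved array, with the image offset signed; nothing of this exists in linalg yet):

#include <stddef.h>

/* Rough sketch of option one. offset_pairs holds count interleaved
 * (kernel offset, image offset) pairs; the image offset can be
 * negative, hence ptrdiff_t. */
void f16_dot_with_bases_and_offsets(_Float16 *sumptr, size_t count,
                                    const _Float16 *kbaseptr,
                                    const _Float16 *ibaseptr,
                                    const ptrdiff_t *offset_pairs) {
    _Float16 sum = *sumptr;
    for (size_t n = 0; n < count; n++) {
        ptrdiff_t k_off = offset_pairs[2 * n];     /* into the kernel */
        ptrdiff_t i_off = offset_pairs[2 * n + 1]; /* into the image  */
        sum += kbaseptr[k_off] * ibaseptr[i_off];
    }
    *sumptr = sum;
}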

VariantXYZ commented 1 year ago

Will try the first option.

To clarify though, offset_pairs would be a 2D size_t array right? And you'd need to pass input_center_offset, right?

VariantXYZ commented 1 year ago

Ah, one other quick thing: for fp16, it looks like the dummy fmla file for pragma testing should specify fullfp16 for fp16 arithmetic support (fp16 on its own apparently only guarantees storage).

kali commented 1 year ago

Will try the first option.

To clarify though, offset_pairs would be a 2D size_t array right? And you'd need to pass input_center_offset, right?

offset_pairs should match what is coming from values_offsets, so we can pass it unmodified, without needing intermediate buffers. I'm not 100% sure if the layout is guaranteed here or if we need to sprinkle some repr(C) in patches.rs...

https://github.com/sonos/tract/blob/58d22b6386961fa3ad6ad97c2b0e4727662daec6/core/src/ops/cnn/patches.rs#L354

kali commented 1 year ago

Ah, one other quick thing: for fp16, it looks like the dummy fmla file for pragma testing should specify fullfp16 for fp16 arithmetic support (fp16 on its own apparently only guarantees storage).

Ho, damn, I thought I had found a working combination here. Did you find an example that is not working? Because I have some toolchains that have proven a bit finicky in the past and that I can't break, so I'm gonna move with lots of caution on this one (=> separate PR if you need to change it).

VariantXYZ commented 1 year ago

Did you find an example that is not working?

Just building with the latest nightly with llvm/clang 15 for cortex-a55 seems to trigger this warning. I haven't tried with rustup's installed default yet.

I've managed to do the first step of moving the code to linalg in a 'dotprod' implementation, but now I'm stuck trying to figure out the best way to support it generically and fall back on implementations as available.

I could do an implementation for every type, or use nightly specializations (probably not happening if I want it to merge), or do it the same way as your mmm setup, with a generic implementation that has an effective vtable.

kali commented 1 year ago

Well done. Benched it already ?

Indeed, nightly is a no-go. Unless I'm mistaken, our case is pretty simple here, the only selection parameter being the type. So instead of mimicking the mmm setup, maybe the simpler sigmoid/tanh variant could be used (exposing the function directly, without a complicated selector).

VariantXYZ commented 1 year ago

Ah, sorry for misleading you. No, I haven't actually gotten that far; I was getting stuck trying to figure out how to properly dispatch it.

It's actually a bit silly... I simply moved the generic functionality to linalg/generic/dot.rs and am calling it directly. I see the use of the element_wise macros in core/ops/math, but I've been trying to wrap my head around the dispatch setup and how I can force the inner_loop function to specifically fall back on the f16 implementation for testing.

https://github.com/sonos/tract/compare/main...VariantXYZ:move_dot_prod

I'm not too worried about adding the external calls to C, but right now what I'm a bit lost on is how to have inner_loop dispatch properly based on type.

VariantXYZ commented 1 year ago

(For now I'll just force it to call into an external C function for testing)

kali commented 1 year ago

Mmm... I don't think the dispatch_* macros will help you there: they are useful when you find yourself in some code which is abstract over Tensor and DatumType, not type-parametric, and you need to switch to actual DatumType-monomorphized function instances.

I think we have several options here.

VariantXYZ commented 1 year ago

I'm going with the first option for testing, but I think the main issue is actually going to be with offsets, as I don't believe Rust offers any FFI guarantees on a slice of tuples.

I might just convert the tuple into a struct with repr(C) instead and hopefully that won't cause too much trouble.

VariantXYZ commented 1 year ago

@kali , I ended up just going with a bit of a lazy solution for now. https://github.com/sonos/tract/compare/main...VariantXYZ:move_dot_prod

I did verify that the output file correctly contains half-width floating point instructions for the A55, but the computed output seems completely wrong... I'm sure I have some silly mistake somewhere.

The allocations for splitting up the vector are unsurprisingly very painful too; any gains we might've gotten would be demolished by the memory operations in such a tight loop. I think the easiest way to handle this is changing value_offsets to a repr(C) tuple structure and having that defined on the C side as well.
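On the C side that would just mean mirroring a hypothetical #[repr(C)] version of the pair, something like:

#include <stddef.h>

/* Hypothetical C-side mirror of a #[repr(C)] struct replacing the
 * (usize, isize) tuples in value_offsets; only valid if the Rust side
 * is actually changed to that layout. */
struct value_offset_pair {
    size_t kernel_offset;   /* maps to usize */
    ptrdiff_t input_offset; /* maps to isize */
};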

Even without the fp16, something in that latest commit seems to cause a slight slowdown. If I had to guess, it's the Datum type check in the inner_loop.

VariantXYZ commented 1 year ago

Some good news though:

I commented out the C dot product function and the allocations, and instead just substituted the multiply and add with a quick madd_f16 function written in C (just passing the pointers after handling their offsets); this led to a significant speed-up. Almost at parity, but it should be better...
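For reference, the madd_f16 I tried is about as trivial as it sounds (reconstructed sketch):

/* Reconstructed sketch of the madd_f16 experiment: one multiply-add
 * per call, with all offset handling already done on the Rust side. */
void madd_f16(_Float16 *sum, const _Float16 *k, const _Float16 *i) {
    *sum += *k * *i;
}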

A profile showed a large chunk of time still being spent on conversions, and they seem to be happening in core/src/ops/binary.rs in "eval_unicast_in_place".

I think I might have to back up and ask for your help with integrating at this stage. At least as a proof of concept, there's definitely potential here. We just need to avoid casting and unnecessary allocations in tight loops.

Just as an afterthought, I think some of the experimentation with static instead of dynamic dispatch shows nontrivial gains with a model like DeepFilterNet, which runs at high frequency. Depending on the use case, it might be worth investigating...

kali commented 1 year ago

I'm surprised by the need to multiply by sizeof here. https://github.com/sonos/tract/compare/main...VariantXYZ:move_dot_prod#diff-9fd28fecd20773f8aeba8df47573913c130ce5dce2179af59463a1f57558ca4eR14

Maybe the source of confusion here is about the add and offset semantics on Rust pointers: they do the same thing, add operates on a usize and offset on an isize. And they both do the same thing as C pointer arithmetic, counting elements, not bytes. So the problem may just be the * sizeof(_Float16).
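In other words (illustrative only):

/* Illustrative only: pointer arithmetic already scales by the element
 * size, in C just as with Rust's add/offset. */
_Float16 third_element(const _Float16 *base) {
    return *(base + 2); /* advances by 2 elements, i.e. 4 bytes */
    /* *(base + 2 * sizeof(_Float16)) would land on the 5th element */
}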

I don't foresee any big issue in replacing the tuple by a repr(C) struct. Still, if we want to do this, we need a preliminary PR: that way I can run my private benches and check whether it has an impact on other parts (like the regular convolutions).

Another way is to use some "scratch" space so we don't have to perform allocations. Managed at the outer loop level, and used by the inner loop.

The madd_f16 experiment is an interesting datapoint, but it will be hard to compensate for a function call for every single elementary operation, so let's keep this as a fallback option. We haven't exhausted the options for dispatching at the dot level.

My intuition says that the sizeof() thing is the bug. If it's not, we will find it anyway, it's a three-line function. So, I think, first we make it work, regardless of perf, with buffer allocations in the inner_loop method. Then we try replacing the tuple by a repr(C) struct and I can check it has no impact on performance anywhere else. With these two checks done we can move on to integration proper. WDYT?

VariantXYZ commented 1 year ago

My intuition says that the sizeof() thing is the bug.

It was, I think I'm in a holiday mindset already.

I don't foresee any big issue in replacing the tuple by a repr(C) struct.

An alternative is to just not deal with it as a tuple and have the values separate (which would simplify the C implementation, since there'd be no need for a structure definition either).

The madd_f16 experiment is an interesting datapoint, but it will be hard to compensate for a function call for every single elementary operation, so let's keep this as a fallback option. We haven't exhausted the options for dispatching at the dot level.

I think this is mostly just verifying that there are improvements to be had, this was just a way of testing without the unnecessary vector allocations in a tight loop.

Pushed it up and created the PR for your evaluation. Feel free to close and cherry-pick or use the implementations how you will. No preference here, I think I won't be able to work on it until next year anyway. I really only tested with llvm/clang, and I didn't test a Windows binary either.

kali commented 1 year ago

🎄

VariantXYZ commented 1 year ago

@kali ,

Happy New Year! I'm back but honestly I don't know what the next action item is here.

The PR 'works' for the platforms I was testing, but seems to fail on others... I'm a bit unsure of the best way to move forward, as my gut feeling is that it would just need some modifications to the architecture to avoid repeated casting and allocations and skip the f16 build entirely when it's not supported, but I think this is a bit out of my scope of knowledge.

VariantXYZ commented 1 year ago

Hi @kali ,

I started digging back into this, and found a few interesting things:

So I have two questions:

VariantXYZ commented 1 year ago

Also, I wanted to see about testing some C implementations of mmv operators of different sizes to compare to the 128x1 case, but I can't seem to figure out a simple way to just call a gemm function like:

void mmv_f16(int64_t m, int64_t k, const _Float16 *a, const _Float16 *b, _Float16 *c)

Since the current MMMKernel implementation is a bit involved and expects tiling... I'm probably missing something really simple. Any ideas on how I could try testing this?
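(For context, the naive reference I have in mind is just this, purely for timing comparisons:)

#include <stdint.h>

/* Naive reference for the signature above, purely for timing
 * comparisons: c[m] = a[m x k] * b[k], with a stored row-major. */
void mmv_f16(int64_t m, int64_t k,
             const _Float16 *a, const _Float16 *b, _Float16 *c) {
    for (int64_t row = 0; row < m; row++) {
        _Float16 acc = 0;
        for (int64_t col = 0; col < k; col++) {
            acc += a[row * k + col] * b[col];
        }
        c[row] = acc;
    }
}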

kali commented 1 year ago

About inline(never): I'm using this sometimes because in some situations the compiler picks the wrong variables to keep in registers and the wrong ones to put on the stack. Separating the function puts an explicit boundary there, and hopefully the variables in the tight loop have a better chance of being left in registers. Changing the inline may or may not have an impact on the other scenarios; these things tend to vary with the architecture (because register pressure on ARM and Intel is very different) and with the compiler version...

About making the array contiguous now: that is the strategy for regular convolution (classical Im2col, merged with packing in tract, to get the data in the best shape possible for consumption by the mm kernels). It does pay off in the regular conv case, because we move the data once and use it repeatedly for each output channel. It's much less clear in the DepthWise case, as each channel is only used once. Basically, the convolution/matrix multiplication case is dominated by arithmetic operations, and we find ourselves in the "happy zone" for which CPUs have been sized, roughly saturating both the arithmetic and memory units. DepthWise does a good job of reducing algorithmic complexity, so the ALU is no longer saturated, but we now find ourselves constrained by the memory accesses...

There is actually an easy test to help get an intuition for this: modify ConvUnary so the DepthWise case is ignored and the code goes to one of the generic Im2col approaches. That means you will actually put the data in the best possible layout, then call the actual multipliers.

VariantXYZ commented 1 year ago

This may be a silly issue on my part, but one thing that I ran into when rebasing DeepFilterNet's half-float support branch onto the latest update to support tract 0.19.0 is this error about the convolution input and weight types not matching:

Error: Translating node #6 "Conv_20" ConvUnary HalfTranslator

Caused by:
    0: wiring Conv_20 (ConvUnary { pool_spec: PoolSpec { data_format: NCHW, kernel_shape: [1, 3], padding: Explicit([0, 1], [0, 1], false), dilations: Some([1, 1]), strides: Some([1, 2]), output_channel_override: Some(64) }, kernel_fmt: OIHW, kernel: 64,1,1,3,F16 -0.046844482, -0.18811035, -0.18762207, -0.1361084, -0.113220215, 0.0025100708, -0.011001587, -0.13391113, -0.2631836, 0.09509277, -0.11694336, -0.16540527..., group: 64, bias: None, q_params: None }), determining output_facts
    1: in output_facts invocation
    2: Convolution input and weights must have the same type. (resp F32 and F16)

The rebase was a bit messy, but it seems like everything was as it was in @Rikorose's original branch here (I removed the half_float conditionals while trying to isolate the issue), which was working in 0.18.x.

fn init_encoder_impl(
    mut m: InferenceModel,
    df_cfg: &ini::Properties,
    n_ch: usize,
) -> Result<TypedModel> {
    log::debug!("Start init encoder.");
    let s = m.symbol_table.sym("S");
    let dt = f32::datum_type();

    let nb_erb = df_cfg.get("nb_erb").unwrap().parse::<usize>()?;
    let nb_df = df_cfg.get("nb_df").unwrap().parse::<usize>()?;
    let feat_erb = InferenceFact::dt_shape(dt, shapefactoid!(n_ch, 1, s, nb_erb));
    let feat_spec = InferenceFact::dt_shape(dt, shapefactoid!(n_ch, 2, s, nb_df));

    m = m
        .with_input_fact(0, feat_erb)?
        .with_input_fact(1, feat_spec)?
        .with_input_names(["feat_erb", "feat_spec"])?
        .with_output_names(["e0", "e1", "e2", "e3", "emb", "c0", "lsnr"])?;
    m.analyse(true)?;

    let mut m = m.into_typed()?;

    m.declutter()?;
    let pulsed = PulsedModel::new(&m, s, &1.to_dim())?;

    log::info!("Init encoder");

    let mut m = pulsed.into_typed()?;   
    use tract_core::model::translator::Translate;
    m = tract_core::half::HalfTranslator.translate_model(&m)?;
    m = m.into_optimized()?;

    Ok(m)
}

VariantXYZ commented 1 year ago

One thing I think I forgot to mention:

The cross-language LTO approach has the benefit of allowing for full FHM support (FMLAL!). I think this is rather significant. I haven't been able to figure out a way to do this cleanly in the half-rs crate to submit there...

I suppose it's not too big an issue for tract though, as the FMLA instructions are handled with assembly anyway.