google / XNNPACK

High-efficiency floating-point neural network inference operators for mobile, server, and Web

End to end model speedup results #4266

Closed Darshvino closed 1 year ago

Darshvino commented 1 year ago

Hi XNNPACK team,

Is there any way to get end-to-end model speedup results directly, without building TFLite with XNNPACK?

Basically, I just want to use XNNPACK and get model-level performance results. I am also curious about the numbers in the README (https://github.com/google/XNNPACK#raspberry-pi) for end-to-end model performance: did you build TFLite with XNNPACK to get those model-level results, or were they obtained directly from XNNPACK?

It would be great if you could briefly describe the procedure for benchmarking a model's performance with XNNPACK.

Thanks

Darshvino commented 1 year ago

Hi @ngzhian @Maratyszcza,

Do we have any updates on the above?

Thanks!

ngzhian commented 1 year ago

Hi, you can run the benchmarks in https://github.com/google/XNNPACK/blob/master/bench/end2end.cc with bazel run :end2end_bench; they cover MobileNet V1, V2, and V3 for various data types.

Darshvino commented 1 year ago

Hi @ngzhian,

Thank you for your kind reply.

1.) How can I run the benchmarks on other models (ResNet, etc.)?

2.) Also, I did not understand the model files in https://github.com/google/XNNPACK/tree/master/models. For example, looking at https://github.com/google/XNNPACK/blob/e2fd580527ff74af8436e6f058563dd782d4c50f/models/fp16-mobilenet-v1.cc#L24, the definition is not clear to me. Can you please give a brief description of the model scripts? It would really help me write other models in the same way for benchmarking.

Thanks

ngzhian commented 1 year ago

That file is the entire MobileNet V1 model, expanded out into manual calls to XNNPack operators. That is what TFLite does when using the XNNPack delegate.

If you don't want to use TFLite, you need to call XNNPack operators manually; that file shows how it is done. If you want to run ResNet, you need to convert the model into such XNNPack operator calls yourself. XNNPack has no ability to read TFLite flatbuffers; that is done by TFLite.

I suggest you use TFLite: https://www.tensorflow.org/lite/performance/measurement with --use_xnnpack as the argument.

Darshvino commented 1 year ago

Hi @ngzhian,

Thanks a lot for your kind reply.

I will use https://www.tensorflow.org/lite/performance/measurement for benchmarking.

Also, I have a question about how the kernel is selected while executing the model. If I want a particular kernel (the FP32 GEMM kernel with im2col) to be used for the Conv op, what needs to be modified in order for that FP32 GEMM kernel (with im2col) to execute the Conv op in the model?

PS: I want to execute the GEMM kernel the same way as here (i.e. im2col plus FP32 GEMM): https://github.com/google/XNNPACK/blob/master/bench/f32-im2col-gemm.cc

Thanks

ngzhian commented 1 year ago

The convolution operator does not currently support im2col; you will likely have to manually modify https://github.com/google/XNNPACK/blob/master/src/operators/convolution-nhwc.c#L286 to add im2col support.

Darshvino commented 1 year ago

Ohh okay, got it Zhi. It would be a great help if you could point out where exactly the indirection buffer is built and where the GEMM kernel is called here: https://github.com/google/XNNPACK/blob/master/src/operators/convolution-nhwc.c#L286

Thanks

ngzhian commented 1 year ago

The indirection buffer is set up for the IGEMM path here: https://github.com/google/XNNPACK/blob/master/src/operators/convolution-nhwc.c#L1666-L1680. You will need similar code for the GEMM im2col path. The GEMM microkernel is set here: https://github.com/google/XNNPACK/blob/master/src/operators/convolution-nhwc.c#L1560-L1618, to be called later in operator-run.c.
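
For intuition, here is a minimal sketch of what the indirection buffer holds (made-up helper name, stride 1 and no padding assumed; not XNNPACK's actual code): one pointer into the input per output pixel and kernel tap, so the IGEMM microkernel can gather its input rows without materializing an im2col matrix.

    #include <cstddef>
    #include <vector>

    // Illustrative only: with padding, out-of-bounds taps would instead point to a
    // shared zero buffer rather than into the input tensor.
    std::vector<const float*> build_indirection_buffer(
        const float* input, size_t input_height, size_t input_width, size_t channels,
        size_t kernel_height, size_t kernel_width) {
      const size_t output_height = input_height - kernel_height + 1;
      const size_t output_width  = input_width  - kernel_width  + 1;
      std::vector<const float*> indirection;
      indirection.reserve(output_height * output_width * kernel_height * kernel_width);
      for (size_t oy = 0; oy < output_height; oy++) {
        for (size_t ox = 0; ox < output_width; ox++) {
          for (size_t ky = 0; ky < kernel_height; ky++) {
            for (size_t kx = 0; kx < kernel_width; kx++) {
              // Pointer to the channel vector of input pixel (oy + ky, ox + kx).
              indirection.push_back(input + ((oy + ky) * input_width + (ox + kx)) * channels);
            }
          }
        }
      }
      return indirection;
    }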

Darshvino commented 1 year ago

Thanks a lot, Zhi for the detailed answer! I will start working on it!

Darshvino commented 1 year ago

Hi @ngzhian,

How can we run the kernel benchmarks using multiple threads?

Suppose I am running qs8-gemm-bench: how can I run the kernel with 4 threads on a Raspberry Pi?

Thanks!

ngzhian commented 1 year ago

Those benchmarks only run single-threaded. You can modify the call to xnn_create_runtime to pass in a threadpool with multiple threads: https://github.com/google/XNNPACK/blob/master/include/xnnpack.h#L1508

Darshvino commented 1 year ago

Hi @ngzhian,

Thanks a lot for your reply!

To be honest, I did not clearly understand the above comment. Could you explain in a bit more detail how to run the kernel benchmarks with multiple threads?

The benchmark I actually want to run multi-threaded is: https://github.com/google/XNNPACK/blob/master/bench/qu8-gemm.cc

Thanks

ngzhian commented 1 year ago

https://github.com/google/XNNPACK/blob/master/bench/qu8-gemm.cc does not support multithreading; those benchmarks target microkernels. Multi-threading is supported at the operator level, e.g. here: https://github.com/google/XNNPACK/blob/master/src/operators/convolution-nhwc.c#L1585.

You can try using this benchmark: https://github.com/google/XNNPACK/blob/master/bench/qu8-gemm-e2e.cc

Then you will need to change https://github.com/google/XNNPACK/blob/master/bench/qu8-gemm-e2e.cc#L60: the nullptr on that line is the threadpool, so create a proper threadpool there (something like https://github.com/google/XNNPACK/blob/master/bench/end2end.cc#L30; look at the pthreadpool API to learn more).
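
A minimal sketch of that pattern (the thread count of 4 is just an example; in the benchmark you would create the pool before that line and pass threadpool.get() wherever nullptr is passed today):

    #include <memory>
    #include <pthreadpool.h>

    int main() {
      // Create a pool with 4 worker threads, owned by a unique_ptr so it is
      // destroyed automatically; pthreadpool_create(0) would use every core.
      std::unique_ptr<pthreadpool, decltype(&pthreadpool_destroy)> threadpool(
          pthreadpool_create(/*threads_count=*/4), pthreadpool_destroy);
      return 0;
    }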

Alternatively, you can run end2end_bench: https://github.com/google/XNNPACK/blob/master/BUILD.bazel#L7015 which has multithreaded benchmarks.

Darshvino commented 1 year ago

Hi @ngzhian,

Thank you again for your detailed reply. I will follow the links you shared.

But actually, I want to do multithreaded benchmarks of the micro-kernels. Is there a way to apply the thread pool to a micro-kernel and see the timings? I only need to benchmark the micro-kernel, not the full model. Can you suggest a way to run the microkernel benchmarks with multiple threads?

Thanks

ngzhian commented 1 year ago

I want to do multithreaded benchmarks of the micro-kernels.

There is no straightforward way to do it. You would have to replicate the logic in https://github.com/google/XNNPACK/blob/master/src/operators/convolution-nhwc.c#L1585 yourself.

XNNPack is designed in a layered manner. Microkernels have a simple interface (https://github.com/google/XNNPACK/blob/master/src/xnnpack/microfnptr.h#L129) that knows nothing about threads; a microkernel just computes an MxN result. Threading is applied at a higher level, the operator level.
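
To make the layering concrete, here is an illustrative sketch (made-up names, not XNNPACK's actual code) of how an operator-level wrapper splits a GEMM's M output rows into MR-sized tiles and hands them to pthreadpool; the real logic lives in convolution-nhwc.c and operator-run.c:

    #include <cstddef>
    #include <cstdio>
    #include <memory>
    #include <pthreadpool.h>

    // Illustrative context for one GEMM; a real operator would also carry pointers
    // to the input, packed weights, output, strides, and quantization params.
    struct GemmContext {
      size_t mr;  // output rows handled per microkernel call
    };

    // pthreadpool calls this once per tile; a real operator would invoke the GEMM
    // microkernel here for rows [row_start, row_start + rows).
    static void compute_gemm_tile(void* context, size_t row_start, size_t rows) {
      const auto* ctx = static_cast<const GemmContext*>(context);
      std::printf("tile: rows [%zu, %zu) with mr=%zu\n", row_start, row_start + rows, ctx->mr);
    }

    int main() {
      const size_t output_rows = 37;  // M of some hypothetical layer
      GemmContext ctx{/*mr=*/4};

      std::unique_ptr<pthreadpool, decltype(&pthreadpool_destroy)> threadpool(
          pthreadpool_create(4), pthreadpool_destroy);

      // Split the M rows into MR-sized tiles and distribute them over the pool,
      // which is the same idea the operator level uses to parallelize microkernels.
      pthreadpool_parallelize_1d_tile_1d(
          threadpool.get(), compute_gemm_tile, &ctx,
          /*range=*/output_rows, /*tile=*/ctx.mr, /*flags=*/0);
      return 0;
    }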

https://github.com/google/XNNPACK/blob/master/bench/qu8-gemm.cc is a benchmark for microkernels, hence it has no idea about multithreading.

If you want to compare just microkernels, there is no reason to do multi-threading. Run the microkernels you are comparing with the same parameters (MR, NR, KR, etc.) by picking one particular convolution in the network you're benchmarking. Then use the qu8-gemm-e2e benchmarks to see how the performance is reflected at the multithreaded level.

You mentioned in https://github.com/google/XNNPACK/issues/4266#issuecomment-1423628640 that you want to benchmark an entire model, qu8-gemm-e2e is the way to do so.

Darshvino commented 1 year ago

Hi @ngzhian,

Thank you again for the detailed reply.

Actually, I have tweaked the uint8 kernel to implement a custom GEMM, and I am benchmarking performance at both the kernel level (one layer) and the model level. So I was curious to see the numbers for the custom kernel (one layer) with a single thread and with 4 threads on an RPi 4 board.

I think I will follow the method you suggested and add the thread pool here: https://github.com/google/XNNPACK/blob/master/bench/qu8-gemm-e2e.cc#L60

But what I was thinking is: could I just put one layer in the model config and run the inference (./qu8-gemm-e2e-bench) with the thread pool, in order to get multi-threaded numbers for that one layer?

Also, can we run the benchmark at the operator level so that the custom kernel is used for the computation of that operator?

Thanks

ngzhian commented 1 year ago

But what I was thinking is: could I just put one layer in the model config and run the inference (./qu8-gemm-e2e-bench) with the thread pool, in order to get multi-threaded numbers for that one layer?

That works, yes.

Also, can we run the benchmark at the operator level so that the custom kernel is used for the computation of that operator?

qu8-gemm-e2e-bench is the operator-level benchmark; if you look at the model definition (https://github.com/google/XNNPACK/blob/4f2dc6081d2fdda6b37fb96ccbeb6dde25ec6538/models/qu8-mobilenet-v1.cc), it is all create-operator and setup-operator calls.

Darshvino commented 1 year ago

Hi @ngzhian,

I was trying to trace the execution of an operator in XNNPACK and found that, in the end, it hits this call: https://github.com/google/XNNPACK/blob/4f2dc6081d2fdda6b37fb96ccbeb6dde25ec6538/src/operator-run.c#L1518

I have a small doubt: how does the micro-kernel (for example, any qu8 gemm kernel) get linked to the above pthreadpool call?

While tracing the execution, I also found this reference in the pthreadpool repo: https://github.com/Maratyszcza/pthreadpool/blob/43edadc654d6283b4b6e45ba09a853181ae8e850/include/pthreadpool.h#L300

Thanks.

ngzhian commented 1 year ago

https://github.com/google/XNNPACK/blob/master/src/operators/convolution-nhwc.c#L1582-L1618 sets a function pointer that pthreadpool later calls.
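
Conceptually the pattern is: operator setup stores a task function pointer plus a context in a compute descriptor, and the run step hands both to pthreadpool, which invokes the function once per index. A stripped-down illustration with made-up names (not XNNPACK's actual structs; in XNNPACK the task side is played by the compute functions in operator-run.c, which unpack the context and call the selected microkernel):

    #include <cstddef>
    #include <cstdio>
    #include <pthreadpool.h>

    // Made-up stand-in for a compute descriptor: setup fills it in, run consumes it.
    struct ComputeDescriptor {
      pthreadpool_task_2d_t task;  // function pthreadpool will call for every (i, j)
      void* context;
      size_t range_i;
      size_t range_j;
    };

    static void gemm_task(void* /*context*/, size_t group, size_t tile) {
      std::printf("group %zu, tile %zu\n", group, tile);
    }

    int main() {
      ComputeDescriptor compute{gemm_task, nullptr, /*range_i=*/2, /*range_j=*/8};

      pthreadpool_t threadpool = pthreadpool_create(4);
      // The "run" step: pthreadpool invokes the stored function pointer per index pair.
      pthreadpool_parallelize_2d(threadpool, compute.task, compute.context,
                                 compute.range_i, compute.range_j, /*flags=*/0);
      pthreadpool_destroy(threadpool);
      return 0;
    }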

Darshvino commented 1 year ago

Thanks a lot @ngzhian!.

Actually, I have defined a custom kernel named xnn_qu8_gemm_minmax_rndnu_ukernel_4x16__neon_mlal_lane_1 and added its declaration in gemm.h and microfnptr.h. The micro-kernel benchmarks run perfectly fine and I get timing numbers, but when I run the operator benchmarks (qu8-gemm-e2e-bench) after adding the kernel name to qu8-gemm-e2e.cc, it does not execute; it gets stuck here: https://github.com/google/XNNPACK/blob/4f2dc6081d2fdda6b37fb96ccbeb6dde25ec6538/src/operator-run.c#L1518

I am not sure where exactly the issue is. It would be really great if you could help me resolve it, Zhi.

Thanks

ngzhian commented 1 year ago

What do you mean by "getting stuck"? Any errors?

Darshvino commented 1 year ago

There are no errors. It just hangs while executing qu8_gemm_4x16__neon_mlal_lane_1, something like in the picture below:

[screenshot: benchmark output hanging on qu8_gemm_4x16__neon_mlal_lane_1]

It seems like the kernel is not being set up properly for the operator? But, as I mentioned previously, it executes perfectly fine in the micro-kernel benchmarks.

ngzhian commented 1 year ago

Maybe drop into a debugger or add printfs and see what's going on? It seems like an infinite loop. Likely there is an error in the implementation when the operator calls it (e.g. not correctly decrementing counters).

Darshvino commented 1 year ago

Ohh okay, but actually the kernel is not really getting executed; I mean the trace is not inside the kernel, Zhi. As I said previously, I am able to run the microkernel benchmarks correctly and the results are also correct, but I am not sure why there is an issue when running an operator. Is there any link to the init.c file?

ngzhian commented 1 year ago

actually the kernel is not really getting executed; I mean the trace is not inside the kernel, Zhi

If the kernel is not getting executed, then probably you're not specifying it in the benchmarks correctly.

It doesn't need to be in init.c. You might need to run https://github.com/google/XNNPACK/blob/master/tools/update-microkernels.py so the microkernels get picked up, and maybe https://github.com/google/XNNPACK/blob/master/scripts/generate-amalgamation.sh as well.

If you don't get a compile error, it is probably fine.

The init.c file doesn't need to be changed.

Darshvino commented 1 year ago

Hi @ngzhian,

Thank you!

I also strongly believe that the kernel is not getting picked up correctly.

I am on an old commit of the repo (I think about 4 months old); can update-microkernels.py be used on older commits as well?

Also, if the kernel is getting picked up properly for the micro-kernel benchmarks, how can it not be getting picked up for the operator benchmarks?

Thanks

ngzhian commented 1 year ago

Not sure; if update-microkernels.py is there, you can try it. Otherwise, sync your repo.

Also, if the kernel is getting picked up properly for the micro-kernel benchmarks, how can it not be getting picked up for the operator benchmarks?

I'm not sure; it could be a bug. You can trace an existing microkernel through the operator benchmarks, see where its microkernel file is listed, and make sure your new microkernel file is added in the same places (e.g. CMakeLists.txt).

Darshvino commented 1 year ago

Hi @ngzhian,

I checked how the kernel names are referenced in the different files and where the kernel file is listed (e.g. CMakeLists.txt), and everything looks good. But there seems to be an issue here: https://github.com/google/XNNPACK/blob/4f2dc6081d2fdda6b37fb96ccbeb6dde25ec6538/bench/qu8-gemm-e2e.cc#L50:~:text=xnn_params.qu8.gemm.,%5D%20%3D%20xnn_init_hmp_igemm_ukernel(xnn_igemm_ukernel_fn(igemm1))%3B

I am using the xnn_qu8_gemm_minmax_ukernel_function1 type for gemm and the xnn_qu8_gemm_minmax_ukernel_function type for gemm1; do gemm and gemm1 have to use the same type?

I am not exactly sure what is happening here: https://github.com/google/XNNPACK/blob/4f2dc6081d2fdda6b37fb96ccbeb6dde25ec6538/bench/qu8-gemm-e2e.cc#L50:~:text=xnn_params.qu8.gemm.,%5D%20%3D%20xnn_init_hmp_igemm_ukernel(xnn_igemm_ukernel_fn(igemm1))%3B. It seems like that is where the kernels are assigned to the operator?

ngzhian commented 1 year ago

gemm1 and gemm have the same signature

Yup, that overwrites whatever microkernel is set in init.c with what you specified in the benchmark.

Operators by default get their microkernels from xnn_params, which is set here: https://github.com/google/XNNPACK/blob/4f2dc6081d2fdda6b37fb96ccbeb6dde25ec6538/src/init.c#L526

Overwriting xnn_params allows you to benchmark a specific microkernel.

Darshvino commented 1 year ago

Thank you, @ngzhian! That is the likely reason for the bug: in my case, the signature is not the same for gemm and gemm1.

To correct it, I will keep the same signature for both gemm and gemm1 and change init.c correspondingly; it seems like this should resolve the issue. Do you agree?

Thanks a lot Zhi!

But is there any specific reason why we need both gemm and gemm1 instead of just gemm?

ngzhian commented 1 year ago

gemm1 is specialized for mr = 1; in many cases a more optimized microkernel is possible for that shape. You don't have to implement it: you can pass your gemm as gemm1 (all gemms handle mr values smaller than their mr tile).

Darshvino commented 1 year ago

Thank you, Zhi!

I did not completely get it, Zhi; it would be great if you could explain a bit more. My doubt is: if I implement my custom kernel with a 4x16 tile, then what is the use of the 1x16 gemm?

Also, is it possible to run just xnn_qu8_gemm_minmax_rndnu_ukernel_4x16__neon_mlal_lane instead of all 4 here: https://github.com/google/XNNPACK/blob/master/bench/qu8-gemm-e2e.cc#:~:text=xnn_qu8_gemm_minmax_rndnu_ukernel_4x16__neon_mlal_lane%2C,xnn_qu8_igemm_minmax_rndnu_ukernel_1x16__neon_mlal_lane%2C, like we have here: https://github.com/google/XNNPACK/blob/master/bench/qu8-gemm.cc#:~:text=xnn_qu8_gemm_minmax_rndnu_ukernel_4x16__neon_mlal_lane%2C?

ngzhian commented 1 year ago

All gemm microkernels can handle "up to" the specified mr. E.g. xnn_qu8_gemm_minmax_rndnu_ukernel_4x16__neon_mlal_lane can handle mr == 1, 2, 3 or 4, though it is optimized for mr == 4.

So you can change these 2 lines, https://github.com/google/XNNPACK/blob/master/bench/qu8-gemm-e2e.cc#L518-L519, to xnn_qu8_gemm_minmax_rndnu_ukernel_4x16__neon_mlal_lane and xnn_qu8_igemm_minmax_rndnu_ukernel_4x16__neon_mlal_lane (i.e. use the mr == 4 microkernels as gemm1 and igemm1), and it should work.

In other words, you don't need to implement another microkernel to pass as gemm1; you can reuse your 4x16 kernel just for benchmarking purposes.

Darshvino commented 1 year ago

Hi @ngzhian,

Thanks a lot for all your responses!

I was able to resolve the issue and run the custom kernel through an operator layer. Currently, I am working on applying multi-threading to an operator, following your reply: https://github.com/google/XNNPACK/issues/4266#issuecomment-1433585981

Thanks a lot again Zhi!

Darshvino commented 1 year ago

Hi @ngzhian,

I tried to create a thread pool before this line: https://github.com/google/XNNPACK/blob/master/bench/qu8-gemm-e2e.cc#L60, something like this:

    const size_t num_threads = 4;
    std::unique_ptr<pthreadpool, decltype(&pthreadpool_destroy)> threadpool(
        pthreadpool_create(num_threads), pthreadpool_destroy);

    auto execution_plan = model_factory(threadpool.get());

But it does not seem to be working, Zhi; there is no difference in timing between passing nullptr and passing the threadpool.

Darshvino commented 1 year ago

Ohh I fixed it Zhi!

I just had to also pass the threadpool here: https://github.com/google/XNNPACK/blob/407bef64031c4a5977f67564d74f8a75d211f4b4/bench/qu8-gemm-e2e.cc#L68

It is working now!

Darshvino commented 1 year ago

Hi @ngzhian,

Hope you are doing well.

I had two doubts about the execution of the model:

1.) What exactly happens in create_convolution2d_nhwc() and setup_convolution2d_nhwc() in convolution-nhwc.c? I think those are the main functions of the operator/microkernel interface, and I just wanted to know more about them.

2.) Where does the micro-kernel selection happen during the execution of an operator or a model? For example, between xnn_qu8_gemm_minmax_rndnu_ukernel_4x8__aarch32_neon_mlal_lane_prfm_cortex_a53, xnn_qu8_gemm_minmax_rndnu_ukernel_4x8__aarch32_neon_mlal_lane_cortex_a7, etc.?

It would be great if you could help me understand these two points.

Thanks

ngzhian commented 1 year ago

1) You can read the source to find out: https://github.com/google/XNNPACK/blob/master/src/operators/convolution-nhwc.c#L419. Create does things like packing the weights; setup sets up the parallelization.

2) We set the microkernels in https://github.com/google/XNNPACK/blob/master/src/init.c

Darshvino commented 1 year ago

Thanks a lot, @ngzhian.

I was able to create Doxygen call graphs for XNNPACK and am trying to understand the execution calls.

Darshvino commented 1 year ago

Hi @ngzhian,

I was planning to build XNNPACK with one of the frontends (PyTorch, ONNX Runtime, etc.) on x86 in order to run end-to-end models using the XNNPACK x86 igemm/gemm kernels. I am facing errors while building TFLite with XNNPACK on ARM (RPi 4), so I want to build on an x86 device for the end-to-end results.

1.) Which frontend would you suggest building XNNPACK with on x86? I mean, which frontend is easiest to build with?

2.) For ARM, should I build TFLite+XNNPACK directly on the device or through cross-compilation? Can you please share any good references for building TFLite with XNNPACK on a Raspberry Pi 4?

Thanks!

ngzhian commented 1 year ago

You don't need any frontend to run end2end_bench.

TFLite is the most supported.

See https://github.com/google/XNNPACK/blob/master/scripts/build-android-arm64.sh for how we cross-compile XNNPack. I imagine TFLite has something similar; otherwise it is a good question to ask the TFLite team.

Darshvino commented 1 year ago

We do actually need a frontend to run a model, right? For example, for ONNX or PyTorch models on x86?

ngzhian commented 1 year ago

What do you mean by "run a model"? If you have a TFLite model file, then yes, you need TFLite. If you want to run an existing model like MobileNet V2, we already have it in end2end.h: https://github.com/google/XNNPACK/blob/master/bench/end2end.h

We converted the model into direct calls to the XNNPack API, so no model file is needed at all. You just need to build end2end_bench and run it.

Darshvino commented 1 year ago

Yeah, exactly. I mean running models other than the ones already defined (I think we only have MobileNet V1, V2, V3); I want to test other models like ResNet, and defining them like this seems cumbersome, right: https://github.com/google/XNNPACK/blob/master/models/fp32-mobilenet-v1.cc

So I wanted to go with TFLite plus XNNPACK and build both on the Raspberry Pi, but while building I faced version-compatibility issues, and the build sometimes got stuck due to limited resources on the RPi 4. That is why I planned to build XNNPACK with PyTorch or ONNX Runtime, in order to use the XNNPACK kernels while running PyTorch or ONNX models.

Thanks.

Darshvino commented 1 year ago

Hi @ngzhian,

How would you suggest running end-to-end models: TFLite with XNNPACK, ONNX Runtime with XNNPACK, or PyTorch with XNNPACK?

Thanks

ngzhian commented 1 year ago

TFLite with XNNPack

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/delegates/xnnpack/README.md

or https://blog.tensorflow.org/2020/07/accelerating-tensorflow-lite-xnnpack-integration.html (search for "Try out XNNPACK with your TensorFlow Lite model").

Darshvino commented 1 year ago

Thanks a lot @ngzhian,

Would you suggest building on the Raspberry Pi or cross-compiling?

ngzhian commented 1 year ago

Whichever is easier for you to set up, I guess; there's probably no difference in the output. I usually cross-compile, because we have scripts to help with that (see scripts/).

Darshvino commented 1 year ago

Ohh okay got it.

I was reading the blog post, and it mentions that XNNPACK is already included in the pre-built TFLite binaries:

"The XNNPACK backend is already included in pre-built TensorFlow Lite 2.3 binaries"

Is it easier to use XNNPACK this way?

ngzhian commented 1 year ago

That's only useful if you don't need to make modifications to XNNPack; since you are making modifications, that probably won't work, unless you build TFLite from source (including the XNNPack sources).