stefanbesler / twingrind

Call-graph profiling for TwinCAT 3
GNU General Public License v3.0
24 stars 4 forks source link

Understanding my task bottleneck #12

Closed captain-yoshi closed 1 year ago

captain-yoshi commented 1 year ago

I have a task which is set to a 1kHz loop as configured below:

lol

This is the Task online graph which shows the CPU time (don't know if it's the mean?) spent on the task. Notice that there are a lot of Exceeded counter, e.g. > 1 milliseconds.

TwinGrind-Slowest-ok

This graph represents the last execution time from Tc2_Utilities.PROFILER. It seems to corroborate the Exceeded counter: certain spikes goes above the 1kHz.

TwinGrind-Slowest

This graph was generated using the slowest Capture Mode in continuous mode, which gives 69.65% for my task. That represents about 696 microseconds. This would confirm that the code is well under 1 millisecond.

Untitled

So for my question: Is my assumption correct ? Does that mean that code is less then 1 kHz but there is something else that can't execute it properly under 1kHz ? This could be the OS, how the task is configured (priorities), I/O's, etc... Note that this was generated on a non-Beckhoff PC with Windows 10 (i3-2100 @ 3.1 GHz).

stefanbesler commented 1 year ago

Yes, I think your assumption is correct. And on Hardware that is not optimized for real time this behaviour is pretty normal. Did you run the profiler in continuous mode for a bit? Maybe you where just lucky and didn’t get a spike?

Maybe try to disable hyperthreading and do the other settings in Windows and BIOS that Beckhoff recommends. There are also some undocumented settings that you will only get from Beckhoff support, i.e. there is at least 1 undocumented registry variable, which helps quite a bit (I don’t remember the name, something with RT).

I usually aim for a baseline of < 30%. Especially if other somewhere like image processing is running on the same computer spikes aren’t be expected and just from experience I know that below 30% you are on the save side.

Your call graph looks pretty reasonable, and I don’t really see an issue from there. ChainJointToJacobianSolver seems a bit slow and maybe you can optimize the calculations? What I found is that some mathematical functions like EXPT are very slow and should be avoided

Are there calls to additional libraries for which you could enable profiling to see more? Maybe you can try to use more cycles for the calculations?

captain-yoshi commented 1 year ago

Thank you for your detailed answer! Coming from a c++ world, so I'm still learning TwinCAT/Beckhoff ecosystem.

Did you run the profiler in continuous mode for a bit? Maybe you where just lucky and didn’t get a spike?

Yup. I made sure to see at least 15 high spikes while in continuous mode.

ChainJointToJacobianSolver seems a bit slow and maybe you can optimize the calculations?

There are a lot of matrix calculations done by a third library that could be optimized, e.g, a lot of matrices that are passed by copy instead of reference and some calculations that could be even faster.

Are there calls to additional libraries for which you could enable profiling to see more?

I will enable profiling for the matrix libraries: TcMatrix and TcGeometry.


I will try to disable hyperthreading + other recommandations first and try to optimize the matrix calculations next.

seehma commented 1 year ago

You can switch I/O Idle Task to 1ms and above (2ms), this is a supporting tasks for Twincat and it usually does not need to be that fast. Maybe it would be a good idea, if you are on hyperthreading, to isolate two cores and separate the plc task on one isolated cores from the other two on the second isolated core.

However, in your listing I/O Idle Task is very fast, maybe some of the exceed counts are coming from this one?

If you do not use multithreading, choose your base time higher -> if 1kHz is the cycletime of your choice, then try to set base time to 1ms. Every base-time-tick, the scheduler starts running and evaluates if something else needs time, this is not needed here, because there is no other task so we do not want/need to evaluate.

One more thing i saw in your pictures above, you have 2048kb of stack, do you need that much stack? allocating variables on stack is very slow, maybe here are some points to find performance-pacmans? Try to use VAR_INST in methods whereever possible to be as local as needed with your variables and avoid stack-allocation on method-execution.

captain-yoshi commented 1 year ago

I get much more stable behavior when disabling hyper-threading and using an isolated core. No more Exceeded counter for the last 15 minutes.

TaskOnline-Graph

And the LastExecTime from TwinCAT profiler's object:

LastExecTime-Graph

The task seems to have reduced in computation also... I would have thought the task to use 70% instead of 54%, but the overhead might be the I/O or something else.

Grind-Task1-Frame8

captain-yoshi commented 1 year ago

@seehma

However, in your listing I/O Idle Task is very fast, maybe some of the exceed counts are coming from this one?

Did not check my I/O Idle Task, but for now it does nothing. Reading 1.2 usecs on the Online graph for this part.

if 1kHz is the cycletime of your choice, then try to set base time to 1ms

Interesting ! I will give it a try as I do not use multithreading.

One more thing i saw in your pictures above, you have 2048kb of stack, do you need that much stack?

I don't think I need that much. I try to pass things by reference or use member variables (declared in the FB) for member functions. Did not know about VAR_INST. There are definitely methods where I should use this.

stefanbesler commented 1 year ago

I don't use TcMatrix or TcGeometry, but doing a quick check on TcMatrix I saw that it uses VAR_IN_OUT and VAR_IN_OUT CONSTANT, which is effectively the same as using VAR_INPUT with references, so it should be fast and doesn't create copies. They are open source so maybe try to add twingrind profiling to them, maybe you see something suspicous?

The codesys compiler is not optimizing during compiling, not even simple things like loop-unrolling. That's why for matrix operations I prefer to unroll the loop myself (e.g. when I implement such method I don't use for loops, but write everything out, I looks bad, but it actually helps) If you have for loops with dynamic ranges, e.g. where the bounds call a method or need evaluation ( FOR: i:= function1() TO function2() or FOR i:=0 TO count-1) is super slow. In C++ the compiler usually figures out if the bounds are actually static so you don't have to think about this stuff, in structured text you have to be a bit careful, some tips:

Of course I don't know what your code does, but using 60% with "only" calculating some kinematics of a 6 axis robot seems a lot.

captain-yoshi commented 1 year ago

I had the impression that there were optimizations in TwinCAT... This might be only for c++. So in structured text there is no optimizations, I see.

Yeah the TcMatrix uses references but TcTransform (which I wrongfully named TcGeometry) has a lot of sections like this which passes a 3x3 and a 3x1 object by copie if I'm not mistaken :

image

Thank you for these tips this is super useful !

Yeah 36% (based on the latest graph) is a lot for kinematics of a 6DOF. In c++, with some optimization, it is super fast ! I based my code on a c++ library that does this kinda of stuff.

I will profile the 2 matrix libraries and see where I should prioritized my optimizations.

stefanbesler commented 1 year ago

I see, and yes this is not good...

With TwinCAT C++ you get all the optimizations, I don't think that Beckhoff is turning these off, should be everything there that you would get from any other program you implement with the vsct.

The reason i ported ruckig to twincat (struckig) was actually to see how the codesys compiler is holding up to C++. It is actual not so bad as I orginally anticipated, if I recall correctly the twincat structure text implementation is about 5 times slower. I simply use pass-by-reference everywhere and try not to use the stack where it is not needed and ruckig itself is written in a way to avoid branches

captain-yoshi commented 1 year ago

I have enabled twingrind on the TcMatrix and TcTransform libraries. But when I try to reconstructs a callgrind file using the process command it fails because of an max recursion depth error.

A part for not being able to visualize the data, I assume that this is bad in codesys and could potentially slow down the task a lot. I'm pretty sure this has to do with one of the libraries. Is there a way to change the max depth recursion ?

$ twingrind process -m hashmap
INFO:root:Connecting :851
DEBUG:root:Trying to connect to profiler at Profiler
DEBUG:root:Trying to connect to profiler at Profiler
INFO:root:Fetching callstacks from PLC with
    entrypoint = Profiler
    max_stacksize = 320000
    max_frames = 10
    max_cycletime (s) = 0.001
    tasks = 1
INFO:root:Fetched Callstack 0 (Task 1) with calls 22179 to ./callstack_frame_0_task_1
DEBUG:root:Reconfiguring profiler to initial setup
INFO:root:Reconstructing callstack ./callstack_frame_0_task_1
DEBUG:root:Callstack size=44358
INFO:root:Extracted 22179 calls
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Scripts\twingrind.exe\__main__.py", line 7, in <module>
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\__main__.py", line 80, in main
    pytwingrind.reconstruct.run(args["hashmap"], callstack, args["directory"], "")
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 133, in run
    build_graph(n, hm, ['root'], data)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 60, in build_graph
    build_graph(network, hashmap, roots + [sid], data, sid+1, endid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 60, in build_graph
    build_graph(network, hashmap, roots + [sid], data, sid+1, endid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  [Previous line repeated 2 more times]
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 60, in build_graph
    build_graph(network, hashmap, roots + [sid], data, sid+1, endid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 60, in build_graph
    build_graph(network, hashmap, roots + [sid], data, sid+1, endid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  [Previous line repeated 24 more times]
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 60, in build_graph
    build_graph(network, hashmap, roots + [sid], data, sid+1, endid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  [Previous line repeated 833 more times]
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 60, in build_graph
    build_graph(network, hashmap, roots + [sid], data, sid+1, endid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 60, in build_graph
    build_graph(network, hashmap, roots + [sid], data, sid+1, endid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 63, in build_graph
    build_graph(network, hashmap, roots, data, endid+1, eid)
  [Previous line repeated 112 more times]
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 60, in build_graph
    build_graph(network, hashmap, roots + [sid], data, sid+1, endid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 60, in build_graph
    build_graph(network, hashmap, roots + [sid], data, sid+1, endid)
  File "C:\Users\lilou\AppData\Local\Programs\Python\Python312\Lib\site-packages\pytwingrind\reconstruct.py", line 44, in build_graph
    endid = np.where(np.logical_and(data[sid+1:eid, StackRow.HASH] == data[sid, StackRow.HASH],
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded

callstack_frame_0_task_1.zip

seehma commented 1 year ago

@seehma

However, in your listing I/O Idle Task is very fast, maybe some of the exceed counts are coming from this one?

Did not check my I/O Idle Task, but for now it does nothing. Reading 1.2 usecs on the Online graph for this part.

if 1kHz is the cycletime of your choice, then try to set base time to 1ms

Interesting ! I will give it a try as I do not use multithreading.

One more thing i saw in your pictures above, you have 2048kb of stack, do you need that much stack?

I don't think I need that much. I try to pass things by reference or use member variables (declared in the FB) for member functions. Did not know about VAR_INST. There are definitely methods where I should use this.

I/O Idle Task starts needing more of your computation-power when you add Ethercat-Devices and when you do some parametrization, State-machine-changes of the slaves or asynchronous access to parameters. The more devices, the more computation power is needed by this task. However, most of the things done there is asynchronous stuff, and because of that you can move this task and let it run slower.

stefanbesler commented 1 year ago

You can try to edit reconstruct.py and increase the recursion limit with sys.setrecursionlimit(n)

Maybe try to find the issue in TcTransform and get rid of the issue itself, maybe it is also a bug in Twingrind, I’ll look into it

additionally to your callstack, can you please also upload the hash map file?

stefanbesler commented 1 year ago

Please try with this build of twingrind (it includes the latest commit https://github.com/stefanbesler/twingrind/commit/93b27f6b9130953f7907b86912b16a01717be2f7

https://github.com/stefanbesler/twingrind/suites/17570325444/artifacts/1003853059

I added a new parameter ("--recursion-limit 2000") for reconstruct and process and made the default recursionlimit = 2000, which is sufficient for your callstack. Please update me if it worked and I am curious to see the visualization, I only have it with anonouymous method names, because I don't have the hashmap

image

captain-yoshi commented 1 year ago

Here is the hashmap file. I will try with the new build later today and report back. hashmap.zip

stefanbesler commented 1 year ago

I think we found your culprit

image

You have a total of 22 Matrix Multiplications and they use 30% of the task

TcMatrix::GetRC is called a lot and doing loads of checks in for loops that are not desirable for you. I like the code of TcMatrix, but since the codesys compiler is not optimizing, it is pretty slow doing it like that. A C++ compiler would inline all these method and turn to code into a direct access of a memory location and it you properly be ok to do it like that. However, in maths function I think it is better to avoid unnecessary calls and do everything as direct as possible even if it means some code duplication. (just my personal opinion)

If you want this faster I think you need to use or implement a matrix library that is more bare bone and directly operates on memory. I know that Zeugwerk Framework has some algebra implemented and that it directly operates on memory. You'll need a license I think, but there is a trial available via Twinpack, if you are interested in this you might want to contact @seehma Maybe you can also multiply the matrices in a program like Mathematica, let it simplify for you and use the formulas provided by it? I guess there are a lot of components 0 and cancel out in the final product?

captain-yoshi commented 1 year ago

Super! The latest build works.

I started adding code duplication in some places, like this to have more speed. I will try to add multiple matrix products of pre-defined sizes and try to access the array directly, avoiding the GetRC and GetI and multiple checks.

I will also check the Zeugwerk Framework.

BTW Thanks for the amazing library and your time! It really is great to solve these kind of things.

captain-yoshi commented 1 year ago

Just to conclude on my optimizations. So without too much efforts, I was able to reduce my task execution time:

image

stefanbesler commented 1 year ago

Awesome!! Thanks for the feedback