machinekit / machinekit-hal

Universal framework for machine control based on Hardware Abstraction Layer principle
https://www.machinekit.io
Other
109 stars 62 forks source link

hal_lib does not free or reuse memory from unloaded components until it quits. #142

Open ArcEye opened 6 years ago

ArcEye commented 6 years ago

This issue was discovered in https://github.com/machinekit/machinekit/issues/1123#issuecomment-282989556.

Never actually fixed as it seemed to rely upon an extreme use of halcmd to encounter it.

Since found through @l29ah that there is a M109 script which actually promotes the usage which gives rise to the error.

ArcEye commented 6 years ago

From @l29ah

I regularily hit the halcmd show pin foo hanging problem when i poll temperature on my 3d printer on am3358; 3f1e265 here.

ArcEye commented 6 years ago

Can you move this to the relevant Issue tracker as per email on the list https://groups.google.com/forum/#!topic/machinekit/I70IfT-wan0

Issue tracker will be https://github.com/machinekit/machinekit-hal/issues

Will also need to explain what exactly you are doing and why you think that particular commit causes it.

There is a known problem with repeatedly polling halcmd pin to get an output instead of doing it in a programmatic way.

1123 (comment)

I think it is may be due to the way memory is ordered on boundaries to enable atomic operations. This would result in orphaned memory and if an operation is repeated a huge number of times, you run out of hal memory.

ArcEye commented 6 years ago

From @l29ah I am at the issue tracker :) I'm using the included nc_files/M109 to wait for the temperature to settle. As far i understand, this IS a programmatic way. Sometimes it will never return, and poking halcmd suggesting the problem outlined in #1123 (comment) As far as i understood, this issue is considered fixed, so i'm writing to say it is not. The commit hash is the history point where i observe the behaviour; i don't tell that the commit is the case.

ArcEye commented 6 years ago

You are at CLOSED general Issue tracker that mentions the problem amoungst many others.

The problem is it will remain closed, so to air this issue you need to open a new one.

I'm using the included nc_files/M109 to wait for the temperature to settle. As far i understand, this IS a programmatic way.

There is nothing programmatic about using a bash script to repeatedly call halcmd and then try to parse the output.

I was referring to finding the hal_pin_t struct for the pin name in question and reading its _data_ptr_addr for the value. This doesn't have the side effects of repeatedly loading the whole halcmd component

A comment that github decided to hide for some reason, showed that using halpr_find_pin_by_name() , which does what I described above, ran a huge number of times without issue (10,000 times or equivalent to running M109 for 2.777 hours continuously)

Just run a program testmem via halcmd loadusr -W testmem 10000 that does 10,000 iterations of halpr_find_pin_by_name(), gets the value and prints the result. Ran to the end without any issues.

The issue was never 'solved', it just appeared an extreme use of halcmd which appeared unlikely to be encountered. I was not aware there was a M code script which did exactly the same thing, but then I am not into plastic squirting :stuck_out_tongue_winking_eye:

I will move this into a separate issue, in the new repo and see if I can find time write a user component which takes a pin name and value and outputs TRUE when the value is met or exceeded, or similar.

Will have to look at how this is used though, I imagine the call to M109 is blocking and only returns when the bed is up to temp, thus pausing the GCode.

ArcEye commented 6 years ago

The larger issue is to look hard at hal memory allocation, to see if there is a way of tracking the unused memory before the boundary used to start allocation and free that too when the component instance is freed ( or suchlike)

l29ah commented 6 years ago

Yes it's blocking. What would be a better way to implement simple custom g-code commands addressing the hal parameters? Seems like the BBB is having a hard time calling the scripts as even M104 causes it to pause for a second or two.

ArcEye commented 6 years ago

You can try accessing the pin from gcode. That should access a read only copy directly rather than launching halcmd.

I think you need FEATURES=12 in your ini file and then something like this to call

;; need to set e0.temp.set to same as param #1 first

o<m109> sub
    #<_tempReq> = #1
    #<_startTemp> = #<_hal[Therm.temp0]>
    o101 while [#<_hal[Therm.temp0]> LT #<_tempReq>]
        G4 P1
    o101 endwhile
    (DEBUG, "Starting temp was #<_startTemp>")
    (DEBUG, "Temp now is #<_hal[Therm.temp0]>")
o<m109> endsub
m2

This is off the top of my head and untested as I don't have anything that uses those pins, but something similar should work OK.

A starting point would be to issue a setp e0.temp.set <temp> in the HAL Configuration window of Axis and then run the MDI panel and repeatedly enter (DEBUG, #<_hal[Therm.temp0]>) to get a read out of increasing temp.

Once satisfied it works you could implement a sub similar to above and have a user M code which simply sets the required temp and call that first.

ArcEye commented 6 years ago

Just tested accessing hal pins still works and it does.

Put into the [RS274NGC] section of the ini file FEATURES = 12

Then issue a MDI command to test, example below checks if axisui.run-disable is set

Output is to bottom right of axis. (output is always fp even if pin is a bit)

selection_064

ArcEye commented 6 years ago

I have now written a C component to replace M109 (albeit presently reading the value of halui.flood.is-on, because I don't have the required pins to test on)

Test file is a short gcode file which moves to X20 Y20, then call M109 P1 and waits for halui.flood.is-on value to change to 1

G40 G90 G21 
g0 X20 Y20 
M109 P1
g0 x0 y0
M2

The print below shows it went through over half a million iterations (final figure when I got bored and stopped it was 628494), reading the pin and outputting its value.

selection_068

So I think we can conclusively say that will solve the problem experienced with repeatedly polling halcmd to do the same.

I will now try to add one that targets the Therm.temp0 pin to the build.

Are you able to test with a source build on your system, or do you just use packages?

l29ah commented 6 years ago

Yes, i can.

ArcEye commented 6 years ago

Excellent, I have the test component building in the tree now, just need to rewrite to target Therm.temp0.

I will push to my repo and come back to you when done. You will need to clone my repo, checkout the new-M109 branch and build.

I will add a new M code M119 to just set the temperature. Then you can use M119 P<temp> to set the temperature and then M109 P<temp> to wait for it

Probably won't get time to finish until tomorrow, will notify when ready

l29ah commented 6 years ago

On Sat, Aug 11, 2018 at 09:12:35AM -0700, ArcEye wrote:

I will add a new M code M119 to just set the temperature.

You should use M104 (or M140, for heated bed) for non-blocking setting of the target temperature, since it's used by everyone else in 3D printing, i think.

-- () ascii ribbon campaign - against html mail /\ http://arc.pasp.de/ - against proprietary attachments

ArcEye commented 6 years ago

OK, I'll be guided by you, didn't realise they were already there, makes things easier.

What I don't understand is why all the instructions talk about using M1xx Snn

It should be M1xx Pnn http://www.machinekit.io/docs/gcode/m-code/#sec:M100-to-M199

You can in some circs use a third arg called S, but there can be problems with it getting mixed up with a speed setting AFAIR.

l29ah commented 6 years ago

You're right, it is Pnn for setting the target temperature, and it works this way for me. Slic3r has a machinekit g-code output format that generates Pnn codes, but apparently RepRap-style gcode interpreters use Snn, and S0 is reserved for switching the heater off completely regardless the temperature :/

ArcEye commented 6 years ago

OK, glad I wasn't going mad, I'll come back to you tomorrow

ArcEye commented 6 years ago

The component and Submakefile changes all build OK so looks ready to test.

The changes are at https://github.com/ArcEye/machinekit/commit/132643b2acc70eb961a9d9a7c3140bbcd9c56dd7

Clone my machinekit repo, checkout branch new-M109 and then build a RIP.

Be sure to run in a terminal, I have left in the debug prints to stderr, which will show the pin value read and the value you told it to wait for. Hopefully this will give a safety net so you can be sure it has the right values, the temperature is rising and it returns at the right point.

I await your results.

l29ah commented 6 years ago

It doesn't set the target temperature like original M109 did and expected to behave, so everyone'll have to pre-process g-code before feeding it to machinekit if M109 is not prepended by M104. OTOH, slic3r does call M104 before M109, at least in simple cases.

ArcEye commented 6 years ago

Can you test it works, call M104 first for now. I can add something later to set the temp as well.

ArcEye commented 6 years ago

There is an issue with parameters to M codes that I need to check on. Machinekit passes a float value to the component, even if the actual value in gcode is essentially bool eg. 1 becomes 1.000000

Need to check the value received and the one read from the pin are converted properly to integers, but I have run out of time for now, return to it tomorrow sometime hopefully

l29ah commented 6 years ago
can't open none
can't open none
Counter: 1 Last value: 0 Target = 110
Counter: 2 Last value: 0 Target = 110
Counter: 3 Last value: 0 Target = 110
Counter: 4 Last value: 0 Target = 110
Counter: 5 Last value: 0 Target = 110

I feel like i've misconfigured something, even though i've corrected the source to point to a proper signal name. Too bad i can't run this thing under gdb.

ArcEye commented 6 years ago

can't open none can't open none

That output did not come from the component. Looks like the output from a python script that has been passed an empty string.

i've corrected the source to point to a proper signal name

What is that name? I just copied the old script. Also I thought it was a pin not a signal?
Only pins normally have <component-name>.<pin-name> designations.

I need to find some more time to experiment with this, using pins that I do have and choosing one that is going to increase in value despite the gcode progress being blocked by the M109 call.

Probably easiest to write another component that gets switched on by a fake M104 call then increments a counter which can be read by M109.

ArcEye commented 6 years ago

I have updated the M109 component with a generic writing function which can be tidied later. It now writes whatever value is passed at M109 P<value> to e0.temp.set and reads Therm.temp0 until it matches the value.

I tested by running this component

component counter;

pin in u32 target = 0;
pin out u32 count = 0;
option singleton yes;               
option userspace yes;

author "ArcEye <arceyeATmgwareDOTcoDOTuk>";
license "GPL";
;;

#include <stdio.h>    /* Standard input/output definitions */
#include <stdlib.h> 
#include <stdint.h>   /* Standard types */
#include <unistd.h>   /* UNIX standard function definitions */

void user_mainloop(void)
{
int inc = 0;

    while(1)
    {
    usleep(500000);
        if(target > 0)
        {
        if(count >= target)
        exit(0);
        else
        count = ++inc;
        }
    }
    exit(0);
}

and changing the pin names in M109 to counter.target and counter.count respectively.

The result was M109 does exactly as expected, in the screenshot below, receives a value of 50, sets counter.target to 50 and then waits until counter.count is 50 and exits.

selection_069

The new M109 is in my repo.

l29ah commented 6 years ago

Also I thought it was a pin not a signal?

Yes, it's a signal:

newsig e0.temp.meas  float

I've used configs/ARM/BeagleBone/ as a reference for my hal configuration.

ArcEye commented 6 years ago

I looked at configs/ARM/BeagleBone/CRAMPS and got even more confused.

e0.temp.set appears to not have a writer so can be written to, so I have added a function to write to signal instead of pin.

Are you suggesting reading a signal to get the temperature, I think that is what e0.temp.meas carries. You could just read the pin Therm.ch-04.value which is linked to it

However, I don't know why the original M109 script has a completely different pin it reads for the temperature (Therm.temp0) and you were apparently using this script and it worked.

l29ah commented 6 years ago

On Mon, Aug 13, 2018 at 08:50:08AM -0700, ArcEye wrote:

You could just read the pin Therm.ch-04.value which is linked to it

Oh, thanks, will try it.

However, I don't know why the original M109 script has a completely different pin it reads for the temperature (Therm.temp0) and you were apparently using this script and it worked.

I adjusted it to e0.temp.meas.

-- () ascii ribbon campaign - against html mail /\ http://arc.pasp.de/ - against proprietary attachments

ArcEye commented 6 years ago

Right. I will change the component to write to signal e0.temp.set to set the temperature and read from Therm.ch-04.value to get the temperature.

Let you know when it is in the repo.

l29ah commented 6 years ago

Okay, now it reads the temperature, but only once:

Counter: 1 Last value: 161 Target = 220
Counter: 2 Last value: 161 Target = 220
Counter: 3 Last value: 161 Target = 220
Counter: 4 Last value: 161 Target = 220
Counter: 5 Last value: 161 Target = 220
ArcEye commented 6 years ago

The new component is in the repo, try that.

The print shows it read the pin 5 times, its just that the value stayed the same.

l29ah commented 6 years ago
Counter: 486 Last value: 161 Target = 220

And i observe the temperature really changing by other means.

ArcEye commented 6 years ago

OK you need to identify what pin actually matches the temperature. The component is reading what the pin's value is, it just isn't changing.

l29ah commented 6 years ago

It does change.

‰ halcmd gets e0.temp.meas
29.2                                                
‰ halcmd getp Therm.ch-06.value
28.8

while

Counter: 2146 Last value: 161 Target = 220
ArcEye commented 6 years ago

Unless you have edited it, it is not reading Therm.ch-06.value and that pin does not appear in the CRAMPS.hal file.

l29ah commented 6 years ago

Sure i've edited it.

ArcEye commented 6 years ago

I am getting nowhere trying to do something for a board and system I don't have and can't test.

I am also far from sure that the pins and signals are universal, thus hard coding them in a component does not make much sense. (eg. Therm.ch-06.value does not exist in CRAMPS configs, but you are using it)

Feel free to use the component as a basis for something for your own uses, it works fine on amd64 as demonstrated, don't know why BBB is a problem.

l29ah commented 6 years ago

On Tue, Aug 14, 2018 at 03:32:20AM -0700, ArcEye wrote:

I am getting nowhere trying to do something for a board and system I don't have and can't test.

I can give you the ssh access to the thing if you feel like poking it. Send me your ssh public key.

-- () ascii ribbon campaign - against html mail /\ http://arc.pasp.de/ - against proprietary attachments

ArcEye commented 6 years ago

It might come to that, but I don't have time for a couple of days - come back to you.

ArcEye commented 6 years ago

Returning to the original problem.

The clearest way to see what is happening is to start realtime and run this: watch -n 0.1 halcmd show heap

With each iteration of loading hallib and halcmd, then unloading them, the available HAL heap is shown to reduce.

user@INTEL-i7:/prog/Qt5/QtTestMem$ DEBUG=5 realtime restart
rtapi_msgd command:  /usr/src/machinekit-hal/libexec/rtapi_msgd --instance=0 --rtmsglevel=5 --usrmsglevel=5 --halsize=524288
rtapi_app command:  /usr/src/machinekit-hal/libexec/rtapi_app_posix --instance=0
user@INTEL-i7:/prog/Qt5/QtTestMem$ halcmd show heap
total_avail=261464 fragments=1 largest=261464
user@INTEL-i7:/prog/Qt5/QtTestMem$ halcmd show heap
total_avail=261256 fragments=1 largest=261256
user@INTEL-i7:/prog/Qt5/QtTestMem$ halcmd show heap
total_avail=261048 fragments=1 largest=261048
user@INTEL-i7:/prog/Qt5/QtTestMem$ halcmd show heap
total_avail=260840 fragments=1 largest=260840:

Loss per iteration 208 bytes

ArcEye commented 6 years ago

NB. This not a tutorial, I had nothing to do with the HAL memory rewrite. It is exploration notes in case required later.

The HAL memory functions have changed considerably from legacy.

Note a legacy comment

From hal.h (regards the hal_free() mechanism)

hal_malloc() allocates a block of memory from the main HAL shared memory area. It should be used by all components to allocate memory for HAL pins and parameters. It allocates 'size' bytes, and returns a pointer to the allocated space, or NULL (0) on error. The returned pointer will be properly aligned for any variable HAL supports (see HAL_TYPE below.) __The allocator is very simple, and there is no 'free'. It is assumed that a component will allocate all the memory it needs during initialization. The entire HAL shared memory area is freed when the last component calls hal_exit(). This means that if you continuously install and remove one component while other components are present, you eventually will fill up the shared memory and an install will fail. Removing all components completely clears memory and you start fresh.__

Components ( of which halcmd is one ) use a component struct

From hal_lib.c

__The alloc_xxx_struct() functions allocate a structure of the appropriate type and return a pointer to it, or 0 if they fail. They attempt to re-use freed structs first, if none are available, then they call hal_malloc() to create a new one. The free_xxx_struct() functions add the structure at 'p' to the appropriate free list, for potential re-use later.__

Which tends to suggest that the last halcmd struct memory area that was freed, should be automatically allocated to the next one.

However halpr_alloc_comp_struct() has been commented out of hal_priv.h, appears as a forward declaration on hal_comp.c, but is not called or fleshed out anywhere.

That function appears now covered by hal_object.c void* halg_create_objectf(false, sizeof(hal_comp_t), HAL_COMPONENT, 0, name);

This allocates memory according to whether byte alignment is required, using either shmalloc_desc_aligned(size, global_data->hal_descriptor_alignment); or shmalloc_desc(allocsize);

If the memory allocation succeeds but the initialisation of the object fails for some reason shmfree_desc( void *p) is used to free it

(rtapi_heap.c) This in turn calls rtapi_free(&hal_data->heap, p); which calls _rtapi_unlocked_free(h, p);

This function appears to deal with freeing aligned memory as well as unaligned and contains several distinctive info prints, none of which appear in a log from repeatedly calling halcmd as described.

This strongly suggests, the memory from the halcmd component is not being freed.

There is also no suggestion that the memory space occupied by the last iteration is reused.

ArcEye commented 6 years ago

A hal_comp_t struct is 96 bytes. Both hal_lib and halcommand are essentially components so they account for 192 of the missing 208 bytes per iteration.

In the case of this script

DEBUG=5 realtime restart
halcmd newinst debounce db pincount=4
halcmd newthread servo 1000000 fp
halcmd addf db servo
halcmd start
watch -n 0.1 halcmd show pin db.funct.time

normal sequence of each iteration of this in the log output

Feb 28 14:23:39 INTEL-i7 rtapi:0: 4:rtapi_app:14901:user pid=14901 flavor=rt-preempt gcc=4.9.2 git=unknown
Feb 28 14:23:39 INTEL-i7 rtapi:0: 4:rtapi_app:14901:user pid=14901 flavor=rt-preempt gcc=4.9.2 git=unknown
Feb 28 14:23:39 INTEL-i7 msgd:0: ulapi:16104:user _ulapi_init(): ulapi rt-preempt unknown loaded
Feb 28 14:23:39 INTEL-i7 msgd:0: ulapi:16104:user halg_xinitfv:271 HAL: singleton component 'hal_lib16104' id=1014 initialized
Feb 28 14:23:39 INTEL-i7 msgd:0: hal_lib:16104:user --halcmd show pin db.funct.time
Feb 28 14:23:39 INTEL-i7 msgd:0: hal_lib:16104:user halg_exit:293 HAL: removing component 1016 'halcmd16104'
Feb 28 14:23:39 INTEL-i7 msgd:0: hal_lib:16104:user ulapi_hal_lib_cleanup:235 HAL: lib_module_id=1014
Feb 28 14:23:39 INTEL-i7 msgd:0: hal_lib:16104:user halg_exit:293 HAL: removing component 1014 'hal_lib16104'
Feb 28 14:23:39 INTEL-i7 msgd:0: hal_lib:16104:user halg_exit:315 HAL: hal_errorcount()=0
Feb 28 14:23:39 INTEL-i7 msgd:0: hal_lib:16104:user halg_exit:316 HAL: _halerrno=0

According to hal_lib.c, the hal_lib component is not initialised until the first component is created. So calling halcmd newinst db debounce pincount=4 loads a hal_lib instance and then the component debounce, from which db is instantiated. The hal_lib instance is what creates the HAL heap, and it is not cleaned until that hal_lib instance exits, via hal_sweep().

Looks like as long as the hal_lib created for the original component creation exists, the original legacy comment is correct, if you continuously install and remove one component while other components are present, you eventually will fill up the shared memory and an install will fail

ArcEye commented 6 years ago

Armed with that knowledge, change the script to make the hal_lib exit after each printout

# !/bin/bash

DEBUG=5 realtime restart
halcmd show heap
halrun -U

The resulting print shows no loss of memory because the heap is freed and re-initialised between each call to halcommand

rtapi_msgd command:  /usr/src/machinekit-hal/libexec/rtapi_msgd --instance=0 --rtmsglevel=5 --usrmsglevel=5 --halsize=524288
rtapi_app command:  /usr/src/machinekit-hal/libexec/rtapi_app_posix --instance=0
total_avail=261464 fragments=1 largest=261464
<commandline>:0: Realtime threads stopped
rtapi_msgd command:  /usr/src/machinekit-hal/libexec/rtapi_msgd --instance=0 --rtmsglevel=5 --usrmsglevel=5 --halsize=524288
rtapi_app command:  /usr/src/machinekit-hal/libexec/rtapi_app_posix --instance=0
total_avail=261464 fragments=1 largest=261464
<commandline>:0: Realtime threads stopped
rtapi_msgd command:  /usr/src/machinekit-hal/libexec/rtapi_msgd --instance=0 --rtmsglevel=5 --usrmsglevel=5 --halsize=524288
rtapi_app command:  /usr/src/machinekit-hal/libexec/rtapi_app_posix --instance=0
total_avail=261464 fragments=1 largest=261464
<commandline>:0: Realtime threads stopped

Not that this is a method that can be used :frowning_face:

ArcEye commented 6 years ago

The problem it now appears is the failure to free memory used by halcmdXXX and hallibXXX instances.

They account for all bar 16 bytes of the 208 bytes leakage at each iteration. The remaining 16 bytes could be due to boundary alignment or maybe a pointer (or 2). (the default boundary alignment for each malloc is 8 bytes, suggesting this is the cause)

The legacy linuxcnc system always assumed that all components would be loaded 'en-bloc' and thereafter no more are possible.
Even the new instantiated components are seldom loaded 'on the fly' even though they could be.

The repeated loading and unloading of a component was not envisaged and was specifically warned against.

The way to fix this, outside of a root and branch rewrite of HAL memory routines, appears to me to be to instigate the mechanism which hal_lib.c appeared to suggest was used but is not present, the re-use of memory space for components which are reloaded. There is commented out code in linuxcnc also suggesting this course.

It could be an option to have some memory set aside specifically for this, when the HAL heap is set up, which would simplify the process greatly, over having a constantly changing linked list of memory addresses to step through for instance.

Thinking time and do a scan of the code pre multicore, to see if the implementation did exist at that time and can be reused.

l29ah commented 6 years ago

Maybe related to my temp measurement problems: it seems to me like halcmd commands live in their own world when the gcode interpreter is crunching g1's: i try manipulating the pins of my pid instance, but it doesn't seem to influence its behaviour, even though getp tells me the values i've set. I'm not sure if i'm missing anything important.

ArcEye commented 6 years ago

I don't know what you mean.

When the interpreter is on a blocking M code, it does nothing until it returns. All the pin values are available, as my component to replace M109 showed

halcmd is a component. The sequence which is causing this problem is thus.

hal_lib loads a base component, which gives access to the hal library functions and sets up the hal memory heap and data structure.

Each halcmd usage, loads an iteration of hal_lib\<PID> as a component and then an iteration of halcmd\<PID>. When the command it was called for is completed, both the hal_lib\<PID> and halcmd\<PID> are unloaded but the memory is not freed.
The component has all its pins and params freed etc. and its header is marked as 'invalid', but it still lives in the list of loaded components and occupies the memory it was allocated.

Only when realtime is stopped and the base hal_lib is closed, does that call hal_sweep(), which iterates through all the components in the list, freeing their memory back to the hal heap, before that in turn is freed.

I suspect this dates back to the inception of instantiated components. Because a 'base component' is loaded and then instances are instantiated from it, strict housekeeping would be required to track all the instantiations and ensure all was freed on shutdown.

This works absolutely fine 99.9% of the time, because people do not keep launching components after the machine is initialised.

However this breaks in the corner cases where halcmd show terminal output is used repeatedly to read a pin value, which is an inherently bad idea anyway, because it relies upon a rigidly formatted output and can be broken by a single change to that output text.

Because the iterations of hal_lib\<PID> and halcmd\<PID> do not return their memory to the hal heap, eventually there is no memory left to allocate.

The issue of boundary alignment was a 'red herring'.
The usual way of dealing with this is to have a pointer to the real start of the memory allocated behind the actual pointer issued. Thus a pointer of 07ff8888 would be the start of the aligned memory to be used and the pointer which holds the actual start point to be used when freeing memory, would be at the address, 07ff8888 - (sizeof pointer). Then you free sizeof(component structure) + sizeof(alignment offset) It looks like the code is doing something similar.

summerlotus513 commented 1 year ago

I also encountered this problem recently when using halcmd, is there a solution?