Floating Point Operations and uC/OS III Misbehaving

chaseblock commented 3 years ago

Discovered while investigating #378.

Introducing floating point operations in certain places within the code seem to cause uC/OS to enter a hardfault somewhere around OSSched().

For a reproduction case, see Test_Tasks.c on the RTOS branch. This executes fine. However, when Task2 is modified as below, it hardfaults as described. This behavior has also been seen in the CAN_Task test (see the linked issue). When Task1 is modified in a similar way, everything still appears to work.

The change:

diff --git a/Tests/Test_Tasks.c b/Tests/Test_Tasks.c
index 23f71b19..6864cbef 100644
--- a/Tests/Test_Tasks.c
+++ b/Tests/Test_Tasks.c
@@ -68,6 +68,10 @@ void Task2(void *p_arg) {
     OS_ERR err;

     while(1) {
+
+        volatile float f = 1.8;
+        f += 0.5;
+
         OSSemPost(&SafetyCheck_Sem4,
                     OS_OPT_POST_ALL,
                     &err);

For the hardfault debugging registers: see these dumps from GDB.

>>> p/x *(uint32_t*)(0xE000ED28) # CFSR
$2 = 0x100
>>> p/x *(uint32_t*)(0xE000ED2C) # HFSR
$3 = 0x40000000

This puts us squarely in IBUSERR territory. I unfortunately haven't been able to find many reports of people experiencing these issues, and haven't been able to work around it so far.

@ClarkPoon @JimothyGreene I'm wondering if anyone has any ideas on this. I'm going to keep experimenting.

@SijWoo have you ever seen anything like this?

chaseblock commented 3 years ago

After consulting with some sources:

This is probably due to uC/OS not properly handling the floating point registers during a context switch, and losing track of where on the stack the PC is when it attempts to return from a context switch.

Our options as I see it are:

Don't use uC/OS. I think it may be a bit too late to choose this option, unless y'all like the other two even less.
Look into fixing up the context switch code within the RTOS to support floating point register saving. I don't know if I have the time for this right now. Maybe I'll look into it in a bit. If anyone else is willing to take a look into that though, let me know and I'll see if I can help.
Disable the floating point unit. I can confirm that with the floating point unit disabled, the hardfaults do not seem to be occurring. The downside to this is that we lose hardware support for floating point arithmetic, so all float math is going to be much slower (and implemented via larger amounts of library software). This is probably the easiest solution, but if we choose this, we will want to eliminate any unnecessary floating point math, probably converting most of it to fixed point (see Q).

@ClarkPoon @JimothyGreene I would like your opinions on these options within the next few days so that we aren't blocked by this for a long while.

@Cam0Cow @ErickCortez98 This possibly affects your system as well, although I haven't really looked yet. It might be good for y'all to take a look at any floating point math that y'all are doing and ask whether or not it needs to be floats.

ClarkPoon commented 3 years ago

I looked through the BPS code, and it looks like we are mainly using floating point math for the following:

Our voltage and temperature safety thresholds are floats. I think these used to be fixed point (in millivolts/milli-Celsius), so it could be converted to fixed point.
Pretty much all of the measurements we are sending over CAN are floats (voltage, temperature, current). Using floats here is unavoidable unless we change our CAN requirements, which would affect array and sunlight, since they are reading these messages.
Converting voltages on the temperature probes to temperatures. This can probably be converted to integer math.
Calculating state of charge. This looks straightforward to convert to integer math.

Do you know how long it takes to execute a floating point division on the stm32f413 without the FPU? I am concerned about point 3 because we are expecting to send over 900 floating point values over CAN each second (10 x 31 voltages + 10 x 62 temperatures). This seems like it could be quite CPU intensive if each typecast + division takes a while. It might be worth it to consider a fourth option:

Remove as much floating point math as possible, but keep the FPU on and treat floating point operations as a critical section (eg. disable interrupts when we do them).

If we chose to leave the FPU enabled, we would have to make sure we have a handler set up to catch any faults caused by this issue

ClarkPoon commented 3 years ago

From a BPS perspective, I think the best option would be to remove all floating point and change the CAN spec (which would change the data format that array and sunlight receive from BPS). What do you think @chaseblock @JimothyGreene @afnanmmir @dimembermatt

lhr-solar / BPS

Floating Point Operations and uC/OS III Misbehaving #388