pioneers / runtime

Firmware for the PiE kit robots and devices

[FATAL] Net Handler "Failed to malloc" on send_log_msg() #136

Closed levincent06 closed 3 years ago

levincent06 commented 3 years ago

The Net Handler CLI fails to malloc after being idle for ~3 minutes. After this happens, sending another run mode command exits the Net Handler CLI without warning. Attempting to restart the Net Handler CLI doesn't work (Client is neither Dawn nor Shepherd), and any attempt to switch run mode doesn't register in shared memory. The workaround for now is to restart the Raspberry Pi when this happens.

Steps to reproduce(?):

  1. Run all four components of runtime with a connected KoalaBear
  2. Switch between IDLE/AUTO multiple times
  3. Wait a few minutes

Might be able to reproduce by having just shared memory and net handler. May or may not be linked to

benliao1 commented 3 years ago

Added some debugging logs and print messages in tcp_conn.c and logger.c to try to catch this error when it occurs. So far it only happens on Vincent's raspi and Ashwin's docker-on-raspi.

levincent06 commented 3 years ago

Update! This was reproducible on Daniel's Pi as well. To reproduce it consistently, write student code that refers to connected devices in global variables. The following code is valid, should run as expected, and is sufficient:

MOTOR = '6_<valid uid>'
def autonomous_setup():
    pass
def autonomous_main():
    pass
def teleop_setup():
    pass
def teleop_main():
    pass

Steps to Reproduce

Standard Dawn/Runtime Setup

  1. Have Dawn open. Upload a valid student code file if it's not already on the Pi.
  2. Start up Runtime manually without systemd
     a. SHM Dashboard
     b. Have KoalaBear connected with uid <valid uid> in student code, then open Dev Handler
     c. Net Handler (Dawn should autoconnect)
     d. Executor CLI (should automatically use our valid student code)

Standard Run Mode Switching

  1. Dawn: Send TELEOP
  2. Dawn: Send IDLE

Upload invalid student code

  1. Dawn: Change the MOTOR uid to something invalid. Changing one digit should be fine.
  2. Dawn: Upload the invalid student code
  3. Executor CLI: Refresh the student code being used
     i. student code, followed by studentcode

Execute invalid student code

  1. Dawn: Send TELEOP. Executor CLI should throw a DeviceError/subscription error because no device with the specified uid is connected
  2. Dawn: Send IDLE

Observe error

  1. Wait 3 to 4 minutes.
  2. See Fail to malloc payload error

We'll continue to investigate this issue and see if we can reproduce it with as few steps as possible.

levincent06 commented 3 years ago

The problem was that send_log_msg() was being called over 20 times per second and returning early on an EOF from the log FIFO without freeing the memory it had allocated, so the leak grew until malloc failed. The fix is to properly free() any allocated memory before returning. We also now require that there is no EOF before calling this function. Issue #82 has been re-opened to investigate this EOF.
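For anyone landing here later, this is a minimal sketch of the leak pattern and the fix, not the actual runtime code. The names send_log_msg_sketch, alloc_payload, free_payload, leaked_after_eof_calls and MAX_LOG_LEN are all hypothetical, and a pipe with its write end closed stands in for the log FIFO at EOF. The point is only that every return path after the malloc releases the buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_LOG_LEN 512 /* hypothetical buffer size, not taken from runtime */

/* Number of payload buffers currently allocated; nonzero after a call means a leak. */
int live_buffers = 0;

char *alloc_payload(void) {
    char *buf = malloc(MAX_LOG_LEN);
    if (buf != NULL) {
        live_buffers++;
    }
    return buf;
}

void free_payload(char *buf) {
    if (buf != NULL) {
        free(buf);
        live_buffers--;
    }
}

/*
 * Hypothetical stand-in for send_log_msg(): reads one chunk from the log
 * FIFO and forwards it to out_fd.
 * Returns the number of bytes written, 0 on EOF, or -1 on error.
 */
ssize_t send_log_msg_sketch(int fifo_fd, int out_fd) {
    assert(fifo_fd >= 0);

    char *payload = alloc_payload();
    if (payload == NULL) {
        return -1; /* allocation failed, so there is nothing to free */
    }

    ssize_t n = read(fifo_fd, payload, MAX_LOG_LEN);
    if (n <= 0) {
        /* EOF (0) or read error (-1): free before the early return.
         * Omitting this free leaks one buffer per call. */
        free_payload(payload);
        return n;
    }

    ssize_t written = write(out_fd, payload, (size_t)n);
    free_payload(payload); /* the normal path releases the buffer too */
    return written;
}

/*
 * Calls the sketch `calls` times against a pipe whose write end is closed,
 * so every read() reports EOF. Returns the number of buffers still
 * allocated afterwards, or -1 if the pipe could not be created.
 */
int leaked_after_eof_calls(int calls) {
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }
    close(fds[1]); /* no writer left: read() on fds[0] returns 0 */
    for (int i = 0; i < calls; i++) {
        (void)send_log_msg_sketch(fds[0], STDOUT_FILENO);
    }
    close(fds[0]);
    return live_buffers;
}
```

Calling leaked_after_eof_calls(3600) approximates three minutes at 20 calls per second and should leave no buffers allocated. Removing the free_payload() call in the EOF branch would leave one buffer allocated per call instead.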