uniba-swt / swtbahn-cli

A client-server command line interface for the SWTbahn.
GNU General Public License v3.0
7 stars 3 forks source link

Potential Deadlocks #78

Closed eyip002 closed 1 year ago

eyip002 commented 2 years ago

Debugging an unresponsive swtbahn-server:

Backtraces generated when swtbahn-server becomes unresponsive:

#3  0x0002981c in dyn_containers_set_train_engine_instance (grabbed_train=0x4a28c <grabbed_trains>, train=0xb2b64426 "cargo_db", engine=0xb2b64436 "libtrain_engine_default (unremovable)")
    at /home/pi/Desktop/SWTbahn/swtbahn-cli/server/src/dyn_containers_interface.c:388
#4  0x0002cae8 in grab_train (train=<optimized out>, engine=<optimized out>) at /home/pi/Desktop/SWTbahn/swtbahn-cli/server/src/handler_driver.c:287
#2  0x76a52fb0 in usleep (useconds=<optimized out>)
    at ../sysdeps/posix/usleep.c:32
#3  0x00019a80 in dyn_containers_set_train_engine_instance ()
#4  0x0001d7a4 in grab_train ()
#2  0x76a52fb0 in usleep (useconds=<optimized out>)
    at ../sysdeps/posix/usleep.c:32
#3  0x00019c2c in dyn_containers_free_train_engine_instance ()
#4  0x0001d2b0 in free_train ()

Probably fixed:

#1  0x76cb5f44 in __GI___pthread_mutex_lock (
    mutex=0x3d3e0 <grabbed_trains_mutex>) at pthread_mutex_lock.c:80
#2  0x0001d5e4 in grab_train ()
#1  0xb6cb9f44 in __GI___pthread_mutex_lock (mutex=0x3a3e#0  __lll_lock_wait (futex=futex@entry=0x3a3e8 <grabbed_trains_mutex>, private=0) at lowlevellock.c:438 <grabbed_trains_mutex>)
    at pthread_mutex_lock.c:80
#2  0x0001c6cc in train_grabbed (train=0xf9c06 "cargo_db")
    at /home/pi/Desktop/SWTbahn/swtbahn-cli/server/src/handler_driver.c:79
#1  0x76cb5f44 in __GI___pthread_mutex_lock (mutex=0x3e3bc <start_stop_mutex>)
    at pthread_mutex_lock.c:80
#2  0x0001b928 in handler_startup ()
#2  0x76a4ffb0 in usleep (useconds=<optimized out>) at ../sysdeps/posix/usleep.c:32
#3  0x0001aa80 in dyn_containers_free_interlocker_instance ()
#4  0x0001c30c in free_all_interlockers ()
eyip002 commented 2 years ago

Note: Fixing one deadlock may resolve other deadlocks.

eyip002 commented 1 year ago

Writes to the variables in_use in dyn_containers.forec might not be committed properly or transferred to the outputs.

Trace

1. Driver request: Release grabbed train

Server response: Released grabbed train External/internal state:

train_engine_instances_io[0] 
    input_grab: 0, 0 
    input_release: 0, 0 
    input_train_engine_type: 0, 0 
    input_requested_speed: 0, 0 
    input_requested_forwards: 1, 1 
    output_in_use: 0, 0 
    output_train_engine_type: -1, -1 
    output_nominal_speed: 0, 0 
    output_nominal_forwards: 1, 1 

2. Driver request: Grab cargo_db

Server response: No response! Dynamic train engine container response:

External/internal state:

train_engine_instances_io[0] 
    input_grab: 1, 1 
    input_release: 0, 0 
    input_train_engine_type: 0, 0 
    input_requested_speed: 0, 0 
    input_requested_forwards: 1, 1 
    output_in_use: 0, 1                    <-- Internal state not able to written to external state!
    output_train_engine_type: -1, 0 
    output_nominal_speed: 0, 0 
    output_nominal_forwards: 1, 1 
eyip002 commented 1 year ago

Trace

1. Drive request: Set DCC speed to 13

Server response: DCC train speed set to 13

Jun 29 16:36:50 raspberrypi3 swtbahn[18450]: server: Request received: Set train speed
Jun 29 16:36:50 raspberrypi3 swtbahn[18450]: server: engineInstance0: libtrain_engine_default (unremovable).tick()  13 1 -> 13 1
Jun 29 16:36:50 raspberrypi3 swtbahn[18450]: libbidib: Set train: cargo_db to speed: 13 via board: master (0x00 0x00 0x00 0x00) with action id: 637
Jun 29 16:36:50 raspberrypi3 swtbahn[18450]: server: Request: Set train speed - train: cargo_db speed: 13

External/internal state:

* dyn_containers_reaction_counter: 147021 
* dyn_containers_actuate_reaction_counter: 146588 
* dyn_containers_set_train_engine_instance_reaction_counter: 0 

train_engine_instances_io[0] 
    input_grab: 0, 0 
    input_release: 0, 0 
    input_train_engine_type: 0, 0 
    input_requested_speed: 13, 13 
    input_requested_forwards: 1, 1 
    output_in_use: 1, 1 
    output_train_engine_type: 0, 0 
    output_nominal_speed: 13, 13 
    output_nominal_forwards: 1, 1 

2. Driver request: Set DCC speed to 0

Server response: DCC train speed set to 0

Server did not log any messages!

External/internal state:

* dyn_containers_reaction_counter: 147083 
* dyn_containers_actuate_reaction_counter: 146723 
* dyn_containers_set_train_engine_instance_reaction_counter: 0 

train_engine_instances_io[0] 
    input_grab: 0, 0 
    input_release: 0, 0 
    input_train_engine_type: 0, 0 
    input_requested_speed: 0, 13          <-- External state not able to written to internal state!
    input_requested_forwards: 1, 1 
    output_in_use: 1, 1 
    output_train_engine_type: 0, 0 
    output_nominal_speed: 13, 13 
    output_nominal_forwards: 1, 1

Counter values some time later:

* dyn_containers_reaction_counter: 147083              <---ForeC program has stopped reacting...
* dyn_containers_actuate_reaction_counter: 211279 
* dyn_containers_set_train_engine_instance_reaction_counter: 0 
eyip002 commented 1 year ago

Seems like the problem is with the ForeC compiler generating erroneous code for getting the current wall time. The elapsed reaction can be a negative number, which can happen when clock_gettime() fails and 0 is returned for the current time. This can lead to a very large delay value being passed into usleep(). The clue was in this backtrace:

#2  0xb6a5f8e0 in usleep (useconds=<optimized out>) at ../sysdeps/posix/usleep.c:32
        ts = {tv_sec = 4294, tv_nsec = 956530000}
#3  0xb6b0eb88 in forecMain (args=<optimized out>) at /home/pi/SWTbahn/swtbahn-cli/server/bin/dyn_containers.c:1154

Here, the delay passed into usleep is over 4294 seconds (71 minutes). So, if we waited for 71 minutes, the dynamic containers would become responsive again.

This ForeC compiler defect has been fixed: https://github.com/PRETgroup/ForeC/commit/cb1867e70082e5cea3eb08ef075071f1bb6db546

eyip002 commented 1 year ago

Tested for over 3 hours (over 1 million reactions) and still works 👍