mochi-hpc / mochi-ssg

Scalable Service Groups (SSG), a group membership service for Mochi

SSG fails to observe group after timing out #67

Open robertu94 opened 2 years ago

robertu94 commented 2 years ago

Summary

It is desirable to be able to time out and retry observing a group using ssg_group_id_load + ssg_group_refresh. A retry may be needed because the group file is stale: the filesystem has not yet made the latest changes visible. However, if the group file exists but is stale or incomplete, SSG sometimes puts itself in an unrecoverable state and fails to read the file even once it has presumably become correct.
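
For reference, the retry pattern I have in mind looks roughly like the sketch below. It is not the code from the reproducer: `observe_with_retry`, `load_group`, and `refresh_group` are hypothetical names, and the two callables only stand in for whatever wraps ssg_group_id_load and ssg_group_refresh in your bindings.

```python
import time


def observe_with_retry(load_group, refresh_group, retries=5, delay=1.0,
                       sleep=time.sleep):
    """Try load + refresh up to `retries` times, pausing between attempts.

    load_group and refresh_group are hypothetical callables standing in for
    wrappers around ssg_group_id_load and ssg_group_refresh. Each is expected
    to raise an exception on failure.
    """
    last_error = None
    for attempt in range(retries):
        try:
            gid = load_group()       # re-read the group file on every attempt
            refresh_group(gid)       # then ask the group for its current view
            return gid
        except Exception as exc:     # any failure counts as "not ready yet"
            last_error = exc
            if attempt + 1 < retries:
                sleep(delay)         # wait before the next attempt
    raise TimeoutError(f"gave up after {retries} retries") from last_error
```

The important property is that every attempt starts from a fresh load of the group file, so a file that becomes valid later should eventually be picked up. The bug described here is that this does not happen.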

I've attached a reproducer that produces the following output.

Example Output

[runderwood@defiant ssg]$ ./test.py 
CLIENT init
CLIENT load
SERVER ready
SERVER ready
SERVER ready
SERVER ready
SERVER ready
SERVER ready
SERVER ready
SERVER ready
CLIENT Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
CLIENT Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
CLIENT Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
CLIENT Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
CLIENT Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
CLIENT Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
CLIENT Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
CLIENT Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
CLIENT Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
CLIENT Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
CLIENT Error: SSG exceeded max retries for refreshing group
CLIENT failed to contact ssg; retrying in 1sec
CLIENT Error: Unable to deserialize SSG group ID
CLIENT failed to contact ssg; retrying in 1sec
CLIENT Error: Unable to deserialize SSG group ID
CLIENT failed to contact ssg; retrying in 1sec
CLIENT Error: Unable to deserialize SSG group ID
CLIENT failed to contact ssg; retrying in 1sec
CLIENT Error: Unable to deserialize SSG group ID
CLIENT failed to contact ssg; retrying in 1sec
CLIENT failed to connect; gave up after retries
TIMEOUT!!

Specifically, the problem is that the client is unable to deserialize the SSG group ID before hitting the programmatic timeout, even though the servers have already reported ready.

Expected Behavior

You get the behavior that I expect if you delete the mygroup.ssg file before starting both programs.
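
A minimal sketch of that workaround, assuming the group file is named mygroup.ssg and lives in the working directory. The helper name `remove_stale_group_file` is hypothetical and this is not necessarily how the reproducer's --fix flag is implemented.

```python
import os


def remove_stale_group_file(path="mygroup.ssg"):
    """Delete a leftover group file so the client cannot load stale contents.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # No leftover file from a previous run; nothing to do.
        return False
```

With the file gone, the client fails at the "unable to open file" step instead of loading stale contents, and that failure appears to be recoverable on retry.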

[runderwood@defiant ssg]$ ./test.py --fix
Applying Fix
CLIENT init
CLIENT Error: Unable to open file mygroup.ssg for loading SSG group ID
CLIENT failed to contact ssg; retrying in 1sec
CLIENT Error: Unable to open file mygroup.ssg for loading SSG group ID
CLIENT failed to contact ssg; retrying in 1sec
SERVER ready
SERVER ready
SERVER ready
SERVER ready
SERVER ready
SERVER ready
SERVER ready
SERVER ready
CLIENT load
CLIENT refresh
CLIENT size
CLIENT observed 8 members
CLIENT 2808869545400879264 ofi+tcp;ofi_rxm://192.168.1.154:35169
CLIENT 4209866839082429807 ofi+tcp;ofi_rxm://192.168.1.154:44353
CLIENT 7214676906563830725 ofi+tcp;ofi_rxm://192.168.1.154:38829
CLIENT 8619883823355301791 ofi+tcp;ofi_rxm://192.168.1.154:44617
CLIENT 9211154232890242786 ofi+tcp;ofi_rxm://192.168.1.154:43841
CLIENT 10946497040899826855 ofi+tcp;ofi_rxm://192.168.1.154:40949
CLIENT 11757443851160541191 ofi+tcp;ofi_rxm://192.168.1.154:44409
CLIENT 17614477842082925191 ofi+tcp;ofi_rxm://192.168.1.154:43381
CLIENT shutting down server ofi+tcp;ofi_rxm://192.168.1.154:35169
CLIENT shutting down server ofi+tcp;ofi_rxm://192.168.1.154:44353
CLIENT shutting down server ofi+tcp;ofi_rxm://192.168.1.154:38829
SERVER finalizing 2808869545400879264
SERVER finalizing 4209866839082429807
CLIENT shutting down server ofi+tcp;ofi_rxm://192.168.1.154:44617
CLIENT shutting down server ofi+tcp;ofi_rxm://192.168.1.154:43841
SERVER finalizing 8619883823355301791
SERVER server left 4209866839082429807
SERVER finalizing 7214676906563830725
SERVER server left 2808869545400879264
SERVER server left 7214676906563830725
SERVER server left 8619883823355301791
CLIENT shutting down server ofi+tcp;ofi_rxm://192.168.1.154:40949
SERVER finalizing 9211154232890242786
CLIENT shutting down server ofi+tcp;ofi_rxm://192.168.1.154:44409
SERVER finalizing 10946497040899826855
SERVER server left 9211154232890242786
SERVER server left -7500247032809724761
CLIENT shutting down server ofi+tcp;ofi_rxm://192.168.1.154:43381
SERVER finalizing 11757443851160541191
SERVER server left -6689300222549010425
SERVER finalizing 17614477842082925191
SERVER server left -832266231626626425
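
A side note on the log above: the negative IDs in the last three "server left" lines appear to be the same 64-bit member IDs printed as signed integers, which looks like a formatting quirk rather than part of this bug. A quick check in Python, where `to_unsigned64` is just an illustrative helper:

```python
def to_unsigned64(value):
    """Reinterpret a signed 64-bit integer as its unsigned 64-bit value."""
    return value & 0xFFFFFFFFFFFFFFFF


# Negative IDs from the "server left" lines, mapped back to member IDs.
for signed_id in (-7500247032809724761,
                  -6689300222549010425,
                  -832266231626626425):
    print(signed_id, "->", to_unsigned64(signed_id))
```

The three results match members 10946497040899826855, 11757443851160541191, and 17614477842082925191 from the membership list.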

BugReproducer.zip

robertu94 commented 2 years ago

Sorry, the spack.yaml file was incomplete. Here is an updated version. BugReproducer-v2.zip