grc-iit / ChronoLog

ChronoLog: A High-Performance Storage Infrastructure for Activity and Log Workloads
https://chronolog.dev
BSD 2-Clause "Simplified" License

ChronoVisor occasionally hangs in AcquireStory call #100

Closed ibrodkin closed 4 months ago

ibrodkin commented 8 months ago

ChronoVisor occasionally hangs, with the ClientVisorPortal::AcquireStory request.respond never returning.

It appears that the hang is more likely to happen the closer together in time the AcquireStory calls from multiple client processes are. The case can occasionally be reproduced using "mpiexec -n 4 client_lib_multi_storytellers" with chronolog version "2023-10-01". It looks like the clientVisor channel execution stream never gets execution time again once it gets into this state. New requests from new Clients are not accepted, while other ChronoVisor threads/execution streams continue processing as expected, and communication with the ChronoKeeper is not affected at all. Attempts to shut down ChronoVisor services gracefully when it is in this state hang midway as well.

ibrodkin commented 7 months ago

I did some more digging on this issue, and it turns out that both execution streams (threads) currently allotted to client request processing are hanging on registryLock acquisition in KeeperRegistry::notifyKeepersOfStoryStart(), even though the debugger shows that registryLock has been successfully acquired by one of the threads.

The registryLock is used in notifyKeepersOfStoryStart() and notifyKeepersOfStoryStop() to guard against the Keeper process we are trying to notify unregistering while the notify call is iterating over the KeeperRegistry map. We can get away without acquiring the lock for the time being, as the pool of Keepers in our first release is static. We will have to revisit this case when we introduce dynamic addition/removal of Keeper processes.
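For readers unfamiliar with the pattern, here is a minimal sketch of what a lock like registryLock protects: iteration over a registry map that could otherwise race with an unregister call. This is not ChronoLog's actual code. The class SketchRegistry, the struct KeeperEntry, the use of std::mutex, and the "notification" recorded as a string are all assumptions made for illustration; only the names registryLock and notifyKeepersOfStoryStart come from the comments above.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a registered Keeper process (not ChronoLog's type).
struct KeeperEntry {
    std::string address;
    std::vector<std::string> notifications;  // records what we "sent"
};

// Hypothetical registry illustrating the locking pattern only.
class SketchRegistry {
public:
    void registerKeeper(std::uint32_t id, const std::string& address) {
        std::lock_guard<std::mutex> lock(registryLock);
        keepers[id] = KeeperEntry{address, {}};
    }

    void unregisterKeeper(std::uint32_t id) {
        std::lock_guard<std::mutex> lock(registryLock);
        keepers.erase(id);
    }

    // Holds the lock for the whole iteration so that no entry can be
    // erased by unregisterKeeper() while it is being visited.
    // Returns the number of keepers notified.
    std::size_t notifyKeepersOfStoryStart(const std::string& story) {
        std::lock_guard<std::mutex> lock(registryLock);
        std::size_t notified = 0;
        for (auto& [id, keeper] : keepers) {
            keeper.notifications.push_back("start:" + story);
            ++notified;
        }
        return notified;
    }

private:
    std::mutex registryLock;
    std::map<std::uint32_t, KeeperEntry> keepers;
};
```

With a static pool of Keepers, unregisterKeeper() is never called while notifications are in flight, which is why dropping the lock in the notify path is safe for now. Once Keepers can join or leave dynamically, the iteration needs this kind of protection again.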

ibrodkin commented 4 months ago

changes merged