ewencp closed this issue 11 years ago
You could try http://graphics.stanford.edu/~danielrh/microLockErase.patch or, if that's too messy/complex, the simpler
http://graphics.stanford.edu/~danielrh/scopedErase.patch
Basically, they release the lock during the reset period. Is there anything wrong or risky with that?
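For readers without access to the patches, here is a minimal sketch of one way to read "release the lock during the reset". It is not taken from either patch or from SST. `Registry`, `Conn`, `onConnDestroyed`, and `close` are hypothetical names, and the sketch assumes the connection's destructor calls back into code that takes the same mutex. The entry is removed from the map while the lock is held, and the last reference is dropped only after the lock has been released.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <utility>

struct Registry;

// Hypothetical connection whose destructor calls back into its owner.
struct Conn {
    Registry* owner;
    explicit Conn(Registry* r) : owner(r) {}
    ~Conn();
};

// Hypothetical stand-in for the object that owns the map of connections.
struct Registry {
    std::mutex mtx;
    std::map<int, std::shared_ptr<Conn>> conns;
    int cleanups = 0;

    void add(int id) {
        std::lock_guard<std::mutex> guard(mtx);
        conns[id] = std::make_shared<Conn>(this);
    }

    // Called from ~Conn(); takes the same (non-recursive) mutex.
    void onConnDestroyed() {
        std::lock_guard<std::mutex> guard(mtx);
        ++cleanups;
    }

    bool close(int id) {
        std::shared_ptr<Conn> victim;
        {
            // The map is only touched while the lock is held.
            std::lock_guard<std::mutex> guard(mtx);
            auto it = conns.find(id);
            if (it == conns.end()) return false;
            victim = std::move(it->second);
            conns.erase(it);
        }
        // The lock is released here, so the destructor's callback can
        // take it again without deadlocking.
        victim.reset();
        return true;
    }

    std::size_t size() {
        std::lock_guard<std::mutex> guard(mtx);
        return conns.size();
    }
};

inline Conn::~Conn() { owner->onConnDestroyed(); }
```

The point of the pattern is that every map mutation stays under the lock, while the teardown work that might re-enter the lock runs outside it. Calling `victim.reset()` inside the locked scope would deadlock in this sketch, because `std::mutex` is not recursive.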
@danielrh Thanks, went with a very slightly different solution, but gist is the same.
I sometimes get SST hanging trying to clean up. Here's the stack trace where it's hanging:
There's nothing obviously wrong with the code or the stack trace. It's hung looping in the rebalancing method for the map. I think this is due to memory corruption because of different threads trying to operate on the map of connections at the same time.
If you look at the comment in Connection::closeConnection, we don't acquire a lock because this should only be happening during shutdown. It used to only be called in the ConnectionManager destructor, but we added a call to this in ConnectionManager::stop() because otherwise SST will keep everything from shutting down. This means that this will get called during normal operation with multiple threads active and possibly accessing SST streams/connections.
The obvious solution would be to just acquire the lock mentioned in the comment, but the comment also suggests we'd deadlock if we did. If it's safe to do so, we could just convert it to a recursive_mutex.
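As a rough illustration of the recursive_mutex option, the sketch below shows a shutdown path that holds the lock and calls a per-connection close that takes the same lock again. `ConnTable`, `closeOne`, and `stopAll` are hypothetical names, not SST's API, and the map's value type is a placeholder.

```cpp
#include <map>
#include <mutex>

// Hypothetical table of connections guarded by a recursive mutex.
struct ConnTable {
    std::recursive_mutex mtx;
    std::map<int, int> conns;  // id -> placeholder connection state

    // Safe to call on its own or from a caller that already holds mtx.
    void closeOne(int id) {
        std::lock_guard<std::recursive_mutex> guard(mtx);
        conns.erase(id);
    }

    // Holds the lock for the whole shutdown and re-enters it in closeOne().
    void stopAll() {
        std::lock_guard<std::recursive_mutex> guard(mtx);
        while (!conns.empty()) {
            closeOne(conns.begin()->first);
        }
    }
};
```

With a plain `std::mutex`, the nested lock in `closeOne()` would be undefined behavior and in practice a deadlock. Note that `std::recursive_mutex` only permits re-locking by the thread that already owns it, so this helps only if the re-entrant call happens on the same thread. It would not help if the close path waits on another thread that needs the lock.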
This happens fairly frequently with some code I've been testing, but the test isn't small and has some other issues. If there's a patch with a possible fix I can test it before it gets committed.