Destruction of teams and contexts made from teams

gmegan commented 5 years ago

In the last call, several questions were opened about automatic team destruction.

Does shmem_finalize destroy all existing teams? This is not currently specified.

When a team is destroyed, what happens to the contexts created from that team? Are they destroyed? In particular, if there are private contexts created from the team, they cannot be touched by a thread other than the one that created them, so this would seem to indicate that they would not be destroyed when the team is destroyed.

If a team is destroyed and there are still existing contexts created from that team, what is returned from shmem_ctx_get_team? How is p2p operation behavior semantically defined for contexts with no existing team?

nspark commented 5 years ago

Here are my initial thoughts...

> Does shmem_finalize destroy all existing teams?

No, not any more than shmem_finalize frees any memory allocated with shmem_malloc and friends. That is, the runtime should clean up sufficiently to allow a subsequent job to use the node/system, but it shouldn't explicitly track and free team resources.

> When a team is destroyed, what happens to the contexts created from that team? Are they destroyed?

I think it should be undefined to access a context that was created from a team that has already been destroyed. Again, I don't think the team should necessarily need to track the contexts created from it. An implementation may do so, if it makes sense for that implementation; but I don't want to require it.

> If a team is destroyed and there are still existing contexts created from that team, what is returned from shmem_ctx_get_team? How is p2p operation behavior semantically defined for contexts with no existing team?

My vote is for undefined behavior on both counts. A context should not be valid if the team from which it was created has already been destroyed.

gmegan commented 5 years ago

From the call yesterday, undefined behavior for using team resources after team destruction seems like the most straightforward option.

It seems that the semantics of team destruction could be stated as follows: upon successful return from the team destruction routine, any resources assigned to the team will be freed. Since that would include network resources, any context made from the team would need to be destroyed before the team is destroyed; otherwise it would be tying up resources that should be freed.

Since the team destroy routine running in thread 0 would not be able to do anything about contexts existing in other threads, what happens with the network resources seems like it would depend on a number of things. Maybe the implementation lets the resources leak, with or without a warning, and the contexts just keep working until they are destroyed by the other threads. Maybe it frees the resources and then crashes horribly when some context object attempts to use those resources and its references are invalid.

It seems like it could be stated that:

The team destruction routine fails if some resource cannot be eventually freed without further user involvement. It seems overly restrictive to require that internal structures outside of user control be guaranteed free at call return. The user should just not have to worry about them anymore after this.
Using a context created from a team after the team is destroyed results in undefined behavior.

This would allow flexibility as long as it is clear that a context is, itself, NOT a team resource but a consumer of team resources. Stated this way, an implementation could legally do any of the following in the case where a team is destroyed with contexts still existing:

Case 1: Not track anything related to contexts created from the team, and just free resources blindly and return success from team destroy for any valid team. Using contexts that rely on those resources later would probably cause the program to blow up.
Case 2: Track reference counters of contexts and not destroy any resource if it might still be in use, then return failure from team destroy and leave the team intact. Contexts would still work and the user would have to retry team destruction later to free team resources.
Case 3: Keep full forward and backward mappings from teams to contexts. When destroying a team, all the resources are marked to be freed and team destruction is successful in any case. If any contexts are still in use, they are marked invalid and further use of them results in errors or warnings. When those contexts are destroyed, the resources are freed automatically since the context was marked invalid. If a team creation fails due to resources still being stuck, some error message includes info about all of the resources still waiting to be freed.
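
To make Case 2 concrete, here is a rough sketch of the reference-counting idea in plain C. It does not use the OpenSHMEM API at all: `mock_team_t`, `mock_ctx_t`, and the `mock_*` functions are hypothetical stand-ins for implementation-internal state, and the return codes (0 for success, nonzero for failure) are an assumption made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for implementation-internal state.
 * These are NOT OpenSHMEM types. */
typedef struct {
    bool valid;     /* false once the team has been destroyed */
    int  live_ctxs; /* contexts created from this team, not yet destroyed */
} mock_team_t;

typedef struct {
    mock_team_t *team; /* team whose resources this context consumes */
} mock_ctx_t;

/* Create a context from a team. Returns 0 on success, nonzero on failure. */
static int mock_team_create_ctx(mock_team_t *team, mock_ctx_t *ctx) {
    assert(ctx != NULL);
    if (team == NULL || !team->valid) {
        return -1;
    }
    ctx->team = team;
    team->live_ctxs++;
    return 0;
}

/* Destroy a context and drop its reference on the parent team. */
static void mock_ctx_destroy(mock_ctx_t *ctx) {
    assert(ctx != NULL);
    if (ctx->team != NULL) {
        ctx->team->live_ctxs--;
        ctx->team = NULL;
    }
}

/* Case 2: refuse to destroy the team while any context still holds
 * its resources. Returns 0 on success, nonzero on failure. */
static int mock_team_destroy(mock_team_t *team) {
    if (team == NULL || !team->valid) {
        return -1;
    }
    if (team->live_ctxs > 0) {
        return 1; /* team left intact; the caller may retry later */
    }
    team->valid = false;
    return 0;
}
```

With this sketch, calling `mock_team_destroy` while a context is still alive returns nonzero and leaves the team usable; after `mock_ctx_destroy`, a retry succeeds. Case 1 would correspond to dropping the `live_ctxs` check entirely.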

From the programmer's perspective, it is simple: always destroy all contexts before the team to avoid a world of sadness. It's good to check for failure of team destroy if you are debugging, since there are all kinds of reasons resources could get stuck... but you can't rely on failure of team destroy to find leaked contexts (or leaked heap, etc.) in your program unless you have some special implementation that does that specifically.

gmegan commented 5 years ago

Resolved with changes to team destroy API

gmegan / specification

Destruction of teams and contexts made from teams #57