gmegan / specification

OpenSHMEM Application Programming Interface
http://www.openshmem.org

Destruction of teams and contexts made from teams #57

gmegan closed this issue 5 years ago

gmegan commented 5 years ago

In the last call, several questions were raised about automatic team destruction.

Does shmem_finalize destroy all existing teams? This is not currently specified.

When a team is destroyed, what happens to the contexts created from that team? Are they destroyed? In particular, if there are private contexts created from the team, they cannot be touched by a thread other than the one that created them, so this would seem to indicate that they would not be destroyed when the team is destroyed.

If a team is destroyed and there are still existing contexts created from that team, what is returned from shmem_ctx_get_team? How is p2p operation behavior semantically defined for contexts with no existing team?

nspark commented 5 years ago

Here are my initial thoughts...

Does shmem_finalize destroy all existing teams?

No, not any more than shmem_finalize frees any memory allocated with shmem_malloc and friends. That is, the runtime should clean up sufficiently to allow a subsequent job to use the node/system, but it shouldn't explicitly track and free team resources.

When a team is destroyed, what happens to the contexts created from that team? Are they destroyed?

I think it should be undefined to access a context that was created from a team that has already been destroyed. Again, I don't think the team should necessarily need to track the contexts created from it. An implementation may do so, if it makes sense for that implementation; but I don't want to require it.

If a team is destroyed and there are still existing contexts created from that team, what is returned from shmem_ctx_get_team? How is p2p operation behavior semantically defined for contexts with no existing team?

My vote is for undefined behavior on both counts. A context should not be valid if the team from which it was created has already been destroyed.

gmegan commented 5 years ago

From the call yesterday, undefined behavior for using team resources after team destruction seems like the most straightforward option.

It seems that the semantics of team destruction could be stated as follows: upon successful return from the team destruction routine, any resources assigned to the team will have been freed. Since that would include network resources, any context made from the team would need to be destroyed before the team is destroyed; otherwise it would be tying up resources that should be freed.

Since the team destroy routine running in thread 0 would not be able to do anything about contexts existing in other threads, what happens with the network resources seems like it would depend on a number of things. Maybe the implementation lets the resources leak, with or without a warning, and the contexts just keep working until they are destroyed by the other threads. Maybe it frees the resources and then crashes horribly when some context object attempts to use those resources and its references are invalid.

It seems like it could be stated that

This would allow flexibility as long as it is clear that a context is, itself, NOT a team resource but a consumer of team resources. Stated this way, an implementation could legally do any of the following in the case where a team is destroyed with contexts still existing:

From the programmer's perspective, it is simple: always destroy all contexts before the team to avoid a world of sadness. It's good to check for failure of team destroy if you are debugging, since there are all kinds of reasons resources could get stuck... but you can't use failure of team destroy to find leaked contexts (or leaked heap, etc.) in your program unless you have some special implementation that does that specifically.
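
To make the "context is a consumer of team resources" idea concrete, here is a minimal toy model in C of one behavior mentioned above: team destroy is called while a context still exists, the resources stay held, and they are released only when the last context goes away. Everything here is hypothetical (`toy_team`, `toy_ctx`, and all `toy_*` functions are made up for illustration and are not part of OpenSHMEM), and the nonzero return from `toy_team_destroy` is just an assumption about how an implementation might choose to report deferred cleanup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model for illustration only; none of these names are
 * part of OpenSHMEM. "resources_held" stands in for network resources. */
typedef struct {
    int  live_ctxs;          /* contexts currently consuming team resources */
    bool destroy_requested;  /* toy_team_destroy has been called */
    bool resources_held;     /* true until the team's resources are released */
} toy_team;

typedef struct {
    toy_team *team;          /* team whose resources this context consumes */
} toy_ctx;

void toy_team_init(toy_team *t) {
    assert(t != NULL);
    t->live_ctxs = 0;
    t->destroy_requested = false;
    t->resources_held = true;
}

/* A context is not owned by the team; it only registers as a consumer. */
toy_ctx toy_ctx_create(toy_team *t) {
    assert(t != NULL && !t->destroy_requested);
    t->live_ctxs++;
    toy_ctx c = { t };
    return c;
}

/* Returns 0 if resources were released immediately, or 1 if release was
 * deferred because contexts created from the team still exist. */
int toy_team_destroy(toy_team *t) {
    assert(t != NULL);
    t->destroy_requested = true;
    if (t->live_ctxs > 0) {
        return 1;            /* resources remain held by live contexts */
    }
    t->resources_held = false;
    return 0;
}

/* Destroying the last context of an already-destroyed team releases the
 * resources that the earlier team destroy could not free. */
void toy_ctx_destroy(toy_ctx *c) {
    assert(c != NULL && c->team != NULL);
    toy_team *t = c->team;
    c->team = NULL;
    t->live_ctxs--;
    if (t->destroy_requested && t->live_ctxs == 0) {
        t->resources_held = false;
    }
}
```

In this sketch, calling `toy_ctx_destroy` on every context before `toy_team_destroy` is the only ordering in which team destroy returns 0 and frees everything on the spot, which mirrors the "contexts first, then the team" advice. A real implementation could just as legitimately free the resources right away and leave the stale contexts invalid.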

gmegan commented 5 years ago

Resolved with changes to team destroy API