Patch for GossipGraD algorithm

What does this PR do? Please describe: Currently, the GossipGraD algorithm increments state.iteration every time comm_hook is called and later changes the topology based on state.iteration. This is incorrect, because comm_hook can be called multiple times during a single backward pass, so the counter advances faster than the number of training iterations. This patch addresses the issue.
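To make the miscount concrete, here is a minimal, framework-free sketch. It is not the torchdistx code: NaiveState, naive_comm_hook, and run_backward_passes are hypothetical names, and the figure of four hook calls per backward pass is an assumption chosen only for illustration.

```python
class NaiveState:
    """Hypothetical stand-in for the pre-patch state; not the torchdistx class."""

    def __init__(self):
        self.iteration = 0


def naive_comm_hook(state):
    # Pre-patch behavior as described above: every hook call bumps the counter.
    state.iteration += 1


def run_backward_passes(state, hook, num_passes, calls_per_pass):
    # Simulate backward passes in which the hook fires several times each
    # (calls_per_pass is an assumed value, not something measured).
    for _ in range(num_passes):
        for _ in range(calls_per_pass):
            hook(state)
    return state.iteration


naive_iterations = run_backward_passes(
    NaiveState(), naive_comm_hook, num_passes=3, calls_per_pass=4
)
print(naive_iterations)  # 12, although only 3 backward passes ran
```

Under these assumptions, any topology schedule keyed on state.iteration would advance four times faster than intended.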

GossipGraD now requires a num_modules parameter, which is used to determine the correct point at which to switch topology.
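One way such a parameter could be used is sketched below. This is an illustration under stated assumptions, not the patch itself: it assumes comm_hook fires exactly num_modules times per backward pass, and CountingState, hook_calls, and the modulo-based topology choice are hypothetical.

```python
class CountingState:
    """Hypothetical state that counts hook calls separately from iterations."""

    def __init__(self, num_modules):
        if num_modules < 1:
            raise ValueError("num_modules must be a positive integer")
        self.num_modules = num_modules
        self.hook_calls = 0  # raw number of hook invocations
        self.iteration = 0   # number of completed backward passes


def comm_hook(state, num_topologies):
    # Read the topology first so every call within one backward pass
    # sees the same value (the modulo rule is an illustrative assumption).
    topology = state.iteration % num_topologies
    state.hook_calls += 1
    # Advance the iteration only after the last module of the pass has reported.
    if state.hook_calls % state.num_modules == 0:
        state.iteration += 1
    return topology


state = CountingState(num_modules=4)
seen = [comm_hook(state, num_topologies=2) for _ in range(3 * 4)]
print(state.iteration)  # 3
print(seen)             # [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
```

In this sketch the topology stays fixed within a backward pass and changes only between passes. If the number of hook calls per pass differs from num_modules, the counter drifts, which would explain why the parameter has to be supplied correctly.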

Unit tests covering the change have been added. New experimental results show a general improvement in performance.

Does your PR introduce any breaking changes? If yes, please list them: List of all backwards-incompatible API changes.

Check list:

[ ] Was this discussed and approved via a GitHub issue? (not for typos or docs)
[X] Did you read the contributor guideline?
[X] Did you make sure that your PR does only one thing instead of bundling different changes together?
[X] Did you make sure to update the documentation with your changes? (if necessary)
[X] Did you write any new necessary tests?
[X] Did you verify new and existing tests pass locally with your changes?
[ ] Did you update the CHANGELOG? (not for typos, docs, or minor internal changes)

pytorch / torchdistx

Patch for GossipGraD algorithm #56