
Malte lig 4894 fix gatherlayer #1531

Closed · MalteEbner closed this 2 months ago

MalteEbner commented 2 months ago

closes #1528

Description

Fixes the GatherLayer by using the implementation from solo-learn.
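For context, the solo-learn style GatherLayer is an autograd function that all-gathers the activations in the forward pass and all-reduces the gradients in the backward pass, so each process receives the gradient contributions from every device. Roughly (a sketch, slightly simplified from solo-learn):

```python
import torch
import torch.distributed as dist


class GatherLayer(torch.autograd.Function):
    """All-gather with a gradient path across processes."""

    @staticmethod
    def forward(ctx, x: torch.Tensor):
        # Collect the tensor from every process; the output is a tuple
        # with one entry per device.
        output = [torch.empty_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(output, x)
        return tuple(output)

    @staticmethod
    def backward(ctx, *grads):
        # Sum the incoming gradients over all processes so that each
        # device sees the full gradient, then keep only the local slice.
        all_gradients = torch.stack(grads)
        dist.all_reduce(all_gradients)
        return all_gradients[dist.get_rank()]
```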

Tests

Adds a test for the GatherLayer by training a model with the NTXentLoss criterion. It compares the training behaviour of the following two cases and ensures that they are exactly the same:

  1. n_devices=1, batch_size=8
  2. n_devices=2, batch_size=4

The test is in the new file test_dist__gather.py. A separate file is needed because using a DDPStrategy causes the whole file to be executed once per device. Before the fix, the test failed.
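Roughly, the test idea looks like the following sketch (the model, data, and hyperparameters here are illustrative, not the exact test code; the real test runs the whole file once per device under DDPStrategy and then compares the trained weights):

```python
import torch
import pytorch_lightning as pl
from torch import nn
from lightly.loss import NTXentLoss


class _Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        torch.manual_seed(0)  # identical initial weights on every device
        self.layer = nn.Linear(8, 4, bias=False)
        # gather_distributed=True routes the loss through GatherLayer.
        self.criterion = NTXentLoss(gather_distributed=True)

    def training_step(self, batch, batch_idx):
        (x0, x1), _ = batch
        return self.criterion(self.layer(x0), self.layer(x1))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def train(n_devices: int, batch_size: int) -> torch.Tensor:
    # Deterministic dataset: no augmentations, no shuffling.
    xs = torch.arange(8 * 8, dtype=torch.float32).reshape(8, 8)
    dataset = [((x, x), 0) for x in xs]
    model = _Model()
    trainer = pl.Trainer(
        devices=n_devices,
        accelerator="cpu",
        strategy="ddp",
        max_epochs=1,
    )
    trainer.fit(
        model,
        torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=False),
    )
    return model.layer.weight.detach()


# Run once with n_devices=1, batch_size=8 and once with n_devices=2,
# batch_size=4; after training, the weights must match exactly.
```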

Next tests

This test only covers the NTXentLoss criterion; the other models need to be tested as well.

Testing the full SimCLR model

I also tried to write a similar test for a full SimCLR model. However, it is extremely hard to get exactly the same training behaviour in that setting.

Randomness causes different behaviour between n_devices=2 and n_devices=1

Results:

Using the SimCLR transform leads to different behaviour with n_devices=2 compared to a single device. Even seeding does not help: each device processes a different number of samples, so the random number generators end up in different states and produce different augmentations.

Thus, only removing the randomness entirely makes the output of the dataloader the same for the n_devices=2 and n_devices=1 cases.

The same problem also applies to any randomness in the model itself, e.g. in dropout layers.
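One way to sidestep this in a test is to make both views fully deterministic instead of using the SimCLR transform; likewise, dropout layers can be swapped for nn.Identity. A hypothetical helper along these lines (not part of lightly):

```python
import torchvision.transforms as T

# Deterministic replacement for the SimCLR transform: no random crops,
# flips, or color jitter, so the batches seen by 1 device with
# batch_size=8 match the union of the batches seen by 2 devices with
# batch_size=4.
deterministic_transform = T.Compose([
    T.Resize((32, 32)),
    T.ToTensor(),
])


def two_views(image):
    view = deterministic_transform(image)
    return view, view.clone()
```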

Batch normalization causes different behaviour between n_devices=2 and n_devices=1

Batch normalization, or any other operation that uses information from other samples in the same batch, behaves differently with n_devices=2 & batch_size=4 than with n_devices=1 & batch_size=8. The batch normalization would need to be synchronized across devices as well for the comparison to work. As pointed out by Guarin, we could use SyncBatchNorm for this: https://lightning.ai/docs/pytorch/stable/common/trainer.html#sync-batchnorm
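In Lightning this is a single Trainer flag, which converts all BatchNorm layers to torch.nn.SyncBatchNorm under DDP:

```python
from pytorch_lightning import Trainer

trainer = Trainer(
    devices=2,
    accelerator="gpu",
    strategy="ddp",
    sync_batchnorm=True,  # BatchNorm statistics computed over the global batch
)
```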

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 81.96%. Comparing base (ec9f620) to head (64ff90d).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1531      +/-   ##
==========================================
+ Coverage   81.76%   81.96%   +0.20%
==========================================
  Files         144      144
  Lines        6092     6094       +2
==========================================
+ Hits         4981     4995      +14
+ Misses       1111     1099      -12
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.