pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Improve testing for models #1234

Open fmassa opened 5 years ago

fmassa commented 5 years ago

We currently have very limited testing for the models in torchvision.

In most cases, we only test that the output shape is correct, see https://github.com/pytorch/vision/blob/8635be94d1216f10fb8302da89233bd86445e449/test/test_models.py#L29-L59. While this allows us to catch bugs such as Py2-Py3 differences, it is not enough to ensure that refactorings of the models keep the behavior the same (and that pre-trained weights are still valid).
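For reference, the linked check is roughly of this shape (a paraphrased sketch rather than the exact test code; the helper name is made up):

import torch
import torchvision

def check_classification_output_shape(name, num_classes=50):
    # roughly what test_models.py does today: build the model in eval
    # mode, run a random input through it, and check only the output shape
    model = torchvision.models.__dict__[name](num_classes=num_classes)
    model.eval()
    x = torch.rand(1, 3, 224, 224)
    out = model(x)
    assert out.shape[-1] == num_classes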

We should come up with a better testing strategy.

Ideally, we would not want to download the pre-trained weights, because this can make the tests prone to failure due to IO issues.

One possible approach would be to fix the random seed and initialize the models in a particular way, then compare the output we get for a particular input with an expected output, ensuring that the expected output is non-trivial (i.e., not all-zeros or empty, which could happen due to ReLU or for detection models). Something along the lines of:

import torch
import torchvision

# seed so that the default weight initialization is deterministic
torch.manual_seed(42)
m = torchvision.models.resnet18(num_classes=2)
m.eval()  # eval mode so batch norm uses its (fixed) running statistics
# re-seed so the input is deterministic as well
torch.manual_seed(42)
i = torch.rand(2, 3, 224, 224)
output = m(i)
expected_output = torch.tensor([[0.0002, 0.3], [0.245, 0.001]])
# bitwise equality is too strict for floats; compare within a tolerance
assert torch.allclose(output, expected_output, atol=1e-4)

The problem with this approach is that we do not guarantee that the RNG is the same between different versions of PyTorch, which means that this way of testing the model would lead to failures if we change PyTorch's RNG.

It will be particularly important to address this soon, because we will be adding some slight modifications to the model implementations in order for them to be traceable / ONNX-exportable. @lara-hdr is currently looking into making the torchvision models ONNX-ready.

cc @fbbradheintz , who showed interest in addressing this issue.

fbbradheintz commented 5 years ago

I'm on it, @fmassa - I'll post a comment here with a plan.

fbbradheintz commented 5 years ago

My current plan:

- Seed the PRNG and construct the model with its default constructor, so the initialization is deterministic.
- Generate a fixed input tensor from the same seeded PRNG.
- Run a forward pass and compare the result against a stored expected output, checking that the expected output is non-trivial.

This process will be repeated for all models covered by the test, probably in a subclass that specializes the initialization, input, and output based on the needs of the model.

One possible pitfall is PRNGs giving different results for the same seed on different platforms/stdlib versions. I'll try to test a cross-section of build environments to see whether this is the case. If it is, the fallback will probably be some other algorithmic generation of weights that provides a reasonable distribution of values (perhaps based on the statistical distribution of values in the pretrained versions of the models).
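If the platform PRNGs do turn out to disagree, one way to sidestep them entirely would be to fill the parameters from a small hand-rolled generator, so the values do not depend on any library RNG. A minimal sketch (nothing like this exists in the test suite yet; the names and constants are illustrative):

import torch

def lcg(seed):
    # minimal linear congruential generator (glibc-style constants);
    # yields floats in [0, 1) independently of torch's RNG
    state = seed
    while True:
        state = (1103515245 * state + 12345) % (2 ** 31)
        yield state / (2 ** 31)

def fill_params_deterministically(model, seed=42):
    gen = lcg(seed)
    for p in model.parameters():
        # roughly zero-centered values with a small spread
        values = [0.1 * (next(gen) - 0.5) for _ in range(p.numel())]
        with torch.no_grad():
            p.copy_(torch.tensor(values).reshape(p.shape))

This trades a realistic weight distribution for stability; the distribution could later be tuned to match the statistics of the pretrained weights, as suggested above.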

fmassa commented 5 years ago

@fbbradheintz This sounds good to me!

The PRNGs are going to be different between CPU/GPU for sure, but I'm not sure if it will be different depending on the platform -- to be tested.

This will at least be a good starting point, and if we make the model initialization its own method in a tester class, then we can change the initialization of all models with a single function change.
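A minimal sketch of what that factoring might look like (the class and method names here are hypothetical, not existing test code):

import torch
import torchvision

class ModelTester:
    # single place to change how every tested model is initialized
    def _init_model(self, model_fn, seed=42, **kwargs):
        torch.manual_seed(seed)
        return model_fn(**kwargs).eval()

    def _get_input(self, shape=(2, 3, 224, 224), seed=42):
        torch.manual_seed(seed)
        return torch.rand(shape)

    def check(self, model_fn, expected_output, **kwargs):
        output = self._init_model(model_fn, **kwargs)(self._get_input())
        assert torch.allclose(output, expected_output, atol=1e-4)

A detection or segmentation subclass would then only override _init_model and _get_input.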

SebastienEske commented 5 years ago

Just an idea, but if you could select the RNG in torch.random, and maybe have a basic one for testing purposes, that would solve the problem, no? Since it's for testing, it doesn't even need to be super random.

fmassa commented 5 years ago

@SebastienEske yes, that's an idea that could work. It might be a bit more complicated to make it work for detection models (which have lots of conditional and dynamic blocks) though, as we would want the model to output a non-trivial solution (like one with some non-empty boxes).
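For a detection model, the non-triviality check might look something like this (a sketch; with randomly initialized weights the assertion can easily fail, which is exactly why the initialization needs to be chosen carefully):

import torch
import torchvision

# pretrained_backbone=False avoids downloading weights in tests
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    pretrained=False, pretrained_backbone=False, num_classes=2
)
model.eval()
torch.manual_seed(42)
x = [torch.rand(3, 300, 400)]
out = model(x)
# an empty prediction would make the comparison against an
# expected output meaningless, so require at least one box
assert out[0]["boxes"].shape[0] > 0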

pmeier commented 5 years ago

Playing the devil's advocate here:

Have you considered the non-determinism that some functions in the models might exhibit? These effects can occur even when the same action is performed twice in a row on the same system, with nothing changing in between. Thus, simply testing for the same output could be flaky.

For more information on the non-determinism see this talk. It is mainly centered around TensorFlow, but there is a small segment starting at 32:31 about PyTorch.

The author states:

I hear that some ops may still be non-deterministic

which led me to try to confirm this. Ultimately, I "reproduced" the non-determinism and asked a question about it in the official forum. In short:

You cannot eliminate non-determinism even if you follow the official guidelines for reproducibility.
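For reference, the guidelines in question amount to something like the following, and even with all of it in place some ops remain non-deterministic:

import random
import numpy as np
import torch

def make_reproducible(seed=42):
    # the usual reproducibility settings: seed every PRNG in sight and
    # force cuDNN to pick deterministic algorithms; this reduces, but
    # does not eliminate, non-determinism
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False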

fmassa commented 5 years ago

The non-determinism is a real complication, and I believe @fbbradheintz has run into it with a few models.

fbbradheintz commented 5 years ago

I've seen it with the Inception v3 model for sure. The rest seem well behaved when using default constructors and a seeded PRNG.

Once I have the deterministic models sorted out, I'll circle back to Inception and any others that need attention.

t-vi commented 5 years ago

Just in case: for the forward pass (which seems most relevant for tracing), nll_loss2d and scatter_* are the prime suspects for non-determinism.

fbbradheintz commented 5 years ago

Just to follow up on the non-determinism thread: The culprit was the numpy PRNG. This is sorted out in the draft PR.
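If numpy is used to generate test inputs or weights, one way to keep it isolated (a sketch, not necessarily what the draft PR does) is a local RandomState rather than the global numpy state:

import numpy as np
import torch

# a local RandomState keeps the test independent of global numpy seeding
rng = np.random.RandomState(42)
x = torch.from_numpy(rng.rand(2, 3, 224, 224).astype("float32"))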

pmeier commented 5 years ago

#1424 for cross reference.

fbbradheintz commented 4 years ago

@fmassa - my last PR covered the classification models. Do we want to handle video, detection, and segmentation as well before we call this complete?

fmassa commented 4 years ago

@fbbradheintz yes, we would like to have all models covered by correctness checks if possible.

fbbradheintz commented 4 years ago

@fmassa - There are currently multiple test failures (mostly related to scriptability) for segmentation, detection and video models. I'll watch for those to clear up before I start building on them (unless you have a suggestion for a branch besides master to build on).

There are some other passing tests that still need correctness tests added; I'll address those in the meantime.

fmassa commented 4 years ago

@fbbradheintz CI should be green now