tensorflow / swift-models

Models and examples built with Swift for TensorFlow
Apache License 2.0

Add unit tests for models #201

Open rxwei opened 5 years ago

Shashi456 commented 5 years ago

@rxwei what do we check for in models when it comes to unit tests?

rxwei commented 5 years ago

I don't have the best guidelines here, but we should check for convergence for very trivial models (which we probably don't have anymore) and test layer gradients against a Python implementation (with some numerical tolerance).
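A minimal sketch of what the gradient check could look like, assuming a `Dense` layer and a hypothetical XCTest case in which the reference gradients were produced by an equivalent Python (e.g. `tf.GradientTape`) computation:

```swift
import TensorFlow
import XCTest

final class DenseGradientTests: XCTestCase {
    func testDenseGradientMatchesPythonReference() {
        // Fix the parameters so the Swift and Python computations are identical.
        let layer = Dense<Float>(
            weight: Tensor<Float>([[0.5, -0.25], [1.0, 0.75]]),
            bias: Tensor<Float>([0.1, -0.1]),
            activation: identity)
        let input = Tensor<Float>([[1.0, 2.0]])

        // Differentiate a scalar loss with respect to the layer's parameters.
        let grads = gradient(at: layer) { layer -> Tensor<Float> in
            layer(input).sum()
        }

        // Reference values recorded from the matching Python computation
        // (illustrative numbers; d(sum)/dW[i, j] == input[i] for this setup).
        let expectedWeightGradient: [Float] = [1.0, 1.0, 2.0, 2.0]
        for (actual, expected) in zip(grads.weight.scalars, expectedWeightGradient) {
            XCTAssertEqual(actual, expected, accuracy: 1e-5)
        }
    }
}
```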

BradLarson commented 5 years ago

I described some thoughts about this in the design review here, so I'll pull out the bullet points I'd listed there:

- What we can do in the training cases might be limited by how long we'll allow presubmits to take, but at a minimum we'll want the inference tests in there for a variety of models.

- Pull request #198 is a start on the inference tests, initially just running random tensors through the models at the correct input size to verify that the models at least run and produce correctly shaped output tensors (a sketch of that kind of check follows this list).

- A next step after those would be to verify inference accuracy on pretrained weights, using known weight snapshots and small but still useful validation sets. I've started with image classification as an easy initial case, but these could then be built out in a similar manner for other models.
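For reference, a shape-only smoke test along the lines of #198 might look roughly like the following, assuming the LeNet model from ImageClassificationModels with its default initializer:

```swift
import ImageClassificationModels
import TensorFlow
import XCTest

final class LeNetSmokeTests: XCTestCase {
    func testLeNetOutputShape() {
        let model = LeNet()
        // A random batch of one MNIST-sized input: [batch, height, width, channels].
        let input = Tensor<Float>(randomNormal: [1, 28, 28, 1])
        let output = model(input)
        // LeNet classifies into 10 categories, so the logits should be [batch, 10].
        XCTAssertEqual(output.shape, TensorShape([1, 10]))
    }
}
```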

Shashi456 commented 4 years ago

@BradLarson Now that we have very basic tests for the image models, how do you want to benchmark inference accuracy and inference speed? Do you want to record the results on GitHub, or should they be run locally and updated on this thread? In my opinion, adding accuracy tests might make the basic build very long. We could either split the build into multiple stages or only run those tests occasionally.

BradLarson commented 4 years ago

@Shashi456 - For performance benchmarks, we are building out public and internal infrastructure that will run those independently of the normal unit tests. The Benchmarks target is one aspect of this, and we've started testing out the performance measurements it returns.

When it comes to remaining tasks for unit testing models, verifying inference accuracy of image classification models is at the top of my list. Now that we have a general checkpoint loader system, it would be nice to set up unit tests that load a pretrained checkpoint for a model, perform inference against a validation set from the dataset it was trained against, and verify that accuracies match those of the original trained model (even if the model was trained in a different framework).

How to do this in a practical manner with these models is an open question. For classification models trained against CIFAR-10 or MNIST, the datasets are small enough to run against the full validation set and gather detailed inference accuracies pretty quickly. The more common ImageNet-trained models might require a specially crafted validation set that's a smaller subset of the original ImageNet validation set (which is ~7 GB in size and impractical to download and run against in a unit test). We might even be able to preprocess the images to shrink them to a known target size to make for a smaller download of this validation set.

For now, we could start with the MNIST- and CIFAR-trained models, building initial validation unit tests for those that compare against checkpoints from models trained using Python TensorFlow, to make sure the concept works. If it does, we can evaluate how to expand that to ImageNet-sized classification networks and others.
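A rough sketch of what one of those MNIST validation tests could look like, assuming the MNIST wrapper from the Datasets module (iterated the way the LeNet-MNIST example does) and a hypothetical `loadPretrainedLeNet()` helper built on top of the checkpoint loader; the 0.98 reference accuracy is purely illustrative:

```swift
import Datasets
import ImageClassificationModels
import TensorFlow
import XCTest

/// Hypothetical helper: restore Python-trained LeNet weights via the checkpoint loader.
func loadPretrainedLeNet() throws -> LeNet {
    // Placeholder; the real version would read a known weight snapshot.
    fatalError("wire this up to the checkpoint loader")
}

final class PretrainedLeNetValidationTests: XCTestCase {
    func testValidationAccuracyMatchesReference() throws {
        let dataset = MNIST(batchSize: 128)
        let model = try loadPretrainedLeNet()

        var correctGuessCount = 0
        var totalGuessCount = 0
        for batch in dataset.validation {
            let logits = model(batch.data)
            let correctPredictions = logits.argmax(squeezingAxis: 1) .== batch.label
            correctGuessCount += Int(Tensor<Int32>(correctPredictions).sum().scalarized())
            totalGuessCount += batch.data.shape[0]
        }
        let accuracy = Float(correctGuessCount) / Float(totalGuessCount)

        // Accuracy recorded from the original trained model (illustrative value).
        XCTAssertEqual(accuracy, 0.98, accuracy: 0.005)
    }
}
```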