mlcommons / inference_policies

Issues related to MLPerf™ Inference policies, including rules and suggested changes
https://mlcommons.org/en/groups/inference/
Apache License 2.0

Does end_on_device make sense? #213

Open DilipSequeira opened 3 years ago

DilipSequeira commented 3 years ago

The rationale for start_from_device is that submissions should not need to incur the overhead of transfer from system DRAM if there is a mechanism whereby network inputs can be delivered directly into accelerator memory.

Is end_on_device symmetric in this regard - i.e., submitters should not have to incur the overhead of a transfer to system DRAM if the accelerator has the equivalent outbound capability?

@tjablin opinion?

tjablin commented 3 years ago

Thinking about real applications, start_on_device makes sense because you can imagine streaming images, text, or other inputs directly from the network in a real application. There's some subtlety in that a real application would probably stream compressed images, and decompression might have to use the CPU. For end_on_device to make sense, there would have to be real applications where the output of an inference streams directly out to the network, but for most real applications, inference is not the last step in the pipeline before sending data back to a user.

DilipSequeira commented 3 years ago

I agree inference is rarely the last pipeline step. However, if your accelerator is a general-purpose programmable device, it's realistic for it to run post-processing too - for example, in the current proposal for 3D-UNet, where overlapping 128x128x128 tiles are recombined into a full segmented image, that recombination would best be done on the accelerator. (This is mainly an issue for segmentation workloads, since those are the ones with large data outputs.) And then that combined image might indeed go straight to the network.
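
For concreteness, here is a minimal numpy sketch of the tiling being discussed, assuming (as in the proposal) 128x128x128 windows sliding with a stride of 64 over a volume whose edges are already padded to multiples of 64; the array names and sizes are illustrative, not the reference code.

```python
import numpy as np

def extract_tiles(volume, tile=128, stride=64):
    """Yield overlapping tile coordinates and patches from a padded 3D volume.

    Assumes every edge of `volume` has already been padded/cropped to a
    multiple of the stride and is at least `tile` long.
    """
    D, H, W = volume.shape
    for z in range(0, D - tile + 1, stride):
        for y in range(0, H - tile + 1, stride):
            for x in range(0, W - tile + 1, stride):
                yield (z, y, x), volume[z:z+tile, y:y+tile, x:x+tile]

# Illustrative only: a 256^3 volume yields 3*3*3 = 27 overlapping tiles.
vol = np.zeros((256, 256, 256), dtype=np.float32)
print(len(list(extract_tiles(vol))))  # 27
```

The 50% overlap is what makes the recombination non-trivial: every voxel is predicted several times, and those predictions have to be blended back into one full-volume output.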

tjablin commented 3 years ago

For 3D-UNet, didn't we agree that the server case made no sense, and that's why it is offline only? I think the start_on_device rule is getting unwieldy to enforce. We should just move to injecting queries over the network; then submitters that implement NIC-to-accelerator DMA will be able to measure the benefit directly.

DilipSequeira commented 3 years ago

The timeline for getting that into 1.1 seems quite short, given there's no proposal yet.

DilipSequeira commented 3 years ago

And regarding 3D-UNet not being in server... that's correct, but latency is still relevant for 3D-UNet in Edge Single Stream.

tjablin commented 3 years ago

Is this issue 3D-UNet specific?

DilipSequeira commented 3 years ago

It's significant only for benchmarks where the output size is large. Today, that's only segmentation.
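
To make "large" concrete, a rough back-of-the-envelope comparison (every number below is an assumption chosen for illustration, not a benchmark-defined value): a classifier returns a few kilobytes of scores, while a segmentation network returns per-voxel logits, so the device-to-host copy of the output can become a visible slice of a Single Stream latency.

```python
# Illustrative arithmetic only: the volume size, class count, dtype, and
# effective PCIe bandwidth are all assumptions, not MLPerf-defined values.
classifier_bytes = 1000 * 4          # 1000 fp32 class scores  ~= 4 KB
seg_bytes = 256**3 * 3 * 2           # 256^3 voxels, 3 classes, fp16 ~= 96 MiB
pcie_eff_bw = 12e9                   # ~12 GB/s effective PCIe Gen3 x16

print(f"classifier output copy: {classifier_bytes / pcie_eff_bw * 1e6:.2f} us")
print(f"segmentation output copy: {seg_bytes / pcie_eff_bw * 1e3:.1f} ms")
# ~0.33 us vs ~8.4 ms: only the large-output (segmentation) case is material.
```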

tjablin commented 3 years ago

Can we get an opinion from the medical advisory board?

DilipSequeira commented 3 years ago

I'm sure we can, but what are we looking for, and how would we act on it?

MLPerf has, historically, set some fairly arbitrary bounds on the timed portion of the benchmark. One thing we could meaningfully ask is for them to suggest what should be timed, and then address the question "what does the post-processing after the timed portion look like for this model?"

Then there are three cases:

  1. there is no post-processing; the answer goes straight to the network or storage
  2. there is post-processing that cannot reasonably be done on the accelerator
  3. there is, at least sometimes, post-processing that can be done on the accelerator

In case (1), end_on_device can use the same rules as start_from_device. Case (2) is straightforwardly "no". Case (3), which I expect will be true in at least some use cases, requires us to make rules to determine whether an accelerator can do the post-processing. Given that the biggest difficulty we struggled with in 1.0 was the tension between rules that are simple to arbitrate and rules that don't force costs on submitters that they wouldn't incur in production, this doesn't seem like it will help.

How else could we frame the question to the board?

tjablin commented 3 years ago

"requires us to make rules to determine whether an accelerator can do the post-processing"

Do we need to make a rule? We should just add the post-processing to the benchmark. Then submitters can implement the post-processing on the host or device depending on the capabilities of their system.
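
As a sketch of what that could look like (a hypothetical harness structure, not the MLPerf reference implementation or LoadGen API): the benchmark defines post-processing as part of the timed work, and the harness accepts either a host-side or a device-backed implementation of the same contract.

```python
# Hypothetical harness sketch: post_process() is part of the timed benchmark,
# and a submitter plugs in either a host (numpy) or an accelerator-backed
# callable with the same signature. Names here are illustrative only.
import numpy as np

def host_argmax(logits: np.ndarray) -> np.ndarray:
    """Host-side post-processing: per-voxel argmax over the class axis."""
    return np.argmax(logits, axis=0).astype(np.uint8)

class Benchmark:
    def __init__(self, infer, post_process=host_argmax):
        # Both infer() and post_process() are timed; a submitter whose
        # accelerator can run the post-processing passes a device-backed
        # callable here instead of the host default.
        self.infer = infer
        self.post_process = post_process

    def run(self, sample: np.ndarray) -> np.ndarray:
        return self.post_process(self.infer(sample))

# Illustrative use with a dummy "model".
bench = Benchmark(infer=lambda x: np.stack([x, 1 - x, x * 0]))
labels = bench.run(np.random.rand(128, 128, 128).astype(np.float32))
print(labels.shape, labels.dtype)  # (128, 128, 128) uint8
```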

DilipSequeira commented 3 years ago

That would be my preference regardless of this question. If we do that, does that mean we should assume the answer is (1) above?

tjablin commented 3 years ago

I think we should ask the Medical Advisory Board:

  1. What does the post-processing after the timed portion look like for this model?
  2. What typically happens to the inference results for 3D-UNet? Are they sent to the screen, network, storage, or somewhere else?

My current thinking is that end_on_host is probably appropriate for 3D-UNet if we add timed post-processing, but I would like confirmation from an expert. Unlike most of the other applications in MLPerf, there's not a good 3D-UNet analogue at Google, so I am very reluctant to make changes without consulting an expert.

alexkarargyris commented 3 years ago

For clarification purposes, I want to share here (thanks @PabloRibalta) what the current reference implementation for 3D-UNet in the Training Benchmark does:

  1. Get a scan (3D image) at a certain resolution
  2. Resample to a common voxel spacing
  3. Pad every volume so that each edge is equal to or larger than 128
  4. Crop volumes so that each edge is divisible by 64
  5. If a given edge length modulo 64 is larger than 32, the edge is constant-padded; if it is less than 32, it is cropped
  6. Split volumes: a 128x128x128 window slides over the pre-processed volume with a stride of 64, i.e. an overlap of 0.5
  7. Predict
  8. Stitch and produce the final segmentation (sketched below):
     a. Each result is multiplied by a patch normalizing matrix - a Gaussian kernel
     b. The weighted results are stacked by adding them together
     c. A global normalizing matrix is obtained by stacking the patch normalizing matrices
     d. At the end, the result is divided by the global normalizing matrix
     e. To obtain the final labels, an argmax is used
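
A hedged numpy sketch of the step 8 stitching (variable names and the particular Gaussian construction are illustrative, not the reference code): each 128x128x128 prediction is weighted by a Gaussian patch kernel, overlap-added into a full-volume accumulator, the accumulator is divided by the summed kernels, and the final labels come from an argmax over classes.

```python
import numpy as np

def gaussian_kernel(tile=128, sigma_scale=0.125):
    """Separable 3D Gaussian patch weighting (illustrative sigma choice)."""
    coords = np.arange(tile) - (tile - 1) / 2.0
    g = np.exp(-(coords**2) / (2 * (tile * sigma_scale) ** 2))
    k = g[:, None, None] * g[None, :, None] * g[None, None, :]
    return (k / k.max()).astype(np.float32)

def stitch(tile_logits, coords, out_shape, n_classes, tile=128):
    """Overlap-add Gaussian-weighted tile predictions, normalize, argmax.

    Assumes the tile positions fully cover `out_shape`, as produced by the
    stride-64 sliding window in step 6.
    """
    acc = np.zeros((n_classes, *out_shape), dtype=np.float32)     # step 8b accumulator
    norm = np.zeros(out_shape, dtype=np.float32)                  # step 8c accumulator
    kernel = gaussian_kernel(tile)
    for logits, (z, y, x) in zip(tile_logits, coords):
        acc[:, z:z+tile, y:y+tile, x:x+tile] += logits * kernel   # steps 8a + 8b
        norm[z:z+tile, y:y+tile, x:x+tile] += kernel              # step 8c
    acc /= norm                                                   # step 8d
    return np.argmax(acc, axis=0)                                 # step 8e
```

Everything here is elementwise multiply-accumulate plus an argmax over the full volume, which is why this step can plausibly run on an accelerator as well as on a host CPU.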

Notes:

We believe that steps 1-5 should not count against benchmark time, and that steps 6-8 could be left to the submitters to optimize (e.g. change the tile size) as long as they meet or exceed the expected accuracy. Indeed, step 8 can probably take place on an accelerator. However, it is my understanding that the Inference closed division doesn't allow hyperparameter changes, so submitters have to go with 128x128x128. Is this true?

@tjablin the resulting stitched output may be displayed on the screen, sent over the network, or stored for later viewing.

DilipSequeira commented 3 years ago

The hyperparameter question is somewhat off-topic here: I've opened a new issue https://github.com/mlcommons/inference_policies/issues/216