Latency tips - GitHub Issues

RubenTeunisse commented 2 years ago

Hi! I'm trying to use your mouse model for a real-time application. Unfortunately, the prediction call is taking about 20ms for me, while in the paper you report that it can be done under 10ms. I am starting and stopping the timing right around the call, like:

    start = time.time()
    output = self.das_model.predict_on_batch(input)
    print("    DAS single forward pass: ", time.time() - start)

Do you have any tips on how I could increase the inference speed?

Thanks! Ruben

Details:
input shape: ((1,8092,1),)
CPU: 12th Gen Intel(R) Core(TM) i9-12900K
RAM: 128GB
GPU: GeForce RTX 3090

postpop commented 2 years ago

Hi! I re-ran the notebook that generated the latency measurements for the paper and I still get latencies around 8 ms.

This is the code I use to benchmark the models:

import time
import das.models  # provides model_dict, used when loading the model below
import das.utils
import numpy as np
import os

USE_GPU = False  # Test with or w/o GPU
os.environ['CUDA_VISIBLE_DEVICES'] = '0' if USE_GPU else '-1'
try:
    import tensorflow as tf
    from tensorflow.python.framework.ops import disable_eager_execution
    disable_eager_execution()  # run in graph mode; see the tips below
    physical_devices = tf.config.list_physical_devices('GPU')
    print(physical_devices)
    USE_GPU = len(physical_devices) > 0
except Exception:  # e.g. TensorFlow is not installed
    pass

save_name = 'PATH_TO_THE_MODEL'
model = das.utils.load_model(save_name, model_dict=das.models.model_dict)

for ii in range(10):
    x = np.random.random((1, 8192, 1))
    t0 = time.time()
    model.predict(x)
    elapsed = time.time() - t0
    print(elapsed)  # latency of this call, in seconds

A couple of tips:

Make sure to disable eager execution via disable_eager_execution() - this reduces latency from ~30ms to ~8ms.
Latency is similar for GPU and CPU.
The first iteration always takes longer. This cold-start effect is largest for the first prediction after loading the model.
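
To keep the cold start out of the numbers, one option is to run a few untimed warm-up calls before measuring. The following is only a minimal sketch, not part of das: `benchmark`, `predict`, `make_input`, `n_warmup` and `n_runs` are hypothetical names, and the dummy workload at the bottom stands in for a real call such as `model.predict` on a (1, 8192, 1) array.

```python
import statistics
import time


def benchmark(predict, make_input, n_warmup=3, n_runs=10):
    """Time predict(make_input()) and return latencies in milliseconds.

    `predict` and `make_input` are hypothetical stand-ins for a model's
    prediction function and a function that builds one input batch.
    """
    # Warm-up: the first calls after loading a model are slower (cold start),
    # so run them without recording their timings.
    for _ in range(n_warmup):
        predict(make_input())

    latencies_ms = []
    for _ in range(n_runs):
        x = make_input()  # build the input outside the timed region
        t0 = time.perf_counter()
        predict(x)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    return latencies_ms


if __name__ == "__main__":
    # Dummy workload standing in for a real model call (assumption).
    timings = benchmark(lambda x: sum(x), lambda: list(range(10_000)))
    print(f"median: {statistics.median(timings):.3f} ms, "
          f"max: {max(timings):.3f} ms")
```

Reporting the median rather than a single call makes the result less sensitive to occasional slow iterations.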

RubenTeunisse commented 2 years ago

Fantastic, disable_eager_execution() did wonders! And thanks for the quick reply!

janclemenslab / das

Latency tips #45