sbucaille / phd_template

Template to create a training pipeline using the PyTorch Lightning, Hydra and DVC frameworks

Suggestions for improvement #1

Open Nunah opened 1 year ago

Nunah commented 1 year ago

Hi, great template, thanks for sharing. I also find these tools to be awesome. I'm looking forward to the Medium article.

You should add DeepSparse to improve CPU inference.

https://github.com/neuralmagic/deepsparse

pip install deepsparse

from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs

onnx_filepath = "model.onnx"
batch_size = 1

# Generate random inputs matching the model's expected input shapes
inputs = generate_random_inputs(onnx_filepath, batch_size)

# Compile the ONNX model into a DeepSparse engine and run inference on CPU
engine = compile_model(onnx_filepath, batch_size)
outputs = engine.run(inputs)

It would be interesting to have, along with the template, a working example with the associated results on a small dataset (e.g., Caltech-256). Maybe that is planned with the Medium article? The results (e.g., performance, inference time, memory) could be based on modern standard architectures (e.g., ConvNeXt, ViT) and the different tools (i.e., PyTorch, ONNX, TensorRT, DeepSparse).

Do you also plan to give some tips and numbers on how to train a model faster with these tools (e.g., mixed precision training, torch.compile)?
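For example, both are essentially one-line changes in a Lightning setup (a rough sketch; MyLightningModule is a placeholder for the template's module, and the precision flag spelling depends on the Lightning version):

import torch
import pytorch_lightning as pl

model = MyLightningModule()  # placeholder for the template's LightningModule

# torch.compile (PyTorch >= 2.0) JIT-compiles the module's forward pass
model = torch.compile(model)

# "16-mixed" enables automatic mixed precision on recent Lightning versions
# (older versions use precision=16)
trainer = pl.Trainer(precision="16-mixed", max_epochs=10)
trainer.fit(model)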

I think that all of this would make your template even more valuable.

Nunah commented 1 year ago

Other suggestions:

I have some issues with the benchmark utility from PyTorch; I think a warm-up is needed. Also, cuda-python might be another way to do it:

pip install pypiwin32  # Windows only
pip install cuda-python==11.8.0
pip install tqdm

import time
from cuda import cuda
from tqdm import tqdm

# Initialize the CUDA driver API, create a context on device 0 and a stream
(result,) = cuda.cuInit(0)
result, device = cuda.cuDeviceGet(0)
result, context = cuda.cuCtxCreate(0, device)  # or cuCtxCreate_v2 or cuCtxCreate_v3
result, free_mem, total_mem = cuda.cuMemGetInfo()  # or cuMemGetInfo_v2 or cuMemGetInfo_v3
print("Total Memory: %ld MiB" % (total_mem / 1024**2))
print("Free Memory: %ld MiB" % (free_mem / 1024**2))
result, stream = cuda.cuStreamCreate(0)

batch_size = 32
is_cuda = True

runs = 100
warm_up = int(runs * 0.25)
total = 0
start = time.time()

for i in tqdm(range(runs), desc="Benchmarking"):
    # After the warm-up iterations, synchronize and reset the counters so
    # the warm-up runs are excluded from the measurement
    if i == warm_up:
        if is_cuda:
            (result,) = cuda.cuStreamSynchronize(stream)  # or cuda.cuCtxSynchronize() or cudart.cudaDeviceSynchronize()
        total = 0
        start = time.time()

    # model inference code
    total += batch_size

# Wait for all queued GPU work to finish before stopping the clock
if is_cuda:
    (result,) = cuda.cuStreamSynchronize(stream)

end = time.time()
elapsed = end - start

throughput = total / elapsed

print(f"Throughput: {throughput:.2f} im/s")

# Release the stream and the context
(result,) = cuda.cuStreamDestroy(stream)
(result,) = cuda.cuCtxDestroy(context)
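The same warm-up pattern also works with plain PyTorch, without cuda-python (a rough sketch; model and batch are placeholders for the network and a CUDA input tensor):

import time
import torch

runs = 100
warm_up = int(runs * 0.25)
batch_size = 32

# model and batch are placeholders for the network and a CUDA input tensor
total = 0
start = time.time()
for i in range(runs):
    if i == warm_up:
        # exclude the warm-up iterations from the measurement
        torch.cuda.synchronize()
        total = 0
        start = time.time()
    with torch.no_grad():
        model(batch)
    total += batch_size

torch.cuda.synchronize()
print(f"Throughput: {total / (time.time() - start):.2f} im/s")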

sbucaille commented 1 year ago

Hi @Nunah, thanks for the suggestions. First, I can't implement DeepSparse right now, as I don't have the hardware to experiment with it. Second, I investigated the benchmark loop: it seems the Timer acts just like a loop with a start and end time. I added a CUDA stream with warm-up instead of using a torch-generated CUDA stream, although I couldn't see any improvement in the stability of the measurements. Let me know if you notice a change on your side. Regarding the other suggestions, I'll add them to my TODO list, and I hope I can indeed implement diverse model architectures in this repository. Link to the commit for reference.
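For reference, a minimal torch.utils.benchmark.Timer usage looks like this (a sketch; model and x are placeholders for the module under test and an input tensor):

import torch
import torch.utils.benchmark as benchmark

# model and x are placeholders for the module under test and an input tensor
timer = benchmark.Timer(
    stmt="model(x)",
    globals={"model": model, "x": x},
)

# timeit() performs a brief warm-up before timing `number` calls
measurement = timer.timeit(number=100)
print(f"Mean latency: {measurement.mean * 1e3:.2f} ms")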