Update Caching logic to only trigger on the first inference sample

When the model's caches are already set up, there is no need to call setup_caches again each time a sample is passed in. The repeated call is harmless, but torchtune is noisy (as it should be) when setup_caches is called unnecessarily.

This just adds a check so that setup_caches is only called for the first sample.
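
As a rough illustration of the idea (not the actual torchchat code), the check can be sketched as a guard on the sample index. `ToyModel`, `run_samples`, and the `guard_first_sample` flag are hypothetical names made up for this sketch; only the `setup_caches` name and the "already setup" warning come from the description above.

```python
import warnings


class ToyModel:
    """Hypothetical stand-in for a model that owns key-value caches."""

    def __init__(self):
        self.caches_ready = False
        self.setup_calls = 0

    def setup_caches(self, batch_size, max_seq_len):
        # Mimics the behavior described above: a second call warns and skips.
        self.setup_calls += 1
        if self.caches_ready:
            warnings.warn("Key value caches are already setup. Skipping.")
            return
        self.caches_ready = True


def run_samples(model, num_samples, guard_first_sample=True):
    """Loop over samples, setting up caches only on the first one when guarded."""
    for i in range(num_samples):
        if not guard_first_sample or i == 0:
            model.setup_caches(batch_size=1, max_seq_len=128)
        # ... token generation for sample i would happen here ...
    return model.setup_calls
```

With the guard enabled, `setup_caches` runs once regardless of `--num-samples`; without it, every sample after the first triggers the warning.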

Warnings that no longer appear after this change

Key value caches are already setup. You cannot call ``setup_caches()`` twice. Skipping.
Key value caches are already setup. You cannot call ``setup_caches()`` twice. Skipping.
Key value caches are already setup. You cannot call ``setup_caches()`` twice. Skipping.
Key value caches are already setup. You cannot call ``setup_caches()`` twice. Skipping.

Generation after fix (no warning)

python torchchat.py generate llama3.2-11B --prompt "What's in this image?" --image-prompt assets/dog.jpg --num-samples 2

Note: NumExpr detected 22 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
NumExpr defaulting to 16 threads.
PyTorch version 2.6.0.dev20241002+cu121 available.
lm_eval is not installed, GPTQ may not be usable
Using device=cuda NVIDIA PG509-210
Loading model...
Time to load model: 10.45 seconds
-----------------------------------------------------------
What's in this image?The image features a dog sitting on a skateboard with its tongue out, sporting sunglasses. The dog has a white chest with brown ears and a brown patch of fur between its eyes and nose. It wears a blue collar and red sunglasses. The skateboard is red and yellow, with two yellow wheels on either side, and the dog appears to be sitting on top of it while facing the camera. The background of the image is blurry but seems to feature a paved road lined with green grass and trees.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Generated 99 tokens
Time for inference 1: 17.3737 sec total
Time to first token: 2.5819 sec with parallel prefill.

      Total throughput: 5.7558 tokens/sec, 0.1737 s/token
First token throughput: 0.3873 tokens/sec, 2.5819 s/token
 Next token throughput: 6.6929 tokens/sec, 0.1494 s/token

Bandwidth achieved: 122.55 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================

What's in this image?The image depicts a medium-sized white dog sitting on a red skateboard on an asphalt path. The dog has brown ears and a tan patch over one eye, giving it a slightly inquisitive appearance. Its tongue is protruding slightly from its mouth, which is slightly open, suggesting that the dog may be panting or playing along with the photo.

The dog is wearing red-framed sunglasses with black lenses, an alternative to a pair of goggles, and a blue collar. The skateboard features yellow wheels and has the word "CRAZ" written on the underside. The dog's body is facing forward, but it's looking toward the camera with its head turned slightly to the side, as if posing.

The background of the image shows a green grassy area and a hedge or bush behind it. The overall atmosphere suggests that the dog is enjoying a fun day out, possibly on a sunny day, and is ready to take a ride on its skateboard. The image is likely intended to be
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Generated 199 tokens
Time for inference 2: 32.2208 sec total
Time to first token: 1.3305 sec with parallel prefill.

      Total throughput: 6.2072 tokens/sec, 0.1611 s/token
First token throughput: 0.7516 tokens/sec, 1.3305 s/token
 Next token throughput: 6.4421 tokens/sec, 0.1552 s/token

Bandwidth achieved: 132.16 GB/s

========================================

pytorch / torchchat

Update Caching logic to only trigger on the first inference sample #1369
