Let FFmpeg choose thread_count in single-video benchmarks and set it to 1 in concurrent benchmarks

Set the FFmpeg thread count to 0 for single-video benchmarks, which lets FFmpeg choose the number of threads itself. This should saturate the system.
Set the FFmpeg thread count to 1 for concurrent benchmarks. This should still saturate the system because concurrency is provided at the layer above the decoder.
Rename the "dataloader" benchmarks to "concurrent", since they don't technically use the PyTorch DataLoader.
Print the name of each benchmark just before it runs. This adds only about 10 lines of output and makes it clear which benchmark takes a long time to finish.
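
The thread-count policy and the progress print described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ffmpeg_thread_count`, `run_benchmark`, `fake_decode`, and `max_workers` are hypothetical names, and `fake_decode` stands in for a real decoder call.

```python
from concurrent.futures import ThreadPoolExecutor


def ffmpeg_thread_count(concurrent: bool) -> int:
    # Hypothetical helper: 0 asks FFmpeg to pick the thread count itself,
    # 1 pins the decoder to one thread when parallelism comes from above.
    return 1 if concurrent else 0


def run_benchmark(name, decode_fn, videos, concurrent, max_workers=4):
    # Announce the benchmark first so a slow one is easy to identify.
    print(f"Running benchmark: {name}", flush=True)
    threads = ffmpeg_thread_count(concurrent)
    if concurrent:
        # Concurrency lives at this layer, above the decoder.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(lambda v: decode_fn(v, threads), videos))
    return [decode_fn(v, threads) for v in videos]


def fake_decode(video, thread_count):
    # Placeholder for a real decode call (assumption: the real decoder
    # accepts a thread-count argument in some form).
    return (video, thread_count)
```

For example, `run_benchmark("concurrent", fake_decode, ["a.mp4", "b.mp4"], concurrent=True)` passes a thread count of 1 to every decode call, while `concurrent=False` passes 0.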

pytorch / torchcodec