Closed scotts closed 3 weeks ago
Can you add the cli you used to generate this data too?
Also it's weird to see the core being 2x slower than public for the random seek -- maybe it's because the random pts list is different for every decoder? Or we aren't running the benchmarks for long enough?
Benchmarks were run with:
python benchmarks/decoders/benchmark_decoders.py
Note that I changed the defaults of what benchmarks run when there are no options provided. Let me fix the call to random, and then we can look at the numbers again. I did increase the runtime for the README chart, but not for benchmark_decoders.
Benchmarks were run with:
python benchmarks/decoders/benchmark_decoders.py
Can you put this in the PR description just before the table so git log shows it?
I update the description with how I ran the benchmarks, and the top result from this batch. This is from calling the benchmarks four consecutive times:
[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
| uniform 10 seek()+next() | random 10 seek()+next() | 1 next() | 10 next() | 100 next() | create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
DecordNonBatchDecoderAccurateSeek | 55.1 | 119.1 | 12.7 | 18.4 | 62.0 |
TorchVision[backend=video_reader] | 349.0 | 327.3 | 12.7 | 16.0 | 44.2 |
TorchAudioDecoder | 467.3 | 513.7 | 8.0 | 15.3 | 73.1 |
TorchCodecCore: | 48.9 | 113.7 | 9.6 | 12.8 | 44.8 | 9.6
TorchCodecCore:num_threads=1 | 108.9 | 259.7 | 6.9 | 11.9 | 70.0 |
TorchCodecCoreCompiled | 47.8 | 110.1 | 9.7 | 13.3 | 49.3 |
TorchCodecCoreBatch | 49.3 | 43.3 | 11.5 | 14.5 | 61.8 |
TorchCodecPublic | 50.7 | 45.2 | 11.8 | 15.0 | 48.1 |
[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
| uniform 10 seek()+next() | random 10 seek()+next() | 1 next() | 10 next() | 100 next() | create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
DecordNonBatchDecoderAccurateSeek | 56.0 | 122.6 | 12.9 | 18.6 | 62.0 |
TorchVision[backend=video_reader] | 343.7 | 345.0 | 12.6 | 15.7 | 50.2 |
TorchAudioDecoder | 498.7 | 560.0 | 8.0 | 14.0 | 83.9 |
TorchCodecCore: | 48.9 | 111.9 | 9.5 | 12.8 | 59.1 | 9.6
TorchCodecCore:num_threads=1 | 111.2 | 263.4 | 6.9 | 12.0 | 72.6 |
TorchCodecCoreCompiled | 48.9 | 114.0 | 9.7 | 13.5 | 69.5 |
TorchCodecCoreBatch | 50.1 | 45.0 | 11.5 | 14.4 | 62.7 |
TorchCodecPublic | 50.4 | 46.1 | 11.6 | 15.1 | 60.9 |
[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
| uniform 10 seek()+next() | random 10 seek()+next() | 1 next() | 10 next() | 100 next() | create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
DecordNonBatchDecoderAccurateSeek | 52.8 | 117.8 | 12.9 | 16.3 | 68.8 |
TorchVision[backend=video_reader] | 361.6 | 337.1 | 12.8 | 16.0 | 51.5 |
TorchAudioDecoder | 472.7 | 518.9 | 8.1 | 15.6 | 84.5 |
TorchCodecCore: | 48.1 | 109.3 | 9.6 | 12.8 | 59.8 | 9.6
TorchCodecCore:num_threads=1 | 113.6 | 260.9 | 6.9 | 11.9 | 70.7 |
TorchCodecCoreCompiled | 48.4 | 109.2 | 9.7 | 13.4 | 64.8 |
TorchCodecCoreBatch | 49.1 | 43.7 | 11.5 | 14.6 | 61.9 |
TorchCodecPublic | 50.1 | 45.0 | 11.6 | 14.9 | 59.4 |
[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
| uniform 10 seek()+next() | random 10 seek()+next() | 1 next() | 10 next() | 100 next() | create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
DecordNonBatchDecoderAccurateSeek | 54.9 | 121.5 | 12.8 | 18.1 | 69.5 |
TorchVision[backend=video_reader] | 351.3 | 325.7 | 12.7 | 15.8 | 46.6 |
TorchAudioDecoder | 463.7 | 521.3 | 8.0 | 14.0 | 84.8 |
TorchCodecCore: | 47.4 | 110.4 | 9.6 | 12.7 | 58.5 | 9.5
TorchCodecCore:num_threads=1 | 112.1 | 271.0 | 6.9 | 11.9 | 69.0 |
TorchCodecCoreCompiled | 48.2 | 112.2 | 9.8 | 13.3 | 63.7 |
TorchCodecCoreBatch | 49.2 | 44.3 | 11.6 | 14.7 | 63.4 |
TorchCodecPublic | 49.4 | 43.8 | 11.6 | 14.8 | 59.6 |
The only major anomaly I see is how variable across runs TorchCodecCore and TorchCodecPublic are for 100 next. Curiously, they're both behaving about the same within a run.
This PR adds new benchmarks, changes existing benchmarks, and refactors the benchmark code itself. More specifically:
Running the benchmarks with the command:
On my dev machine yields:
I think we have an opportunity to improve performance here.
The current implementation of iterators for
VideoDecoder
use the indexing API (__getitem__()
). That is, we don't actually implement our own iterators. The Python language just uses__getitem__()
and__len__()
for us. That means we are seeking. If we implement our own iterators, we could do the same thing that TorchCodecCore is doing and get better performance with the public API.Also note that the README data and charts now use the public API, not the core. I think a principle we should stick to is that when we talk externally about performance, we should always talk about the performance of the public API.
Finally, #329 should merge before this PR. Then this PR should rebase on main and I need to rerun the script to generate the graph.