pytorch / torchcodec

PyTorch video decoding
BSD 3-Clause "New" or "Revised" License

Refactor and add benchmarks #330

Closed scotts closed 3 weeks ago

scotts commented 3 weeks ago

This PR adds new benchmarks, changes existing benchmarks, and refactors the benchmark code itself. More specifically:

  1. Refactors the benchmark code and, most visibly, renames many of the decoders. The names in the code now more closely match what we call them in conversation, which I think makes both the code and the output more readable.
  2. Adds an experiment using random timestamps. The three experiments are now: random timestamps, uniform timestamps, and decode the first 100 frames.
  3. Adds a new decoder kind in the experiments: TorchCodecPublic. All of the pre-existing TorchCodec decoder kinds now start with TorchCodecCore. The naming is, I hope, obvious: one kind uses the public API, the other directly uses the core API.

Running the benchmarks with the command:

python benchmarks/decoders/benchmark_decoders.py

On my dev machine yields:

[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
                                         |  uniform 10 seek()+next()  |  random 10 seek()+next()  |  1 next()  |  10 next()  |  100 next()  |  create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
      DecordNonBatchDecoderAccurateSeek  |            55.1            |           119.1           |    12.7    |     18.4    |     62.0     |                 
      TorchVision[backend=video_reader]  |           349.0            |           327.3           |    12.7    |     16.0    |     44.2     |                 
      TorchAudioDecoder                  |           467.3            |           513.7           |     8.0    |     15.3    |     73.1     |                 
      TorchCodecCore:                    |            48.9            |           113.7           |     9.6    |     12.8    |     44.8     |        9.6      
      TorchCodecCore:num_threads=1       |           108.9            |           259.7           |     6.9    |     11.9    |     70.0     |                 
      TorchCodecCoreCompiled             |            47.8            |           110.1           |     9.7    |     13.3    |     49.3     |                 
      TorchCodecCoreBatch                |            49.3            |            43.3           |    11.5    |     14.5    |     61.8     |                 
      TorchCodecPublic                   |            50.7            |            45.2           |    11.8    |     15.0    |     48.1     |                 

I think we have an opportunity to improve performance here.

The current implementation of iteration for VideoDecoder relies on the indexing API (__getitem__()). That is, we don't actually implement our own iterators; the Python language just uses __getitem__() and __len__() for us. That means we seek on every frame. If we implement our own iterators, we could do the same thing that TorchCodecCore is doing and get better performance with the public API.
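To make the difference concrete, here is a minimal sketch using a toy class. `ToyDecoder`, `ToyDecoderWithIter`, `_seek`, and `_next` are hypothetical names for illustration only, not the torchcodec API. When a class defines no `__iter__`, Python's fallback iteration calls `__getitem__` with 0, 1, 2, ... until `IndexError` is raised, so an index-based `__getitem__` pays for one seek per frame, while a dedicated `__iter__` can seek once and then advance sequentially.

```python
class ToyDecoder:
    """Hypothetical stand-in for a video decoder; counts how often it seeks."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.seek_count = 0
        self._pos = 0

    def __len__(self):
        return self.num_frames

    def _seek(self, index):
        # Stand-in for an expensive seek in a real decoder.
        self.seek_count += 1
        self._pos = index

    def _next(self):
        # Stand-in for decoding the next frame sequentially.
        frame = self._pos
        self._pos += 1
        return frame

    def __getitem__(self, index):
        if not 0 <= index < self.num_frames:
            raise IndexError(index)
        self._seek(index)
        return self._next()


class ToyDecoderWithIter(ToyDecoder):
    """Same toy decoder, but with an explicit iterator."""

    def __iter__(self):
        # Seek once to the start, then only advance.
        self._seek(0)
        for _ in range(self.num_frames):
            yield self._next()


fallback = ToyDecoder(5)
frames_fallback = list(fallback)    # uses __getitem__ for every frame

explicit = ToyDecoderWithIter(5)
frames_explicit = list(explicit)    # uses __iter__

print(fallback.seek_count, explicit.seek_count)
```

Both loops yield the same frames; only the number of seeks differs. Whether that translates into the same gain in the real decoder is something the benchmarks would have to confirm.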

Also note that the README data and charts now use the public API, not the core. I think a principle we should stick to is that when we talk externally about performance, we should always talk about the performance of the public API.

Finally, #329 should merge before this PR. After that, this PR should be rebased on main, and I need to rerun the script to regenerate the graph.

ahmadsharif1 commented 3 weeks ago

Can you add the cli you used to generate this data too?

Also it's weird to see the core being 2x slower than public for the random seek -- maybe it's because the random pts list is different for every decoder? Or we aren't running the benchmarks for long enough?

scotts commented 3 weeks ago

Benchmarks were run with:

python benchmarks/decoders/benchmark_decoders.py

Note that I changed the defaults of what benchmarks run when there are no options provided. Let me fix the call to random, and then we can look at the numbers again. I did increase the runtime for the README chart, but not for benchmark_decoders.
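One way to rule out the "different random pts list per decoder" explanation is to generate the timestamps once, from a fixed seed, and pass the same lists to every decoder. The following is only a sketch of that idea; `make_timestamps` and its parameters are hypothetical and are not taken from benchmark_decoders.py.

```python
import random


def make_timestamps(duration_s, num_samples, seed=0):
    """Return (uniform, random) timestamp lists in [0, duration_s).

    A dedicated, seeded generator makes the random list reproducible,
    so every decoder can be benchmarked on identical seek targets.
    """
    rng = random.Random(seed)
    uniform_pts = [i * duration_s / num_samples for i in range(num_samples)]
    # rng.random() is in [0.0, 1.0), so each value stays below duration_s.
    random_pts = [rng.random() * duration_s for _ in range(num_samples)]
    return uniform_pts, random_pts


# Generate once, then reuse the same lists for every decoder under test.
uniform_pts, random_pts = make_timestamps(duration_s=13.013, num_samples=10)
```

Sharing the lists means any remaining gap between decoders in the random column reflects the decoders themselves rather than an easier or harder draw of timestamps.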

ahmadsharif1 commented 3 weeks ago

Benchmarks were run with:

python benchmarks/decoders/benchmark_decoders.py

Can you put this in the PR description just before the table so git log shows it?

scotts commented 3 weeks ago

I updated the description with how I ran the benchmarks, and with the top result from this batch. This is from calling the benchmarks four consecutive times:

[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
                                         |  uniform 10 seek()+next()  |  random 10 seek()+next()  |  1 next()  |  10 next()  |  100 next()  |  create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
      DecordNonBatchDecoderAccurateSeek  |            55.1            |           119.1           |    12.7    |     18.4    |     62.0     |                 
      TorchVision[backend=video_reader]  |           349.0            |           327.3           |    12.7    |     16.0    |     44.2     |                 
      TorchAudioDecoder                  |           467.3            |           513.7           |     8.0    |     15.3    |     73.1     |                 
      TorchCodecCore:                    |            48.9            |           113.7           |     9.6    |     12.8    |     44.8     |        9.6      
      TorchCodecCore:num_threads=1       |           108.9            |           259.7           |     6.9    |     11.9    |     70.0     |                 
      TorchCodecCoreCompiled             |            47.8            |           110.1           |     9.7    |     13.3    |     49.3     |                 
      TorchCodecCoreBatch                |            49.3            |            43.3           |    11.5    |     14.5    |     61.8     |                 
      TorchCodecPublic                   |            50.7            |            45.2           |    11.8    |     15.0    |     48.1     |                 

[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
                                         |  uniform 10 seek()+next()  |  random 10 seek()+next()  |  1 next()  |  10 next()  |  100 next()  |  create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
      DecordNonBatchDecoderAccurateSeek  |            56.0            |           122.6           |    12.9    |     18.6    |     62.0     |                 
      TorchVision[backend=video_reader]  |           343.7            |           345.0           |    12.6    |     15.7    |     50.2     |                 
      TorchAudioDecoder                  |           498.7            |           560.0           |     8.0    |     14.0    |     83.9     |                 
      TorchCodecCore:                    |            48.9            |           111.9           |     9.5    |     12.8    |     59.1     |        9.6      
      TorchCodecCore:num_threads=1       |           111.2            |           263.4           |     6.9    |     12.0    |     72.6     |                 
      TorchCodecCoreCompiled             |            48.9            |           114.0           |     9.7    |     13.5    |     69.5     |                 
      TorchCodecCoreBatch                |            50.1            |            45.0           |    11.5    |     14.4    |     62.7     |                 
      TorchCodecPublic                   |            50.4            |            46.1           |    11.6    |     15.1    |     60.9     |                 

[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
                                         |  uniform 10 seek()+next()  |  random 10 seek()+next()  |  1 next()  |  10 next()  |  100 next()  |  create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
      DecordNonBatchDecoderAccurateSeek  |            52.8            |           117.8           |    12.9    |     16.3    |     68.8     |                 
      TorchVision[backend=video_reader]  |           361.6            |           337.1           |    12.8    |     16.0    |     51.5     |                 
      TorchAudioDecoder                  |           472.7            |           518.9           |     8.1    |     15.6    |     84.5     |                 
      TorchCodecCore:                    |            48.1            |           109.3           |     9.6    |     12.8    |     59.8     |        9.6      
      TorchCodecCore:num_threads=1       |           113.6            |           260.9           |     6.9    |     11.9    |     70.7     |                 
      TorchCodecCoreCompiled             |            48.4            |           109.2           |     9.7    |     13.4    |     64.8     |                 
      TorchCodecCoreBatch                |            49.1            |            43.7           |    11.5    |     14.6    |     61.9     |                 
      TorchCodecPublic                   |            50.1            |            45.0           |    11.6    |     14.9    |     59.4     |                 

[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
                                         |  uniform 10 seek()+next()  |  random 10 seek()+next()  |  1 next()  |  10 next()  |  100 next()  |  create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
      DecordNonBatchDecoderAccurateSeek  |            54.9            |           121.5           |    12.8    |     18.1    |     69.5     |                 
      TorchVision[backend=video_reader]  |           351.3            |           325.7           |    12.7    |     15.8    |     46.6     |                 
      TorchAudioDecoder                  |           463.7            |           521.3           |     8.0    |     14.0    |     84.8     |                 
      TorchCodecCore:                    |            47.4            |           110.4           |     9.6    |     12.7    |     58.5     |        9.5      
      TorchCodecCore:num_threads=1       |           112.1            |           271.0           |     6.9    |     11.9    |     69.0     |                 
      TorchCodecCoreCompiled             |            48.2            |           112.2           |     9.8    |     13.3    |     63.7     |                 
      TorchCodecCoreBatch                |            49.2            |            44.3           |    11.6    |     14.7    |     63.4     |                 
      TorchCodecPublic                   |            49.4            |            43.8           |    11.6    |     14.8    |     59.6     |                 

The only major anomaly I see is how much TorchCodecCore and TorchCodecPublic vary across runs for 100 next(). Curiously, the two behave about the same within each run.
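As a rough check of that observation, the 100 next() values for the two decoders can be compared directly. This is a small sketch; the numbers are copied from the four tables above, and the names `core`, `public`, and `spread` are just for illustration.

```python
# 100 next() values for the four consecutive runs, copied from the tables.
core = [44.8, 59.1, 59.8, 58.5]      # TorchCodecCore:
public = [48.1, 60.9, 59.4, 59.6]    # TorchCodecPublic


def spread(values):
    """Difference between the slowest and fastest run, rounded to 0.1."""
    return round(max(values) - min(values), 1)


# Variation across runs for each decoder.
core_spread = spread(core)
public_spread = spread(public)

# Gap between the two decoders within each individual run.
within_run_gaps = [round(abs(c - p), 1) for c, p in zip(core, public)]

print(core_spread, public_spread, within_run_gaps)
```

Across runs, each decoder moves by more than 12, while within any single run the two stay within about 3.3 of each other, which is consistent with the first run being the outlier for both.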