Speed-up time-based samplers by 20X and index-based by 1.5X

Fixes https://github.com/pytorch/torchcodec/issues/256

The samplers now rely on our C++ "sort and dedup" logic instead of the less efficient Python implementation. This has a few benefits:

- We can avoid extra copies.
- Samplers can now return a single 5D FrameBatch instead of a list of 4D FrameBatch. The 5D FrameBatch output is a "batch" of clips. Its data is of shape (num_clips, num_frames_per_clips, C, H, W) (or HWC).
- The dedup logic now works efficiently for time-based samplers (i.e. this fixes https://github.com/pytorch/torchcodec/issues/256).
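
For readers unfamiliar with the idea, here is a minimal pure-Python sketch of "sort and dedup": visit the requested frames in sorted order, decode each distinct frame once, and write the result back to every position that asked for it. This is only an illustration and not the actual C++ implementation; `decode_sorted_dedup`, `group_into_clips` and `fake_decode` are hypothetical names, and strings stand in for decoded frames.

```python
def decode_sorted_dedup(requested, decode_frame):
    """Return frames in `requested` order, decoding each distinct index once."""
    # Positions into `requested`, ordered by the frame index they ask for.
    order = sorted(range(len(requested)), key=lambda pos: requested[pos])
    out = [None] * len(requested)
    prev_index, prev_frame = None, None
    for pos in order:
        index = requested[pos]
        if index != prev_index:
            # First time we see this index: decode it (hypothetical callback).
            prev_frame = decode_frame(index)
            prev_index = index
        # Duplicates reuse the frame decoded just before.
        out[pos] = prev_frame
    return out


def group_into_clips(frames, num_clips, num_frames_per_clip):
    """Group a flat list of frames into num_clips clips (nested-list stand-in for a batch)."""
    assert len(frames) == num_clips * num_frames_per_clip
    n = num_frames_per_clip
    return [frames[i * n:(i + 1) * n] for i in range(num_clips)]


decoded = []  # records which indices were actually decoded, and in what order


def fake_decode(index):
    decoded.append(index)
    return f"frame{index}"


requested = [5, 2, 5, 9, 2, 3]  # two clips of three frames, with overlap
frames = decode_sorted_dedup(requested, fake_decode)
clips = group_into_clips(frames, num_clips=2, num_frames_per_clip=3)
```

In this sketch six frames are requested but only four are decoded, and they are visited in increasing order, so a decoder would never need to seek backwards.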

Running our samplers benchmark:

On main:
----------
num_clips = 1
clips_at_random_indices     med = 19.09ms +- 4.24
clips_at_regular_indices    med = 10.75ms +- 1.10
clips_at_random_timestamps  med = 17.47ms +- 4.45
clips_at_regular_timestamps med = 17.29ms +- 4.10
----------
num_clips = 50
clips_at_random_indices     med = 144.86ms +- 10.09
clips_at_regular_indices    med = 162.97ms +- 40.31
clips_at_random_timestamps  med = 2332.83ms +- 426.51
clips_at_regular_timestamps med = 1871.10ms +- 351.33

This PR:
----------
num_clips = 1
clips_at_random_indices     med = 15.27ms +- 4.07
clips_at_regular_indices    med = 8.70ms +- 1.33
clips_at_random_timestamps  med = 16.34ms +- 4.40
clips_at_regular_timestamps med = 9.57ms +- 3.69
----------
num_clips = 50
clips_at_random_indices     med = 97.06ms +- 3.90
clips_at_regular_indices    med = 107.23ms +- 2.73
clips_at_random_timestamps  med = 104.52ms +- 3.49
clips_at_regular_timestamps med = 117.30ms +- 5.80
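
The speed-ups in the title can be derived from the `num_clips = 50` medians above. The snippet below just divides the "main" median by the "this PR" median for each sampler; the dictionary names are illustrative.

```python
# Median times in ms for num_clips = 50, copied from the benchmark output above.
main_ms = {
    "clips_at_random_indices": 144.86,
    "clips_at_regular_indices": 162.97,
    "clips_at_random_timestamps": 2332.83,
    "clips_at_regular_timestamps": 1871.10,
}
pr_ms = {
    "clips_at_random_indices": 97.06,
    "clips_at_regular_indices": 107.23,
    "clips_at_random_timestamps": 104.52,
    "clips_at_regular_timestamps": 117.30,
}

# Speed-up = time on main / time with this PR.
speedups = {name: main_ms[name] / pr_ms[name] for name in main_ms}
for name, speedup in speedups.items():
    print(f"{name:<28} {speedup:.1f}x")
```

This gives roughly 1.5x for both index-based samplers, and about 22x and 16x for the random and regular time-based samplers respectively.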

:rocket:
