pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

StreamReader.add_basic_video_stream drops last frame if `frame_rate` is specified #3809

Open tyler-rt opened 4 months ago

🐛 Describe the bug

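Calling add_basic_video_stream with `frame_rate` specified causes the last frame of a video to be erroneously dropped. In the example below, the same file decodes to 121 frames without `frame_rate` but only 120 frames with `frame_rate=25`, which is the video's native frame rate.

First, download example.mp4 (177 KB).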

import torch
import torio

def read_video(file_path, frame_rate=25):
    reader = torio.io.StreamingMediaDecoder(file_path)
    reader.add_basic_video_stream(
        frames_per_chunk=-1,
        buffer_chunk_size=-1,
        decoder_option={"threads": "1"},
        frame_rate=frame_rate,
    )

    video_chunks = []

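    # stream() yields one item per configured output stream; only one
    # video stream was added above, so each iteration holds a single chunk.
    for (video_chunk,) in reader.stream():
        if video_chunk is not None:
            video_chunks.append(video_chunk)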

    video_data = None
    if len(video_chunks) > 0:
        video_data = torch.cat(video_chunks, dim=0)

    return video_data

mp4_path = "example.mp4"  # the file downloaded above (originally voxceleb_subset.iloc[0].videopath)
video0 = read_video(mp4_path, frame_rate=None)
video1 = read_video(mp4_path, frame_rate=25)
print(f"original num_frames: {video0.shape[0]}, with hardcode frame_rate: {video1.shape[0]}")
video0 = video0.permute(0,2,3,1)
video1 = video1.permute(0,2,3,1)
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(10, 5))
ax = axes.flatten()
ax[0].imshow(video0[0]-video1[0])
ax[0].set_title("first frame diff")
ax[1].imshow(video0[-1]-video1[-1])
ax[1].set_title("last frame diff")
ax[2].imshow(video0[-2]-video1[-1])
ax[2].set_title("diff of new video last frame to original second to last")
# turn off all ticks and labels
for a in ax:
    a.axis("off")
# plt.suptitle(f"StreamingMediaDecoder drops last frame when frame_rate is specified\n{mp4_path}")
plt.suptitle(f"StreamingMediaDecoder drops last frame when frame_rate is specified\nexample.mp4")
plt.tight_layout()

original num_frames: 121, with hardcode frame_rate: 120

[image: output of the plotting code above]
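The plot above compares frames visually. As a complementary numeric check, here is a minimal sketch that reports which source frames have no counterpart in the resampled output. The helper names `nearest_source_frames` and `missing_source_frames` are hypothetical and not part of torio, and the sketch assumes CPU tensors (pass `video0.numpy()` and `video1.numpy()`) in which resampled frames are pixel-identical copies of source frames. It also casts to a signed dtype first, because subtracting `uint8` frames directly wraps around instead of going negative.

```python
import numpy as np

def nearest_source_frames(src, out):
    """For each frame in `out`, return the index of the closest frame in `src`.

    Both inputs are shaped (num_frames, ...) with identical per-frame shapes.
    Distance is the mean absolute pixel difference, computed in int16 so that
    uint8 subtraction cannot wrap around.
    """
    src = np.asarray(src).astype(np.int16).reshape(len(src), -1)
    out = np.asarray(out).astype(np.int16).reshape(len(out), -1)
    return [int(np.abs(src - frame).mean(axis=1).argmin()) for frame in out]

def missing_source_frames(src, out):
    """Indices of frames in `src` that no frame in `out` maps to."""
    matched = set(nearest_source_frames(src, out))
    return [i for i in range(len(src)) if i not in matched]

# Synthetic stand-in for the two decoded videos: five distinct frames,
# with the "resampled" copy missing the final one.
src = np.stack([np.full((4, 4, 3), 2 * i, dtype=np.uint8) for i in range(5)])
out = src[:-1]
print(missing_source_frames(src, out))  # [4]
```

Applied to the two decodes in this report, a result of `[120]` would be consistent with only the final frame being dropped.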

We can verify from mediainfo that the video's frame rate is indeed 25 fps, so specifying `frame_rate=25` should not change the number of frames:

❯ mediainfo example.mp4 
General
Complete name                            : example.mp4
Format                                   : MPEG-4
Format profile                           : Base Media
Codec ID                                 : isom (isom/iso2/avc1/mp41)
File size                                : 177 KiB
Duration                                 : 4 s 904 ms
Overall bit rate mode                    : Variable
Overall bit rate                         : 296 kb/s
Frame rate                               : 25.000 FPS
Writing application                      : Lavf57.83.100

Video
ID                                       : 1
Format                                   : AVC
Format/Info                              : Advanced Video Codec
Format profile                           : High@L1.2
Format settings                          : CABAC / 4 Ref Frames
Format settings, CABAC                   : Yes
Format settings, Reference frames        : 4 frames
Codec ID                                 : avc1
Codec ID/Info                            : Advanced Video Coding
Duration                                 : 4 s 840 ms
Bit rate                                 : 229 kb/s
Width                                    : 224 pixels
Height                                   : 224 pixels
Display aspect ratio                     : 1.000
Frame rate mode                          : Constant
Frame rate                               : 25.000 FPS
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Scan type                                : Progressive
Bits/(Pixel*Frame)                       : 0.182
Stream size                              : 135 KiB (76%)
Writing library                          : x264 core 152 r2854 e9a5903
Encoding settings                        : cabac=1 / ref=3 / deblock=1:0:0 / analyse=0x3:0x113 / me=hex / subme=7 / psy=1 / psy_rd=1.00:0.00 / mixed_ref=1 / me_range=16 / chroma_me=1 / trellis=1 / 8x8dct=1 / cqm=0 / deadzone=21,11 / fast_pskip=1 / chroma_qp_offset=-2 / threads=7 / lookahead_threads=1 / sliced_threads=0 / nr=0 / decimate=1 / interlaced=0 / bluray_compat=0 / constrained_intra=0 / bframes=3 / b_pyramid=2 / b_adapt=1 / b_bias=0 / direct=1 / weightb=1 / open_gop=0 / weightp=2 / keyint=250 / keyint_min=25 / scenecut=40 / intra_refresh=0 / rc_lookahead=40 / rc=crf / mbtree=1 / crf=23.0 / qcomp=0.60 / qpmin=0 / qpmax=69 / qpstep=4 / ip_ratio=1.40 / aq=1:1.00
Codec configuration box                  : avcC

Audio
ID                                       : 2
Format                                   : AAC LC
Format/Info                              : Advanced Audio Codec Low Complexity
Codec ID                                 : mp4a-40-2
Duration                                 : 4 s 904 ms
Duration_LastFrame                       : -24 ms
Bit rate mode                            : Variable
Bit rate                                 : 62.7 kb/s
Maximum bit rate                         : 69.0 kb/s
Channel(s)                               : 1 channel
Channel layout                           : M
Sampling rate                            : 16.0 kHz
Frame rate                               : 15.625 FPS (1024 SPF)
Compression mode                         : Lossy
Stream size                              : 37.6 KiB (21%)
Default                                  : Yes
Alternate group                          : 1
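The mediainfo values above also give an independent expected frame count. A small sketch of that arithmetic, assuming a constant frame rate as reported ("Frame rate mode: Constant"); the function name `expected_frame_count` is made up for illustration:

```python
from fractions import Fraction

def expected_frame_count(duration_ms: int, fps: Fraction) -> int:
    """Number of frames in a constant-frame-rate stream of the given duration."""
    return round(Fraction(duration_ms, 1000) * fps)

video_duration_ms = 4840  # "Duration: 4 s 840 ms" in the Video section
fps = Fraction(25)        # "Frame rate: 25.000 FPS"
print(expected_frame_count(video_duration_ms, fps))  # 121
```

This matches the 121 frames decoded without `frame_rate`, so the 120-frame result is one frame short.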

Versions

$ python collect_env.py
Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.5.0-1022-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.5.119
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G
GPU 4: NVIDIA A10G
GPU 5: NVIDIA A10G
GPU 6: NVIDIA A10G
GPU 7: NVIDIA A10G

Nvidia driver version: 550.90.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R32
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 0
BogoMIPS: 5600.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 48 MiB (96 instances)
L3 cache: 384 MiB (24 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.2
[pip3] pytorch-lightning==2.2.1
[pip3] torch==2.3.1
[pip3] torchaudio==2.3.1
[pip3] torchmetrics==1.3.1
[pip3] torchvision==0.18.1
[pip3] triton==2.3.1
[conda] torch 2.3.1 pypi_0 pypi
[conda] torchaudio 2.3.1 pypi_0 pypi
[conda] torchvision 0.18.1 pypi_0 pypi
[conda] triton 2.3.1 pypi_0 pypi