mrousavy / react-native-vision-camera

📸 A powerful, high-performance React Native Camera library.
https://react-native-vision-camera.com
MIT License

🐛 Different metadata frame count and decoded frame count #2934

Closed · MuhammadWaseem-DevOps closed this issue 4 months ago

MuhammadWaseem-DevOps commented 5 months ago

What's happening?

I'm using VisionCamera to record video and capture the timestamp of each frame. Recording itself works as expected. The issue is that when I use the recorded file for analysis on the server side, the frame count in the video's metadata and the number of actually decoded frames don't match: the decoded frame count is always less than the metadata frame count. Since the VisionCamera recording itself works fine, I'm attaching the Python script I use to compare the two frame counts.

Reproducible Code

<Camera
      ref={camera}
      device={device}
      video={true}
      audio={true}
      zoom={0}
      fps={30}
      style={styles.preview}
      isActive={true}
      onInitialized={() => updateIsInitialized(true)}
    />

=====================
# Python script

import av
import subprocess
import json

def get_ffprobe_frame_info(video_path: str):
    """Gets frame information using ffprobe."""
    cmd = [
        'ffprobe',
        '-v', 'error',
        '-select_streams', 'v:0',
        '-show_entries', 'frame=pkt_pts_time,pkt_dts_time,pict_type',
        '-of', 'json',
        video_path
    ]
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    frames_info = json.loads(result.stdout)
    return frames_info['frames']

def validate_frames_metadata(video_path: str) -> bool:
    """Checks if the number of decoded frames matches the file's metadata and logs details."""
    with av.open(video_path) as container:
        video_stream = container.streams.video[0]
        n_frames_metadata = video_stream.frames

        n_decoded_frames = 0
        keyframe_count = 0
        corrupted_frames = 0

        # Decode per packet so a decode error can be caught and counted
        # without aborting the whole loop (PyAV raises decode errors from
        # the decoder itself, not from the loop body).
        for packet in container.demux(video_stream):
            try:
                for frame in packet.decode():
                    # Check if the frame is a keyframe
                    if frame.key_frame:
                        keyframe_count += 1
                    n_decoded_frames += 1
            except av.AVError as e:
                corrupted_frames += 1
                print(f"Error decoding frame {n_decoded_frames}: {e}")

        print(f"Metadata frame count: {n_frames_metadata}")
        print(f"Decoded frame count: {n_decoded_frames}")

    return n_frames_metadata == n_decoded_frames

# Example usage
video_path = "/folderPath/video.mp4"  # Change this to your video file path
if not validate_frames_metadata(video_path):
    print("Frame count discrepancy detected.")
else:
    print("Frame counts match.")

# Sample output from the script
Metadata frame count: 336
Decoded frame count: 334
Frame count discrepancy detected.
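
As a cross-check, ffprobe can compare the declared frame count (nb_frames) with the number of frames it actually decodes (nb_read_frames) in a single call. A minimal sketch along the same lines as the script above:

import json
import subprocess

def ffprobe_frame_counts(video_path: str):
    """Compare the container's declared frame count (nb_frames) with the
    number of frames ffprobe actually decodes (nb_read_frames)."""
    cmd = [
        'ffprobe', '-v', 'error',
        '-count_frames',                 # fully decode the stream to count real frames
        '-select_streams', 'v:0',
        '-show_entries', 'stream=nb_frames,nb_read_frames',
        '-of', 'json',
        video_path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    stream = json.loads(result.stdout)['streams'][0]
    return stream.get('nb_frames'), stream.get('nb_read_frames')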

Relevant log output

VisionCamera.didSetProps(_:): Updating 16 props: [isActive, fps, onViewReady, zoom, cameraId, flex, onError, onStarted, onInitialized, audio, enableBufferCompression, onStopped, onCodeScanned, enableFrameProcessor, pixelFormat, video]
VisionCamera.configure(_:): configure { ... }: Waiting for lock...
VisionCamera.configure(_:): configure { ... }: Updating CameraSession Configuration... Difference(inputChanged: true, outputsChanged: true, videoStabilizationChanged: true, orientationChanged: true, formatChanged: true, sidePropsChanged: true, torchChanged: true, zoomChanged: true, exposureChanged: true, audioSessionChanged: true)
VisionCamera.configureDevice(configuration:): Configuring Input Device...
VisionCamera.configureDevice(configuration:): Configuring Camera com.apple.avfoundation.avcapturedevice.built-in_video:0...
VisionCamera.configureDevice(configuration:): Successfully configured Input Device!
VisionCamera.configureOutputs(configuration:): Configuring Outputs...
VisionCamera.configureOutputs(configuration:): Adding Video Data output...
VisionCamera.configureOutputs(configuration:): Successfully configured all outputs!
VisionCamera.onCameraStarted(): Camera started!
VisionCamera.onSessionInitialized(): Camera initialized!
VisionCamera.configure(_:): Beginning AudioSession configuration...
VisionCamera.configureAudioSession(configuration:): Configuring Audio Session...
VisionCamera.configureAudioSession(configuration:): Adding Audio input...
VisionCamera.configureAudioSession(configuration:): Adding Audio Data output...
VisionCamera.configure(_:): Committed AudioSession configuration!
VisionCamera.startRecording(options:onVideoRecorded:onError:): Starting Video recording...
VisionCamera.startRecording(options:onVideoRecorded:onError:): Will record to temporary file: /private/var/mobile/Containers/Data/Application/A65DF777-0E30-49B7-B843-1441336C5E2C/tmp/ReactNative/15E80A29-B005-4DD1-B5B4-595D1D4836E8.mov
VisionCamera.startRecording(options:onVideoRecorded:onError:): Enabling Audio for Recording...
VisionCamera.activateAudioSession(): Activating Audio Session...
VisionCamera.updateCategory(_:mode:options:): Changing AVAudioSession category from AVAudioSessionCategorySoloAmbient -> AVAudioSessionCategoryPlayAndRecord
VisionCamera.initializeAudioWriter(withSettings:format:): Initializing Audio AssetWriter with settings: ["AVEncoderQualityForVBRKey": 91, "AVSampleRateKey": 44100, "AVFormatIDKey": 1633772320, "AVEncoderBitRateStrategyKey": AVAudioBitRateStrategy_Variable, "AVEncoderBitRatePerChannelKey": 96000, "AVNumberOfChannelsKey": 1]
VisionCamera.updateCategory(_:mode:options:): AVAudioSession category changed!
VisionCamera.initializeAudioWriter(withSettings:format:): Initialized Audio AssetWriter.
VisionCamera.initializeVideoWriter(withSettings:): Initializing Video AssetWriter with settings: ["AVVideoHeightKey": 1920, "AVVideoWidthKey": 1080, "AVVideoCompressionPropertiesKey": {
    AllowFrameReordering = 1;
    AllowOpenGOP = 1;
    AverageBitRate = 7651584;
    ExpectedFrameRate = 30;
    MaxAllowedFrameQP = 41;
    MaxKeyFrameIntervalDuration = 1;
    MinAllowedFrameQP = 15;
    MinimizeMemoryUsage = 1;
    Priority = 80;
    ProfileLevel = "HEVC_Main_AutoLevel";
    RealTime = 1;
    RelaxAverageBitRateTarget = 1;
}, "AVVideoCodecKey": hvc1]
VisionCamera.initializeVideoWriter(withSettings:): Initialized Video AssetWriter.
VisionCamera.start(clock:): Starting Asset Writer(s)...
VisionCamera.activateAudioSession(): Audio Session activated!
VisionCamera.start(clock:): Asset Writer(s) started!
VisionCamera.start(clock:): Started RecordingSession at time: 199749.042035833
VisionCamera.startRecording(options:onVideoRecorded:onError:): RecordingSesssion started in 726.891417ms!
VisionCamera.stop(clock:): Requesting stop at 199752.018624458 seconds for AssetWriter with status "writing"...
VisionCamera.appendBuffer(_:clock:type:startFrameTimestamp:midFrameTimestamp:endFrameTimestamp:): Successfully appended last audio Buffer (at 199752.03920833333 seconds), finishing RecordingSession...
VisionCamera.finish(): Stopping AssetWriter with status "writing"...
VisionCamera.startRecording(options:onVideoRecorded:onError:): RecordingSession finished with status completed.
VisionCamera.deactivateAudioSession(): Deactivating Audio Session...
VisionCamera.deactivateAudioSession(): Audio Session deactivated!

Camera Device

{
  "id": "com.apple.avfoundation.avcapturedevice.built-in_video:0",
  "formats": [],
  "sensorOrientation": "landscape-right",
  "minZoom": 1,
  "supportsLowLightBoost": false,
  "maxExposure": 8,
  "supportsFocus": true,
  "physicalDevices": [
    "wide-angle-camera"
  ],
  "supportsRawCapture": false,
  "neutralZoom": 1,
  "minExposure": -8,
  "name": "Back Camera",
  "hasFlash": false,
  "minFocusDistance": 12,
  "maxZoom": 16,
  "hasTorch": false,
  "hardwareLevel": "full",
  "position": "back",
  "isMultiCam": false
}

Device

iPad Air (4th generation) iOS 17.4.1 (21E236)

VisionCamera Version

3.9.0

Can you reproduce this issue in the VisionCamera Example app?

Yes, I can reproduce the same issue in the Example app.

Additional information

MuhammadWaseem-DevOps commented 5 months ago

Anyone have any idea about this issue or a possible fix?

mrousavy commented 4 months ago

Hey!

I just spent a few days thinking about a bulletproof timestamp-synchronization solution, and I came up with a great idea: I built a TrackTimeline helper class which represents a video or audio track - it can be started & stopped, paused & resumed, and even supports nested pauses without issues.

This was really complex to build, as I had to synchronize timestamps between capture sessions, and the entire thing is a producer model - a video buffer can arrive a second or so later than the audio buffer, but I need to make sure the video track starts before the audio track starts and ends after the audio track ends - that's a huge brainf*ck! 🤯😅

There are also no helper APIs for this on iOS, and it looks like no other camera framework (not even native Swift/ObjC iOS camera libraries) supports this - they all break when timestamps have a delay (e.g. with video stabilization enabled), or don't support delays at all; so I had to build the thing myself.
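
For intuition, here is a toy sketch of the idea in Python (purely illustrative - the real TrackTimeline is a Swift class and its API differs): nested pauses are folded into one paused interval, and the effective duration subtracts the total paused time.

class ToyTrackTimeline:
    """Toy model of a track timeline with nested pause support."""

    def __init__(self):
        self.started_at = None
        self.stopped_at = None
        self._pause_depth = 0      # how many nested pauses are currently open
        self._pause_began = None   # when the outermost open pause began
        self._paused_total = 0.0   # accumulated paused time, in seconds

    def start(self, t):
        self.started_at = t

    def pause(self, t):
        if self._pause_depth == 0:
            self._pause_began = t  # only the outermost pause starts the clock
        self._pause_depth += 1

    def resume(self, t):
        self._pause_depth -= 1
        if self._pause_depth == 0:  # only the matching outermost resume stops it
            self._paused_total += t - self._pause_began

    def stop(self, t):
        self.stopped_at = t

    def duration(self):
        """Effective track duration, excluding all paused time."""
        return (self.stopped_at - self.started_at) - self._paused_total

# start at 0s, nested pause from 1s to 3s, stop at 5s -> effective duration 3.0s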

Check out this PR and see if it fixes the issue for you: https://github.com/mrousavy/react-native-vision-camera/pull/2948

Thanks! ❤️

mrousavy commented 4 months ago

I just re-read what you said, and it actually sounds intentional - there are situations where a few extra frames are encoded into the video.

This is to ensure the video is longer than the audio, but the video metadata has a flag that specifies the actual duration of the track session - this might cut off a few frames at the start or end.

See AVAssetWriter.startSession(atSourceTime:) / AVAssetWriter.endSession(atSourceTime:)
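
To observe this in a recorded file, you can compare the stream's declared start/duration against the first and last decoded frame timestamps. A rough PyAV sketch (illustrative only, not part of VisionCamera):

import av

def track_session_range(video_path: str):
    """Print the track's declared time range next to the first/last decoded
    frame timestamps, to see whether extra frames lie outside the session."""
    with av.open(video_path) as container:
        stream = container.streams.video[0]
        tb = stream.time_base
        start = float(stream.start_time * tb) if stream.start_time is not None else 0.0
        duration = float(stream.duration * tb) if stream.duration is not None else None
        times = [f.time for f in container.decode(video=0) if f.time is not None]
    print(f"declared range: start={start:.4f}s, duration={duration}s")
    print(f"decoded frames: first={times[0]:.4f}s, last={times[-1]:.4f}s, count={len(times)}")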

mrousavy commented 4 months ago

I think this is either fixed now in 4.2.0, or intentional (see the comment above). https://github.com/mrousavy/react-native-vision-camera/releases/tag/v4.2.0

MuhammadWaseem-DevOps commented 4 months ago

Thank you @mrousavy for the detailed comments and the new logic for accurate duration calculation. However, I'm still encountering the issue I mentioned earlier. Let me provide a more detailed explanation:

  1. Video recording starts.
  2. Each call to let successful = assetWriterInput.append(buffer) returns success, and I am counting the frames written to the file at this point.
  3. Video recording stops.

For example, the count is 184 because assetWriterInput.append(buffer) was called successfully 184 times. The video file's metadata also reflects 184 frames. However, when I decode the recorded file using the Python script I mentioned in my first comment, it shows 183 frames or sometimes 182 frames. The decoded frame count is always less than the number of frames actually written to the file.

Could you suggest a way to fix this discrepancy? I have even tried excluding frames that come before the video's starting timestamp by returning false in the start case of the events (if timestamp < event.timestamp).

I need the metadata file frame count to match the decoded frame count because I am recording the timestamp of each frame for later video analysis. The timestamps are recorded in a separate JSON file. So, when 184 buffers are appended, the timestamp count is also 184. But with only 183 or 182 decoded frames, there is a mismatch, and it's unclear which frame was dropped or skipped (whether at the start, middle, or end).
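
For reference, one way to narrow down which frame went missing is to match each recorded timestamp against the decoded frames' presentation times. A minimal sketch, assuming the JSON file is a flat list of per-frame capture times in seconds (the helper name and tolerance below are hypothetical):

import json
import av

def find_unmatched_timestamps(video_path: str, timestamps_path: str, tol: float = 1/60):
    """Hypothetical helper: report which recorded timestamps have no decoded
    frame within `tol` seconds, to locate the dropped frame(s)."""
    with open(timestamps_path) as f:
        recorded = json.load(f)              # assumed: flat list of capture times
    base = recorded[0]
    recorded = [t - base for t in recorded]  # normalize to a zero-based clock,
                                             # since capture times and PTS use
                                             # different epochs
    with av.open(video_path) as container:
        decoded = sorted(f.time for f in container.decode(video=0) if f.time is not None)
    return [t for t in recorded
            if not any(abs(t - d) <= tol for d in decoded)]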

Any assistance to resolve this issue would be greatly appreciated. Thanks!

mrousavy commented 4 months ago

I think this shouldn't be changed in VisionCamera, but rather in your Python script.

VisionCamera intentionally adds a few frames before or after the video to make sure there are no blanks (because if an audio sample arrives after the last video sample, the resulting video would end on a blank frame).

So I guess you just need to make sure you only count the frames that actually fall within the time range of the track duration - see the sketch below.
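
A rough sketch of that filtering with PyAV (assuming the stream exposes start_time and duration metadata):

import av

def count_frames_in_track_range(video_path: str) -> int:
    """Count only frames whose presentation time falls inside the track's
    declared [start, start + duration] window."""
    with av.open(video_path) as container:
        stream = container.streams.video[0]
        tb = stream.time_base
        start = float(stream.start_time * tb) if stream.start_time is not None else 0.0
        end = start + float(stream.duration * tb)
        return sum(1 for f in container.decode(video=0)
                   if f.time is not None and start <= f.time <= end)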

MuhammadWaseem-DevOps commented 4 months ago

The decoded frames are always fewer than the frames appended by VisionCamera, never more. I have also tried the ffmpeg -i video.mp4 thumb%04d.jpg -hide_banner command, and the result is the same as with the Python script.

Additionally, the video recorded by Expo Camera does not exhibit any frame discrepancies. I have also tested some random recorded videos from other sources, and none of them show any frame differences.

Do you think this issue can be fixed on the Vision Camera side? Any help would be greatly appreciated.