lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Support for Video Features, for example How2Sign #1359

Open kerolos opened 1 week ago

kerolos commented 1 week ago

Extend Lhotse to support video features for tasks such as sign language recognition (e.g., How2Sign) and human activity recognition. This enhancement will be useful for the Icefall platform.

Details

With the recent support for video in PR #1151, I am interested in developing a new recipe to handle video data and extract features using tools like MediaPipe.

Objectives

  1. Recipe Addition:

    • Add a new recipe that supports video data in the lhotse/recipes directory.
  2. Feature Extraction:

    • Extract per-frame features from the videos (e.g., MediaPipe Holistic keypoints) and store them in a format that Lhotse can load during training.

Implementation Steps

  1. Create Manifest Files (see the programmatic sketch after this list):

    • Recordings manifest (recordings.jsonl):

      {
        "id": "-fZc293MpJk_0-1-rgb_front",
        "sources": [
          {
            "type": "file",
            "channels": [0],
            "source": "/mnt/TB16/sign2text/dataset/How2Sign/clips/test_rgb_front_clips/raw_videos/-fZc293MpJk_0-1-rgb_front.mp4"
          }
        ],
        "sampling_rate": 24,
        "num_samples": 17,
        "duration": 6.53,
        "end": 6.53,
        "channel_ids": [0],
        "feature_path": "/mnt/TB16/sign2text/train_SignModel/en/20_06_2024/data/original/raw_features/-fZc293MpJk_0-1-rgb_front.txt"
      }
    • Supervisions manifest (supervisions.jsonl):

      {
        "id": "-fZc293MpJk_0-1-rgb_front",
        "recording_id": "-fZc293MpJk_0-1-rgb_front",
        "start": 0.0,
        "end": 6.53,
        "duration": 6.53,
        "channel": 0,
        "text": "hi",
        "speaker": "-fZc293MpJk"
      }
  2. Feature Extraction Script:

    • Create a script compute_features_sign_language.py:

      import argparse
      import logging
      import os
      from pathlib import Path

      import numpy as np
      import torch
      from lhotse import CutSet, NumpyFilesWriter, load_manifest_lazy
      from tqdm import tqdm

      # Set the number of threads for torch to avoid performance issues.
      torch.set_num_threads(1)
      torch.set_num_interop_threads(1)


      def get_args():
          parser = argparse.ArgumentParser(
              description="Create SSL feature files for a sign language dataset."
          )
          parser.add_argument(
              "--src-dir",
              type=str,
              help="Path to the data source",
          )
          parser.add_argument(
              "--output-dir",
              type=str,
              help="Output directory",
          )
          parser.add_argument(
              "--feature-dim",
              type=int,
              default=1662,
              help="Dimension of the feature vectors",
          )
          return parser.parse_args()


      def load_raw_features(feature_path, feature_dim):
          # Load a plain-text feature file and reshape it to (num_frames, feature_dim).
          with open(feature_path, "r") as f:
              raw_features = np.loadtxt(f)
          return raw_features.reshape(-1, feature_dim)


      def compute_sign_language_features(src_dir, output_dir, feature_dim):
          src_dir = Path(src_dir)
          output_dir = Path(output_dir)
          output_dir.mkdir(parents=True, exist_ok=True)

          recordings_manifest = load_manifest_lazy(src_dir / "recordings.jsonl.gz")
          supervisions_manifest = load_manifest_lazy(src_dir / "supervisions.jsonl.gz")

          # NOTE: I am not sure how this part should be implemented.
          with tqdm(total=len(recordings_manifest)) as pbar:
              for recording in recordings_manifest:
                  feature_path = recording["feature_path"]
                  if os.path.exists(feature_path):
                      features = load_raw_features(feature_path, feature_dim)
                  else:
                      # The actual feature extraction logic should go here if needed.
                      raise FileNotFoundError(f"Feature file not found: {feature_path}")

                  output_file = output_dir / f"{recording['id']}.npy"
                  np.save(output_file, features)
                  pbar.update(1)


      def main():
          logging.basicConfig(
              level=logging.INFO,
              format="%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s",
          )
          args = get_args()
          compute_sign_language_features(args.src_dir, args.output_dir, args.feature_dim)


      if __name__ == "__main__":
          main()
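
For reference (step 1 above), here is a rough sketch of how these manifests could be built programmatically with lhotse. The example path, the transcript, and the speaker-from-id convention are placeholders of mine, not part of any existing recipe:

from lhotse import Recording, RecordingSet, SupervisionSegment, SupervisionSet

recordings = []
supervisions = []
# Placeholder list of (video path, transcript) pairs; in a real recipe this comes from the corpus metadata.
for video_path, text in [("/path/to/-fZc293MpJk_0-1-rgb_front.mp4", "hi")]:
    rec = Recording.from_file(video_path)  # probes the video and fills in id, duration, etc.
    recordings.append(rec)
    supervisions.append(
        SupervisionSegment(
            id=rec.id,
            recording_id=rec.id,
            start=0.0,
            duration=rec.duration,
            channel=0,
            text=text,
            speaker=rec.id.split("_")[0],  # e.g. "-fZc293MpJk"
        )
    )

RecordingSet.from_recordings(recordings).to_file("recordings.jsonl.gz")
SupervisionSet.from_segments(supervisions).to_file("supervisions.jsonl.gz")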

Questions

  1. Is there a plan to add a recipe that supports video data in Lhotse?
  2. How can I start using customized features, for example, using MediaPipe Pose Estimation tools?
  3. What format should be used to save the extracted features (i.e., has_features) so that they can be loaded later for training?
  4. The frames per second are not always fixed; they vary between 24 fps and 50 fps. How can I deal with that?

I would appreciate any guidance or support on implementing this feature and using it within the Icefall platform, @pzelasko.

Thank you!

pzelasko commented 1 week ago

Hi @kerolos, thanks for opening this discussion! I can help you get your video recipe set up.

Is there a plan to add a recipe that supports video data in Lhotse?

There is currently one AV recipe, for the GRID audio-visual corpus (https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/grid.py). In general, lhotse recipes download and prepare the manifests for datasets, but actual training is out of lhotse's scope. You may want to set up a separate repository with your experiment's code that imports lhotse.

How can I start using customized features, for example, using MediaPipe Pose Estimation tools?

Once you create a recording, you can load the video, process it with some module, and save + attach as a custom field to the cut. For example:

from lhotse import Recording
from lhotse.features.io import NumpyHdf5Writer

video_recording = Recording.from_file("/path/to/-fZc293MpJk_0-1-rgb_front.mp4")  # lhotse will auto-construct the video recording manifest
video_cut = video_recording.to_cut()

video_frames = video_cut.load_video()  # video_frames is a uint8 np.array with shape (T, C, H, W) [or some other permutation, I don't remember off the top of my head]
video_features = compute_some_features(video_frames)  # video_features is an np.array with arbitrary shape

# Option 1 -> save to some storage directly.
# temporal_dim indicates which dimension in video_features corresponds to time; set it accordingly.
# frame_shift is the time between consecutive feature frames in seconds, hence 1 / fps here.
with NumpyHdf5Writer("video_features.h5") as writer:
    video_cut.video_features = writer.store_array(
        video_cut.id, video_features, frame_shift=1.0 / video_recording.video.fps, temporal_dim=0
    )

# Option 2 -> holds data in memory, write to some storage later (useful if you're going to use the Lhotse Shar format):
video_cut = video_cut.attach_tensor(
    "video_features", video_features, frame_shift=1.0 / video_recording.video.fps, temporal_dim=0
)

If you save the final video_cut, you can then later load video_features with cut.load_video_features() and access the manifest via cut.video_features (special field and method are auto-added for custom fields registered via attach_tensor). You can compute many different features and attach all of them under different names.
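
For concreteness, a minimal sketch of that save/reload round trip (the file name is just an example):

from lhotse import CutSet

# Save the cut, including the manifest entry for the attached "video_features" field.
CutSet.from_cuts([video_cut]).to_file("video_cuts.jsonl.gz")

# Later, e.g. in the training code:
cuts = CutSet.from_file("video_cuts.jsonl.gz")
for cut in cuts:
    feats = cut.load_video_features()  # auto-generated loader for the custom field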

What format should be used to save the extracted features (i.e., has_features) so that they can be loaded later for training?

I would use one of the numpy format writers in lhotse (e.g. NumpyHdf5Writer in the example above). Don't use lilcom unless you are sure it makes sense (it is a lossy format optimized for log-domain features). You may also want to explore the lhotse Shar format, which I think should work with video recordings (and definitely works with video features extracted as above). It is better optimized for I/O, which might help you process large video data in training.

That said, video features would likely require better compression for very large datasets, which is something we can explore later.
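
If you do try Shar, the export could look roughly like this; this is written from memory, so double-check the exact arguments against the Shar tutorial:

from lhotse import CutSet

cuts = CutSet.from_file("video_cuts.jsonl.gz")
# Write the cuts plus the attached "video_features" arrays into sharded tar files.
cuts.to_shar("data/shar", fields={"video_features": "numpy"}, shard_size=500)

# During training, read the shards back lazily:
cuts = CutSet.from_shar(in_dir="data/shar")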

The frames per second are not always fixed; they vary between 24 fps and 50 fps. How can I deal with that?

You can access the fps via recording.video.fps or cut.video.fps. If you want to resample the video, you have two options: 1) load the whole thing (or a cut of a given duration) and downsample/resample it in Python; 2) leverage torchaudio's ffmpeg bindings to resample the video (you might need to check out their tutorials to learn how to pass specific ffmpeg transform commands and find a way to expose/add it in the AudioSource API; for reference, this code loads the video).
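
To illustrate option 1, a simple sketch that resamples loaded frames to a fixed rate by nearest-frame selection (the helper name and target_fps are my own choices):

import numpy as np

def resample_frames(video_frames: np.ndarray, src_fps: float, target_fps: float) -> np.ndarray:
    # video_frames has shape (T, C, H, W); pick the nearest source frame for each target timestamp.
    num_src = video_frames.shape[0]
    num_tgt = int(round(num_src * target_fps / src_fps))
    idx = np.minimum(np.round(np.arange(num_tgt) * src_fps / target_fps).astype(int), num_src - 1)
    return video_frames[idx]

frames_25 = resample_frames(video_cut.load_video(), src_fps=video_cut.video.fps, target_fps=25.0)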

One final comment: the Recording manifest doesn't support custom fields, so you'd be better off moving the feature_path key into the supervision as {..., "custom": {"feature_path": ...}}.
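
For example, a supervision carrying the feature path could look like this (a sketch using the values from your manifest above):

from lhotse import SupervisionSegment

sup = SupervisionSegment(
    id="-fZc293MpJk_0-1-rgb_front",
    recording_id="-fZc293MpJk_0-1-rgb_front",
    start=0.0,
    duration=6.53,
    channel=0,
    text="hi",
    speaker="-fZc293MpJk",
    custom={"feature_path": "/mnt/TB16/sign2text/train_SignModel/en/20_06_2024/data/original/raw_features/-fZc293MpJk_0-1-rgb_front.txt"},
)
# After loading the manifest, the path is available as sup.custom["feature_path"].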

kerolos commented 18 hours ago

Thank you very much @pzelasko for your reply and support.

I have used the first option (Option 1 -> save to some storage directly) to save the video features in an .h5 file.

I used the following Python code to read the videos listed in the manifests and save the extracted features into a single large .h5 file:

import argparse
import logging
import time
from pathlib import Path
import cv2
import numpy as np
import mediapipe as mp
from tqdm import tqdm
from lhotse import Recording, CutSet, load_manifest_lazy
from lhotse.features.io import NumpyHdf5Writer
import json

def read_video_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        logging.error(f"Failed to open video file, video path: {video_path}")
        return []

    video_frames = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        video_frames.append(frame_rgb)

    cap.release()
    return video_frames

def extract_features_from_video(video_path, holistic):
    start_time = time.time()
    video_frames = read_video_frames(video_path)
    read_time = time.time()
    logging.info(f"Time to read video frames: {read_time - start_time:.2f} seconds")

    if len(video_frames) == 0:
        return None
    # Total features per frame (feature_dim) = 33*4 + 468*3 + 21*3 + 21*3 = 1662
    keypoints = []
    for frame in video_frames:
        frame.flags.writeable = False
        results = holistic.process(frame)
        frame.flags.writeable = True

        pose = np.array([[res.x, res.y, res.z, res.visibility] for res in results.pose_landmarks.landmark]).flatten() if results.pose_landmarks else np.zeros(33*4)
        face = np.array([[res.x, res.y, res.z] for res in results.face_landmarks.landmark]).flatten() if results.face_landmarks else np.zeros(468*3)
        lh = np.array([[res.x, res.y, res.z] for res in results.left_hand_landmarks.landmark]).flatten() if results.left_hand_landmarks else np.zeros(21*3)
        rh = np.array([[res.x, res.y, res.z] for res in results.right_hand_landmarks.landmark]).flatten() if results.right_hand_landmarks else np.zeros(21*3)

        keypoints.append(np.concatenate([pose, face, lh, rh]))
    extract_time = time.time()
    logging.info(f"Time to extract features: {extract_time - read_time:.2f} seconds")

    return np.array(keypoints)

def preprocess_manifest(manifest_path, temp_path, remove_keys):
    with open(manifest_path, 'r') as infile, open(temp_path, 'w') as outfile:
        for line in infile:
            data = json.loads(line)
            for key in remove_keys:
                data.pop(key, None)
            outfile.write(json.dumps(data) + '\n')

def compute_sign_language_features(language, src_manifests, output_dir):
    src_dir = Path(src_manifests)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    recordings_manifest = src_dir / f'{language}_recordings_train.jsonl'
    supervisions_manifest = src_dir / f'{language}_supervisions_train.jsonl'

    if not recordings_manifest.exists():
        logging.error(f"Temporary recordings manifest not found: {recordings_manifest}")
        return

    if not supervisions_manifest.exists():
        logging.error(f"Temporary supervisions manifest not found: {supervisions_manifest}")
        return

    recordings_manifest = load_manifest_lazy(recordings_manifest)
    supervisions_manifest = load_manifest_lazy(supervisions_manifest)

    hdf5_path = output_dir / "video_features.h5"
    with NumpyHdf5Writer(hdf5_path) as writer, tqdm(total=len(recordings_manifest)) as pbar, mp.solutions.holistic.Holistic(
        static_image_mode=False, model_complexity=0, min_detection_confidence=0.5, min_tracking_confidence=0.5
    ) as holistic:
        for recording in recordings_manifest:
            try:
                video_recording = Recording.from_dict(recording.to_dict())
                video_cut = video_recording.to_cut()
                video_path = video_recording.sources[0].source
                logging.info(f"Loading video frames from {video_path}")

                video_features = extract_features_from_video(video_path, holistic)
                if video_features is None:
                    logging.error(f"Failed to load video frames for recording ID: {video_recording.id}, video path: {video_path}")
                    continue

                start_time = time.time()
                # frame_shift should be the time between consecutive frames in seconds; sampling_rate here holds the video fps.
                writer.store_array(video_cut.id, video_features, frame_shift=1.0 / video_recording.sampling_rate, temporal_dim=0)
                read_time = time.time()
                logging.info(f"Time to write a video in H5: {read_time - start_time:.2f} seconds")

            except Exception as e:
                logging.error(f"Error processing recording ID {recording.id}: {e}")
            pbar.update(1)

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--language", type=str, required=True, help="Language to process.")
    parser.add_argument("--src-manifests", type=str, required=True, help="Path to the source directory containing manifest files.")
    parser.add_argument("--output-dir", type=str, required=True, help="Path to the output directory to save the extracted features.")
    return parser.parse_args()

if __name__ == "__main__":
    formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
    logging.basicConfig(format=formatter, level=logging.INFO)

    args = get_args()
    logging.info(vars(args))
    compute_sign_language_features(language=args.language, src_manifests=args.src_manifests, output_dir=args.output_dir)

However, I ran into an issue when trying to read this .h5 file as a cut:

from lhotse.features.io import NumpyHdf5Reader

reader = NumpyHdf5Reader("data/mediapipe_raw/video_features.h5")

I used the following helper functions to verify the H5 file:

import logging

import h5py
from lhotse import CutSet, MonoCut

def inspect_hdf5_file(hdf5_path):
    with h5py.File(hdf5_path, 'r') as f:
        keys = list(f.keys())
        logging.info(f"Keys in HDF5 file: {keys}")
        return keys

def load_cuts_from_hdf5(hdf5_path, prefix=""):
    cuts = []
    with h5py.File(hdf5_path, 'r') as f:
        for key in f.keys():
            if prefix in key:
                data = f[key][:]
                logging.info(f"Loaded data for key {key}: shape={data.shape}")
                cut = MonoCut(id=key, start=0.0, duration=len(data) / 100.0, channel=0, features=None)
                cuts.append(cut)
    logging.info(f"Total cuts loaded: {len(cuts)}")
    return CutSet.from_cuts(cuts)

    # This is how I load the test cuts in my data module (a method of the datamodule class):
    @lru_cache()
    def test_cuts(self) -> CutSet:
        logging.info("About to get test cuts")
        # reader = NumpyHdf5Reader(self.args.manifest_dir / self.args.test_manifest)
        # return CutSet.from_hdf5(reader.hdf, prefix="test")
        hdf5_path = "/mnt/TB16/sign2text/train_SignModel/en/20_06_2024/data/mediapipe_raw/video_features.h5"
        keys = inspect_hdf5_file(hdf5_path)
        cuts = load_cuts_from_hdf5(hdf5_path, prefix="test")
        logging.info(f"Total test cuts: {len(cuts)}")
        return cuts

Error:

2024-07-04 22:12:03,466 INFO [train.py:1102] Training started
2024-07-04 22:12:03,471 INFO [train.py:1112] Device: cuda:0
2024-07-04 22:12:03,488 INFO [train.py:1124] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4c05309499a08454997adf500b56dcc629e35ae5', 'k2-git-date': 'Tue Jul 25 16:23:36 2023', 'lhotse-version': '1.24.0.dev+git.ddde5bd.clean', 'torch-version': '1.13.0+cu116', 'torch-cuda-available': True, 'torch-cuda-version': '11.6', 'python-version': '3.8', 'icefall-git-branch': None, 'icefall-git-sha1': None, 'icefall-git-date': None, 'icefall-path': '/home/kerolos/projects/asr/icefall', 'k2-path': '/home/kerolos/anaconda3/envs/icefall-run/lib/python3.8/site-packages/k2/__init__.py', 'lhotse-path': '/home/kerolos/anaconda3/envs/icefall-run/lib/python3.8/site-packages/lhotse/__init__.py', 'hostname': 'kerolos', 'IP address': '127.0.1.1'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('/mnt/TB16/sign2text/train_SignModel/en/20_06_2024/exp/models/model_zipformer'), 'bpe_model': '/mnt/TB16/sign2text/train_SignModel/en/20_06_2024//exp//lang/bpe.model', 'base_lr': 0.03, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'manifest_dir': PosixPath('/mnt/TB16/sign2text/train_SignModel/en/20_06_2024/data/mediapipe_raw'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'input_strategy': 'PrecomputedFeatures', 'train_manifest': 'video_features.h5', 'dev_manifest': 'video_features.h5', 'test_manifest': 'kaldi_cuts_test.jsonl.gz', 'blank_id': 0, 'vocab_size': 2000}
2024-07-04 22:12:03,505 INFO [train.py:1126] About to create model
2024-07-04 22:12:04,187 INFO [train.py:1130] Number of model parameters: 69187431
2024-07-04 22:12:07,527 INFO [signlang_datamodule.py:337] About to get train cuts
2024-07-04 22:12:07,532 INFO [signlang_datamodule.py:50] Keys in HDF5 file: ['-fZc293MpJk_0-1-rgb_front', '-fZc293MpJk_2-1-rgb_front', '-fZc293MpJk_3-1-rgb_front', '-fZc293MpJk_4-1-rgb_front', '-fZc293MpJk_5-1-rgb_front',

2024-07-04 22:12:07,542 INFO [signlang_datamodule.py:62] Total cuts loaded: 0
2024-07-04 22:12:07,543 INFO [signlang_datamodule.py:343] Total train cuts: 0
2024-07-04 22:12:07,543 INFO [signlang_datamodule.py:234] About to create train dataset
2024-07-04 22:12:07,543 INFO [signlang_datamodule.py:242] Using DynamicBucketingSampler.
/home/kerolos/anaconda3/envs/icefall-run/lib/python3.8/site-packages/lhotse/dataset/sampling/dynamic_bucketing.py:136: UserWarning: You are using DynamicBucketingSampler with an eagerly read CutSet. You won't see any memory/speed benefits with that setup. Either use 'CutSet.from_jsonl_lazy' to read the CutSet lazily, or use a BucketingSampler instead.
  warnings.warn(
Traceback (most recent call last):
  File "./zipformer/train.py", line 1386, in <module>
    main()
  File "./zipformer/train.py", line 1379, in main
    run(rank=0, world_size=1, args=args)
  File "./zipformer/train.py", line 1226, in run
    train_dl = signData.train_dataloaders(
  File "SignRcg/zipformer/signlang_datamodule.py", line 243, in train_dataloaders
    train_sampler = DynamicBucketingSampler(
  File "anaconda3/envs/icefall-run/lib/python3.8/site-packages/lhotse/dataset/sampling/dynamic_bucketing.py", line 181, in __init__
    self.duration_bins = estimate_duration_buckets(
  File "anaconda3/envs/icefall-run/lib/python3.8/site-packages/lhotse/dataset/sampling/dynamic_bucketing.py", line 323, in estimate_duration_buckets
    assert num_buckets <= sizes.shape[0], (
AssertionError: The number of buckets (30) must be smaller than or equal to the number of cuts (0).

I would also like to add the reference text from supervisions.jsonl to the H5 data. How can I save a part of it in a human-readable format? Also, the total feature vector per frame is 1662; shall I change temporal_dim accordingly in this call?

writer.store_array(video_cut.id, video_features, frame_shift=1.0 / video_recording.sampling_rate, temporal_dim=0)

Could you please provide guidance on how to properly read this .h5 file as a cut and resolve the issue?

Thanks in advance