Open kerolos opened 1 week ago
Hi @kerolos, thanks for opening this discussion! I can help you get your video recipe set up.
Is there a plan to add a recipe that supports video data in Lhotse?
There is currently one AV recipe, for the GRID AV corpus (https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/grid.py). In general, lhotse recipes download datasets and prepare their manifests, but actual training is out of lhotse's scope. You may want to set up a separate repository with your experiment's code that imports lhotse.
How can I start using customized features, for example, using MediaPipe Pose Estimation tools?
Once you create a recording, you can load the video, process it with some module, and save + attach as a custom field to the cut. For example:
from lhotse import Recording
from lhotse.features.io import NumpyHdf5Writer

video_recording = Recording.from_file("/path/to/-fZc293MpJk_0-1-rgb_front.mp4")  # lhotse will auto-construct video recording manifest
video_cut = video_recording.to_cut()
video_frames = video_cut.load_video()  # video frames is a uint8 np.array with shape (T, C, H, W) [or some other permutation, I don't remember off the top of my head]
video_features = compute_some_features(video_frames)  # placeholder for your own extractor; video_features is np.array with arbitrary shape
# Option 1 -> save to some storage directly
# temporal_dim indicates which dimension in video_features shape corresponds to time; set accordingly.
# frame_shift is the time in seconds between consecutive frames, i.e. 1 / fps.
with NumpyHdf5Writer("video_features.h5") as writer:
    video_cut.video_features = writer.store_array(video_cut.id, video_features, frame_shift=1 / video_recording.video.fps, temporal_dim=0)
# Option 2 -> holds data in memory, write to some storage later (useful if you're going to use Lhotse Shar format):
video_cut = video_cut.attach_tensor("video_features", video_features, frame_shift=1 / video_recording.video.fps, temporal_dim=0)
If you save the final video_cut, you can then later load video_features with cut.load_video_features() and access the manifest via cut.video_features (special field and method are auto-added for custom fields registered via attach_tensor). You can compute many different features and attach all of them under different names.
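A note on `frame_shift`: in Lhotse it is the time in seconds between consecutive frames, so for per-frame video features it would be the reciprocal of the frame rate rather than the frame rate itself. Below is a minimal plain-Python sketch of that relationship; both helper names are hypothetical and not part of Lhotse.

```python
def frame_shift_from_fps(fps: float) -> float:
    """Return the time in seconds between two consecutive frames."""
    if fps <= 0:
        raise ValueError(f"fps must be positive, got {fps}")
    return 1.0 / fps


def duration_from_num_frames(num_frames: int, fps: float) -> float:
    """Return the duration in seconds covered by num_frames frames."""
    return num_frames * frame_shift_from_fps(fps)


# Example: a 25 fps video has one frame every 0.04 s,
# and 250 feature frames span roughly 10 s.
shift = frame_shift_from_fps(25.0)
duration = duration_from_num_frames(250, 25.0)
```

With a frame rate that varies between recordings, the value would be computed per recording from its own fps.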
What format should be used to save the extracted features (i.e. has_features) so that they can be loaded later for training?
I would use one of the numpy format writers in lhotse (e.g. NumpyHdf5Writer in the example above). Don't use lilcom unless you are sure it makes sense (it is a lossy format optimized for log-domain features). You may also want to explore the lhotse shar format, which I think should work with video recordings (and definitely works with video features extracted as above). It is better optimized for I/O, which might help you process large video data in training.
That said, video features would likely require better compression for very large datasets, which is something we can explore later.
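To give a rough sense of why compression may matter, here is a back-of-the-envelope size estimate for uncompressed features. It assumes the 1662-dimensional MediaPipe Holistic vector used later in this thread, a frame rate of 25 fps, and dense storage with no container overhead; the helper name is hypothetical.

```python
def feature_storage_bytes(num_frames: int, feature_dim: int, bytes_per_value: int) -> int:
    """Size in bytes of a dense (num_frames, feature_dim) array."""
    return num_frames * feature_dim * bytes_per_value


FEATURE_DIM = 1662            # 33*4 + 468*3 + 21*3 + 21*3
FRAMES_PER_HOUR = 25 * 3600   # assumption: 25 fps

float64_hour = feature_storage_bytes(FRAMES_PER_HOUR, FEATURE_DIM, 8)
float32_hour = feature_storage_bytes(FRAMES_PER_HOUR, FEATURE_DIM, 4)

print(f"float64: {float64_hour / 1e9:.2f} GB per hour of video")  # float64: 1.20 GB per hour of video
print(f"float32: {float32_hour / 1e9:.2f} GB per hour of video")  # float32: 0.60 GB per hour of video
```

Under these assumptions, casting to float32 before writing would halve the footprint, at the cost of some precision.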
Also, the frame rate is not fixed: it varies between 24 fps and 50 fps. How can I deal with that?
You can access the fps via recording.video.fps or cut.video.fps. If you want to resample the video, you have two options: 1) load the whole thing (or a cut of a given duration) and then downsample/resample it in Python; 2) leverage torchaudio ffmpeg bindings to resample the video (you might need to check out their tutorials to learn how to pass specific ffmpeg transform commands, and find a way to expose/add it in the AudioSource API). For reference, this code loads the video.
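For option 1, one simple approach is nearest-frame selection: for each output timestamp at the target rate, pick the source frame closest in time. The sketch below is plain Python and only computes the indices; the function name is hypothetical, and nearest-frame selection is one choice among several (it drops or repeats frames and does not interpolate).

```python
from typing import List


def resample_frame_indices(num_frames: int, src_fps: float, dst_fps: float) -> List[int]:
    """Return indices of source frames to keep so the result plays at dst_fps."""
    if num_frames <= 0:
        return []
    if src_fps <= 0 or dst_fps <= 0:
        raise ValueError("frame rates must be positive")
    duration = num_frames / src_fps
    num_out = max(1, int(duration * dst_fps + 0.5))
    indices = []
    for i in range(num_out):
        t = i / dst_fps                    # timestamp of the i-th output frame
        src_idx = int(t * src_fps + 0.5)   # nearest source frame
        indices.append(min(num_frames - 1, src_idx))
    return indices


# Example: one second of 50 fps video resampled to 25 fps keeps every other frame.
kept = resample_frame_indices(50, src_fps=50.0, dst_fps=25.0)
```

If the frames are a numpy array with time as the first dimension, `frames[kept]` would then select the resampled frames.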
Final comment: the Recording manifest doesn't support custom fields, so you'd be better off moving the feature_path key to the supervision as {..., "custom": {"feature_path": ...}}
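A minimal sketch of that move, operating on already-parsed manifest entries (plain dicts, e.g. one per JSONL line). It assumes each recording entry has an `id` and each supervision entry a `recording_id`; the function name is hypothetical and the dicts are modified in place.

```python
from typing import Dict, List


def move_key_to_supervisions(
    recordings: List[Dict], supervisions: List[Dict], key: str = "feature_path"
) -> None:
    """Pop `key` from each recording and store it under `custom` in matching supervisions."""
    values = {}
    for rec in recordings:
        if key in rec:
            values[rec["id"]] = rec.pop(key)
    for sup in supervisions:
        rec_id = sup.get("recording_id")
        if rec_id in values:
            custom = sup.get("custom") or {}
            custom[key] = values[rec_id]
            sup["custom"] = custom


# Example with made-up entries:
recs = [{"id": "rec1", "feature_path": "feats/rec1.npy"}]
sups = [{"id": "sup1", "recording_id": "rec1", "text": "hello"}]
move_key_to_supervisions(recs, sups)
```

After this, the recording entries no longer carry the extra key, so they should load without a separate step that strips unsupported keys.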
Thank you very much @pzelasko for your reply and support.
I have used the first option (Option 1 -> save to some storage directly) to save video features in a .h5 file.
import argparse
import logging
import time
from pathlib import Path
import cv2
import numpy as np
import mediapipe as mp
from tqdm import tqdm
from lhotse import Recording, CutSet, load_manifest_lazy
from lhotse.features.io import NumpyHdf5Writer
import json


def read_video_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        logging.error(f"Failed to open video file, video path: {video_path}")
        return []
    video_frames = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        video_frames.append(frame_rgb)
    cap.release()
    return video_frames


def extract_features_from_video(video_path, holistic):
    start_time = time.time()
    video_frames = read_video_frames(video_path)
    read_time = time.time()
    logging.info(f"Time to read video frames: {read_time - start_time:.2f} seconds")
    if len(video_frames) == 0:
        return None
    # Total features per frame (feature_dim) = 33*4 + 468*3 + 21*3 + 21*3 = 1662
    keypoints = []
    for frame in video_frames:
        frame.flags.writeable = False
        results = holistic.process(frame)
        frame.flags.writeable = True
        pose = np.array([[res.x, res.y, res.z, res.visibility] for res in results.pose_landmarks.landmark]).flatten() if results.pose_landmarks else np.zeros(33*4)
        face = np.array([[res.x, res.y, res.z] for res in results.face_landmarks.landmark]).flatten() if results.face_landmarks else np.zeros(468*3)
        lh = np.array([[res.x, res.y, res.z] for res in results.left_hand_landmarks.landmark]).flatten() if results.left_hand_landmarks else np.zeros(21*3)
        rh = np.array([[res.x, res.y, res.z] for res in results.right_hand_landmarks.landmark]).flatten() if results.right_hand_landmarks else np.zeros(21*3)
        keypoints.append(np.concatenate([pose, face, lh, rh]))
    extract_time = time.time()
    logging.info(f"Time to extract features: {extract_time - read_time:.2f} seconds")
    return np.array(keypoints)


def preprocess_manifest(manifest_path, temp_path, remove_keys):
    with open(manifest_path, 'r') as infile, open(temp_path, 'w') as outfile:
        for line in infile:
            data = json.loads(line)
            for key in remove_keys:
                data.pop(key, None)
            outfile.write(json.dumps(data) + '\n')


def compute_sign_language_features(language, src_manifests, output_dir):
    src_dir = Path(src_manifests)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    recordings_manifest = src_dir / f'{language}_recordings_train.jsonl'
    supervisions_manifest = src_dir / f'{language}_supervisions_train.jsonl'
    if not recordings_manifest.exists():
        logging.error(f"Temporary recordings manifest not found: {recordings_manifest}")
        return
    if not supervisions_manifest.exists():
        logging.error(f"Temporary supervisions manifest not found: {supervisions_manifest}")
        return
    recordings_manifest = load_manifest_lazy(recordings_manifest)
    supervisions_manifest = load_manifest_lazy(supervisions_manifest)
    hdf5_path = output_dir / "video_features.h5"
    with NumpyHdf5Writer(hdf5_path) as writer, tqdm(total=len(recordings_manifest)) as pbar, mp.solutions.holistic.Holistic(static_image_mode=False, model_complexity=0, min_detection_confidence=0.5, min_tracking_confidence=0.5) as holistic:
        for recording in recordings_manifest:
            try:
                video_recording = Recording.from_dict(recording.to_dict())
                video_cut = video_recording.to_cut()
                video_path = video_recording.sources[0].source
                logging.info(f"Loading video frames from {video_path}")
                video_features = extract_features_from_video(video_path, holistic)
                if video_features is None:
                    logging.error(f"Failed to load video frames for recording ID: {video_recording.id}, video path: {video_path}")
                    continue
                start_time = time.time()
                writer.store_array(video_cut.id, video_features, frame_shift=video_recording.sampling_rate, temporal_dim=0)
                read_time = time.time()
                logging.info(f"Time to write a video in H5: {read_time - start_time:.2f} seconds")
            except Exception as e:
                logging.error(f"Error processing recording ID {recording.id}: {e}")
            pbar.update(1)


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--language", type=str, required=True, help="Language to process.")
    parser.add_argument("--src-manifests", type=str, required=True, help="Path to the source directory containing manifest files.")
    parser.add_argument("--output-dir", type=str, required=True, help="Path to the output directory to save the extracted features.")
    return parser.parse_args()


if __name__ == "__main__":
    formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
    logging.basicConfig(format=formatter, level=logging.INFO)
    args = get_args()
    logging.info(vars(args))
    compute_sign_language_features(language=args.language, src_manifests=args.src_manifests, output_dir=args.output_dir)
reader = NumpyHdf5Reader("data/mediapipe_raw/video_features.h5")
import h5py
from lhotse import CutSet, MonoCut


def inspect_hdf5_file(hdf5_path):
    with h5py.File(hdf5_path, 'r') as f:
        keys = list(f.keys())
        logging.info(f"Keys in HDF5 file: {keys}")
        return keys


def load_cuts_from_hdf5(hdf5_path, prefix=""):
    cuts = []
    with h5py.File(hdf5_path, 'r') as f:
        for key in f.keys():
            if prefix in key:
                data = f[key][:]
                logging.info(f"Loaded data for key {key}: shape={data.shape}")
                cut = MonoCut(id=key, start=0.0, duration=len(data) / 100.0, channel=0, features=None)
                cuts.append(cut)
    logging.info(f"Total cuts loaded: {len(cuts)}")
    return CutSet.from_cuts(cuts)


# I loaded like that
@lru_cache()
def test_cuts(self) -> CutSet:
    logging.info("About to get test cuts")
    #reader = NumpyHdf5Reader(self.args.manifest_dir / self.args.test_manifest)
    #return CutSet.from_hdf5(reader.hdf, prefix="test")
    hdf5_path = "/mnt/TB16/sign2text/train_SignModel/en/20_06_2024/data/mediapipe_raw/video_features.h5"
    keys = inspect_hdf5_file(hdf5_path)
    cuts = load_cuts_from_hdf5(hdf5_path, prefix="test")
    logging.info(f"Total test cuts: {len(cuts)}")
    return cuts
2024-07-04 22:12:03,466 INFO [train.py:1102] Training started
2024-07-04 22:12:03,471 INFO [train.py:1112] Device: cuda:0
2024-07-04 22:12:03,488 INFO [train.py:1124] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4c05309499a08454997adf500b56dcc629e35ae5', 'k2-git-date': 'Tue Jul 25 16:23:36 2023', 'lhotse-version': '1.24.0.dev+git.ddde5bd.clean', 'torch-version': '1.13.0+cu116', 'torch-cuda-available': True, 'torch-cuda-version': '11.6', 'python-version': '3.8', 'icefall-git-branch': None, 'icefall-git-sha1': None, 'icefall-git-date': None, 'icefall-path': '/home/kerolos/projects/asr/icefall', 'k2-path': '/home/kerolos/anaconda3/envs/icefall-run/lib/python3.8/site-packages/k2/__init__.py', 'lhotse-path': '/home/kerolos/anaconda3/envs/icefall-run/lib/python3.8/site-packages/lhotse/__init__.py', 'hostname': 'kerolos', 'IP address': '127.0.1.1'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('/mnt/TB16/sign2text/train_SignModel/en/20_06_2024/exp/models/model_zipformer'), 'bpe_model': '/mnt/TB16/sign2text/train_SignModel/en/20_06_2024//exp//lang/bpe.model', 'base_lr': 0.03, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': 
'192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'manifest_dir': PosixPath('/mnt/TB16/sign2text/train_SignModel/en/20_06_2024/data/mediapipe_raw'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'input_strategy': 'PrecomputedFeatures', 'train_manifest': 'video_features.h5', 'dev_manifest': 'video_features.h5', 'test_manifest': 'kaldi_cuts_test.jsonl.gz', 'blank_id': 0, 'vocab_size': 2000}
2024-07-04 22:12:03,505 INFO [train.py:1126] About to create model
2024-07-04 22:12:04,187 INFO [train.py:1130] Number of model parameters: 69187431
2024-07-04 22:12:07,527 INFO [signlang_datamodule.py:337] About to get train cuts
2024-07-04 22:12:07,532 INFO [signlang_datamodule.py:50] Keys in HDF5 file: ['-fZc293MpJk_0-1-rgb_front', '-fZc293MpJk_2-1-rgb_front', '-fZc293MpJk_3-1-rgb_front', '-fZc293MpJk_4-1-rgb_front', '-fZc293MpJk_5-1-rgb_front',
2024-07-04 22:12:07,542 INFO [signlang_datamodule.py:62] Total cuts loaded: 0
2024-07-04 22:12:07,543 INFO [signlang_datamodule.py:343] Total train cuts: 0
2024-07-04 22:12:07,543 INFO [signlang_datamodule.py:234] About to create train dataset
2024-07-04 22:12:07,543 INFO [signlang_datamodule.py:242] Using DynamicBucketingSampler.
/home/kerolos/anaconda3/envs/icefall-run/lib/python3.8/site-packages/lhotse/dataset/sampling/dynamic_bucketing.py:136: UserWarning: You are using DynamicBucketingSampler with an eagerly read CutSet. You won't see any memory/speed benefits with that setup. Either use 'CutSet.from_jsonl_lazy' to read the CutSet lazily, or use a BucketingSampler instead.
  warnings.warn(
Traceback (most recent call last):
  File "./zipformer/train.py", line 1386, in <module>
    main()
  File "./zipformer/train.py", line 1379, in main
    run(rank=0, world_size=1, args=args)
  File "./zipformer/train.py", line 1226, in run
    train_dl = signData.train_dataloaders(
  File "SignRcg/zipformer/signlang_datamodule.py", line 243, in train_dataloaders
    train_sampler = DynamicBucketingSampler(
  File "anaconda3/envs/icefall-run/lib/python3.8/site-packages/lhotse/dataset/sampling/dynamic_bucketing.py", line 181, in __init__
    self.duration_bins = estimate_duration_buckets(
  File "anaconda3/envs/icefall-run/lib/python3.8/site-packages/lhotse/dataset/sampling/dynamic_bucketing.py", line 323, in estimate_duration_buckets
    assert num_buckets <= sizes.shape[0], (
AssertionError: The number of buckets (30) must be smaller than or equal to the number of cuts (0).
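One possible reading of the log above: the HDF5 keys are recording IDs such as `-fZc293MpJk_0-1-rgb_front` and contain no split name, so a substring filter on a split name like `test` would match nothing, and the sampler then receives an empty CutSet. The sketch below contrasts that filter with selecting keys by membership in a set of IDs for the split. It is plain Python; the function names are hypothetical, and it assumes the split's IDs are available from elsewhere (for example, from that split's recordings manifest).

```python
from typing import Iterable, List


def select_keys_by_substring(hdf5_keys: Iterable[str], prefix: str) -> List[str]:
    """Mimic an `if prefix in key` filter."""
    return [key for key in hdf5_keys if prefix in key]


def select_keys_by_ids(hdf5_keys: Iterable[str], split_ids: Iterable[str]) -> List[str]:
    """Keep only the keys whose ID belongs to the given split."""
    wanted = set(split_ids)
    return [key for key in hdf5_keys if key in wanted]


# Keys copied from the log line above.
keys = ["-fZc293MpJk_0-1-rgb_front", "-fZc293MpJk_2-1-rgb_front", "-fZc293MpJk_3-1-rgb_front"]

no_match = select_keys_by_substring(keys, "test")                  # nothing contains "test"
by_ids = select_keys_by_ids(keys, ["-fZc293MpJk_2-1-rgb_front"])   # one key selected
```

Separately, the hard-coded `len(data) / 100.0` duration assumes 100 frames per second, which would not match video at 24 to 50 fps; dividing the frame count by each recording's own fps would give the duration in seconds.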
writer.store_array(video_cut.id, video_features, frame_shift=video_recording.sampling_rate, temporal_dim=0)
Thanks in advance
Extend Lhotse to support video features for tasks such as sign language recognition (e.g., How2Sign) and human activity recognition. This enhancement will be useful for the Icefall platform.
Details
With the recent support for video in PR #1151, I am interested in developing a new recipe to handle video data and extract features using tools like MediaPipe.
Objectives
Recipe Addition: lhotse/recipes directory.
Feature Extraction:
Implementation Steps
Create Manifest Files:
Recordings manifest (recordings.jsonl):
Supervisions manifest (supervisions.jsonl):
Feature Extraction Script:
Create a script compute_features_sign_language.py:
:Questions
I would appreciate any guidance or support on implementing this feature and using it within the Icefall platform, @pzelasko.
Thank you!