lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Array-like features extraction #534

Open marcinwitkowski opened 2 years ago

marcinwitkowski commented 2 years ago

I'd like to make a CutSet with "Functional" features from OpenSmile (i.e. single vectors, not matrices, independent of the signal duration) with the following procedure:

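from lhotse import CutSet, LilcomHdf5Writer
from lhotse.features.opensmile import OpenSmileConfig, OpenSmileExtractor
from lhotse.recipes import download_yesno, prepare_yesno

download_yesno(...)
manifests = prepare_yesno(corpus_dir='waves_yesno')
# "func" requests Functionals: one vector per recording, not one per frame.
config = OpenSmileConfig(feature_level="func")
feature_extractor = OpenSmileExtractor(config=config)
for partition, m in manifests.items():
    cut_set = CutSet.from_manifests(
        recordings=m["recordings"],
        supervisions=m["supervisions"],
    )
    # This call fails in validate_features() with the error shown below.
    cut_set = cut_set.compute_and_store_features(
        extractor=feature_extractor,
        storage_path=f"feats_{partition}",
        storage_type=LilcomHdf5Writer,
    )
    cut_set.to_json(f"cuts_{partition}.json.gz")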

It required small changes in frame_shift(). You can run the code here: https://colab.research.google.com/drive/1D87TdZl0Bvgl15d6g2hOqTb9NydeV_aJ?usp=sharing Unfortunately, the code currently fails: CutSet.compute_and_store_features() calls FeatureExtractor.extract_from_recording_and_store(), which runs validate_features(), which in turn raises the error

AssertionError: Features: manifest is inconsistent: declared num_frames is 1, but duration (6.35s) / frame_shift (0.01s) results in 635 frames. If you're using a custom feature extractor, you might need to ensure that it preserves this relationship between duration, frame_shift and num_frames (use rounding up if needed - see lhotse.utils.compute_num_frames).

because Functional features consist of only a single vector.
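To make the failing check concrete, here is a minimal sketch of the relationship the error message describes. It is a simplified stand-in written for illustration, not Lhotse's own code: the names expected_num_frames and is_consistent are hypothetical, and the real lhotse.utils.compute_num_frames may round differently and take other arguments.

```python
from decimal import ROUND_HALF_UP, Decimal


def expected_num_frames(duration: float, frame_shift: float) -> int:
    """Number of frames implied by duration / frame_shift (hypothetical helper).

    Decimal arithmetic avoids float artifacts such as 6.35 / 0.01 != 635.0.
    """
    ratio = Decimal(str(duration)) / Decimal(str(frame_shift))
    return int(ratio.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def is_consistent(num_frames: int, duration: float, frame_shift: float) -> bool:
    """True if the declared num_frames matches the implied frame count."""
    return num_frames == expected_num_frames(duration, frame_shift)


# Values taken from the error message above.
print(expected_num_frames(6.35, 0.01))  # 635
print(is_consistent(1, 6.35, 0.01))     # False: a single Functionals vector
```

A single-vector feature can only pass such a check if frame_shift equals the whole duration, which is why a frame-based Features manifest is an awkward fit here.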

A hack that disables feature validation works around the problem, but unfortunately it immediately arises again in validate_cuts during dataset creation.

This points to a more general problem. It is really a nomenclature issue: "Functional" (or any other array-like) features are not "Features" in the Lhotse sense, which requires a precise definition of frame_shift(). In my opinion, they fit the Array definition better. So is there (or is there a plan to implement) something like CutSet.compute_and_store_arrays() and/or an ArrayExtractor for all those types of features? Or maybe the Features class should be more general, to support embedding (i/x/d-vector) extraction? Or maybe there is some other way to extract and store array-like features?

The approach mentioned in https://github.com/lhotse-speech/lhotse/pull/504#issuecomment-985813091 , i.e. adding single-vector features as a custom field of a cut, is possible but, as far as I understand, requires computing the features beforehand. So it is quite an inconvenient solution to the problem, IMHO.

pzelasko commented 2 years ago

Array was built exactly for this purpose. I would suggest using a lower-level API like this:

with LilcomHdf5Writer('my/output/path') as writer:
    for cut in cut_set:
        audio = cut.load_audio()
        array = my_extractor_fn(audio)
        cut.opensmile_functionals = writer.store_array(cut.id, array)

then later when reading, do

for cut in cut_set:
    functional_feats = cut.load_opensmile_functionals()
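For illustration, my_extractor_fn in the snippet above could be any function that maps audio of arbitrary length to a fixed-size vector. The sketch below is a hypothetical stand-in that computes four simple statistics with NumPy. It is not openSMILE and not part of Lhotse, and it assumes the audio arrives as a (channels, samples) array that can be flattened, which mixes channels and is only sensible for mono input.

```python
import numpy as np


def my_extractor_fn(audio: np.ndarray) -> np.ndarray:
    """Hypothetical "functionals" extractor: one vector per recording.

    Returns [mean, std, min, max] of the samples, so the output shape
    does not depend on the signal duration.
    """
    samples = np.asarray(audio, dtype=np.float32).reshape(-1)  # flatten channels
    return np.array(
        [samples.mean(), samples.std(), samples.min(), samples.max()],
        dtype=np.float32,
    )


# Two synthetic "recordings" of different lengths give same-shaped outputs.
short_feats = my_extractor_fn(np.zeros((1, 8000)))
long_feats = my_extractor_fn(np.ones((1, 50800)))
print(short_feats.shape, long_feats.shape)
```

Because the output has no time axis, storing it as an Array (as suggested above) avoids the frame_shift and num_frames bookkeeping entirely.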