Open marcinwitkowski opened 2 years ago
Array was built exactly for this purpose. I would suggest using a lower-level API like this:
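from lhotse.features.io import LilcomHdf5Writer

with LilcomHdf5Writer('my/output/path') as writer:
    for cut in cut_set:
        audio = cut.load_audio()
        # my_extractor_fn is your own function returning a single vector
        array = my_extractor_fn(audio)
        # store_array returns a manifest that is attached as a custom field
        cut.opensmile_functionals = writer.store_array(cut.id, array)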
Then later, when reading, do:
for cut in cut_set:
    functional_feats = cut.load_opensmile_functionals()
I'd like to make a CutSet with "Functional" features from OpenSmile (i.e. single vectors, not matrices, independent of the signal duration) with the following procedure:
It required small changes in frame_shift(). You can run the code here: https://colab.research.google.com/drive/1D87TdZl0Bvgl15d6g2hOqTb9NydeV_aJ?usp=sharing
Unfortunately, the code fails at the moment, since CutSet.compute_and_store_features() further calls FeatureExtractor.extract_from_recording_and_store(), which includes validate_features(), which eventually raises an error because Functional features contain only a single vector.
A hack that disables feature validation gets past this point, but unfortunately the same problem immediately comes up again in validate_cuts during dataset creation.
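To illustrate why a single vector trips this kind of check, here is a minimal sketch of a frame-count consistency test. It is not Lhotse's actual validation code: the function names, the tolerance, and the error message are assumptions made up for this example. It only shows the general idea that frame-based features are expected to have roughly duration / frame_shift frames.

```python
def expected_num_frames(duration: float, frame_shift: float) -> int:
    """Frame count a frame-based feature matrix would be expected to have."""
    return round(duration / frame_shift)


def check_frame_count(
    num_frames: int, duration: float, frame_shift: float, tolerance: int = 1
) -> None:
    """Hypothetical consistency check (not Lhotse's real implementation)."""
    expected = expected_num_frames(duration, frame_shift)
    if abs(num_frames - expected) > tolerance:
        raise ValueError(
            f"Expected about {expected} frames for duration={duration}s and "
            f"frame_shift={frame_shift}s, but got {num_frames}."
        )


# A regular frame-level feature matrix for a 10 s cut passes the check.
check_frame_count(num_frames=1000, duration=10.0, frame_shift=0.01)

# A "Functional" feature is a single vector, so the same check fails.
try:
    check_frame_count(num_frames=1, duration=10.0, frame_shift=0.01)
except ValueError as err:
    print(f"Validation failed: {err}")
```

Under this assumption, no frame_shift value short of the full cut duration makes a single vector look valid, which is why tweaking frame_shift() alone does not help.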
This points to a more general problem. In fact, it is a nomenclature issue: "Functional" (or any other array-like) features are not "Features" in Lhotse's terms, which require a precise definition of frame_shift(). In my opinion, they fit the Array definition better. So is there (or is there a plan to implement) something like CutSet.compute_and_store_arrays() and/or ArrayExtractor for all those types of features? Or maybe the Features class should be more general, to support embedding (i/x/d-vector) extraction? Or maybe there is some other way to extract and store array-like features?

The approach mentioned in https://github.com/lhotse-speech/lhotse/pull/504#issuecomment-985813091, i.e. adding single-vector features as a custom field of a cut, is possible but, as far as I understand, requires computing the features beforehand. So it is quite an inconvenient solution to the problem, IMHO.
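For illustration, a helper along the lines of the requested compute_and_store_arrays() could be sketched by wrapping the loop from the reply at the top of this thread. This is a hypothetical free function, not an existing Lhotse API. It only assumes that each cut exposes id and load_audio(), that custom attributes can be set on a cut, and that the writer exposes store_array(key, array), as in that reply. The name compute_and_store_arrays and the field parameter are invented for the example.

```python
from typing import Any, Callable, Iterable, List


def compute_and_store_arrays(
    cuts: Iterable[Any],
    extractor_fn: Callable[[Any], Any],
    writer: Any,
    field: str,
) -> List[Any]:
    """Hypothetical helper: extract one array per cut and attach it as a custom field.

    `writer` is assumed to provide `store_array(key, array)` and to return a
    manifest object describing where the array was stored.
    """
    processed = []
    for cut in cuts:
        audio = cut.load_audio()
        array = extractor_fn(audio)  # e.g. a single OpenSmile "Functionals" vector
        manifest = writer.store_array(cut.id, array)
        setattr(cut, field, manifest)  # same effect as `cut.<field> = manifest`
        processed.append(cut)
    return processed
```

With Lhotse, this would presumably be called inside a `with LilcomHdf5Writer(...) as writer:` block, passing `field="opensmile_functionals"`, so that the stored vectors can later be read back with `cut.load_opensmile_functionals()` as in the reply.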