Hello, I have some features which come from codebooks and are int16 values. What is the best way to store them?

Also, if I already have a manifest file with Recordings and Supervisions and I want to extract features but not save a new manifest file, can I attach the existing features at runtime to the CutSet created from that manifest file? In general, what would be the best way to preprocess a dataset? The available manifests provided with lhotse include CutSets; in order to extract features and use them, do I have to create new CutSets that include the features? Can't I have separate CutSets with recordings+supervisions and separate files for features? Would it be better to always have separate recording manifests, supervision manifests, and feature manifests and just combine them during dataloading with the from_manifests function?
For storage you can use NumpyFilesWriter, something like:

from lhotse import NumpyFilesWriter

with NumpyFilesWriter(...) as w:  # pass the output directory (storage_path)
    for cut in cuts:
        array = extract_codebook(cut)
        cut.codebook = w.store_array(cut.id, array, temporal_dim=..., frame_shift=...)

Then, if you save with cuts.to_file(), the codebook manifest will be present in the cutset.
You can also do it shorter, without writing to disk:

cut = cut.attach_tensor("codebook", extract_codebook(cut), temporal_dim=..., frame_shift=...)

in which case everything is kept in memory.
If you want to keep everything in separate files, I suggest looking into the Lhotse Shar format, which allows that (the various fields are combined on the fly). This lets you have multiple versions of codebooks etc. and easily switch between them if you're experimenting with different models.
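For example, a minimal sketch of exporting and re-loading a cut set in Shar format, assuming lhotse's CutSet.to_shar/CutSet.from_shar API (the paths, field formats, and shard size here are illustrative):

from lhotse import CutSet

cuts = CutSet.from_file("libri_cuts.jsonl.gz")  # illustrative path

# Each field is written to its own sequence of tar shards, so the
# codebooks live in files separate from the audio and can be swapped out.
cuts.to_shar(
    "data/shar",
    fields={"recording": "wav", "codebook": "numpy"},
    shard_size=1000,
)

# When loading, lhotse recombines the per-field shards on the fly.
cuts = CutSet.from_shar(in_dir="data/shar")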
1) I would like to utilize the lhotse compute_and_store_features function in order to be compatible with other types of features, i.e. to interchange these codebooks with mels for other models, but with the same interface. I managed to pad the wav so that my own custom feature extractor passes lhotse's validation checks on the number of frames, so the features are stored (I used the HDF5 writer).

There I have a problem: during loading, lhotse converts the features back to float32. I made a custom class with the following line:

self.hdf.create_dataset(key, data=value, dtype=value.dtype)

so that the value is stored with its original dtype, but during loading with the PrecomputedFeatures strategy it loads them back as float32.
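For reference, a minimal sketch of such a writer, assuming it subclasses lhotse's NumpyHdf5Writer (the class name and registered name below are made up for illustration):

import numpy as np
from lhotse.features.io import NumpyHdf5Writer

class DtypePreservingHdf5Writer(NumpyHdf5Writer):  # hypothetical name
    name = "dtype_preserving_hdf5_writer"  # hypothetical registry key

    def write(self, key: str, value: np.ndarray) -> str:
        # Create the dataset with the array's own dtype (e.g. int16)
        # instead of letting it be cast to the default float type.
        self.hdf.create_dataset(key, data=value, dtype=value.dtype)
        return key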
2) Let's say I have downloaded the libri-tts CutSet files from the lhotse download scripts. Then I perform feature extraction with cuts = self.cuts.compute_and_store_features(...) and save the features to disk. Then I want to load the same libri-tts CutSet that I downloaded and attach the existing features. Is this possible, or must I save the new cuts that result from the above command?
> but during loading with the PrecomputedFeatures strategy it loads them back as float32.

You might want to replace PrecomputedFeatures with something like collate_matrices(c.load_features() for c in cuts); or modify PrecomputedFeatures to keep the original dtype (I'd be OK with a PR with this change).
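For instance, a minimal sketch of a dataset class taking that route (the class name is illustrative; collate_matrices is lhotse's padding collation helper, and torch.from_numpy keeps the stored dtype):

import torch
from lhotse import CutSet
from lhotse.dataset.collation import collate_matrices

class DtypePreservingDataset(torch.utils.data.Dataset):  # hypothetical
    def __getitem__(self, cuts: CutSet) -> dict:
        # c.load_features() returns each array exactly as stored on disk;
        # from_numpy preserves its dtype (e.g. int16) end to end.
        features = collate_matrices(
            torch.from_numpy(c.load_features()) for c in cuts
        )
        return {"features": features, "cuts": cuts}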
> cuts = self.cuts.compute_and_store_features and I save the features in the disk.

You can do FeatureSet(c.features for c in cuts).to_file("my_feats.jsonl.gz") and later:

from lhotse import CutSet, FeatureSet

class LazyFeatureAttacher:
    """Lazily zips a CutSet with a FeatureSet, attaching features on the fly."""

    def __init__(self, cuts, features):
        self.cuts = cuts
        self.features = features

    def __iter__(self):
        for c, f in zip(self.cuts, self.features):
            c.features = f
            yield c

cuts = CutSet.from_file(...)
features = FeatureSet.from_file(...)
cuts = CutSet(LazyFeatureAttacher(cuts, features))
Do you know if this last operation is possible with lazy CutSets and FeatureSets? E.g. I have the libri cutsets that don't have features; then I save the features as FeatureSets, and then I want to combine them, but lazily, in order to use a DynamicBucketingSampler. I could save the cuts that result from compute_and_store_features, but this is not very optimal, because if e.g. the front-end changes, then I would have to recompute every feature for just some changes in the corpus.
Yes this would work with lazy datasets.
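For example, a minimal sketch feeding the lazily-attached cuts to lhotse's DynamicBucketingSampler (the file paths are illustrative, and LazyFeatureAttacher is the helper sketched above):

from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet(LazyFeatureAttacher(
    CutSet.from_file("libri_cuts.jsonl.gz"),
    FeatureSet.from_file("my_feats.jsonl.gz"),
))
sampler = DynamicBucketingSampler(cuts, max_duration=200.0, shuffle=True)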
I think unfortunately this won't work, because if you have filters in the CutSet, or the cuts are saved in a different order, then the attached features correspond to a different utterance... So the only way to be sure is to save the CutSet that results from the feature extraction, I guess?
Make sure the feature set is sorted according to the cut set and apply any filter only after you attach the features.
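Concretely, a sketch of that ordering discipline, reusing the LazyFeatureAttacher from above (base_cuts/feature_set are stand-ins for manifests written in the same order; the duration filter is illustrative):

# Attach first, while both manifests are still in matching order...
cuts = CutSet(LazyFeatureAttacher(base_cuts, feature_set))
# ...and only filter afterwards, so cut/feature pairs stay in sync.
cuts = cuts.filter(lambda c: c.duration <= 20.0)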