lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Question about multiple separate Supervision objects per segment/cut #995

Open danpovey opened 1 year ago

danpovey commented 1 year ago

Hey, we want to get into building models that can simultaneously handle different styles of text. For this we will have multiple supervisions covering the exact same segment, which should be viewed as alternatives. What is the best way to do this? Is it better to duplicate the cut-id and have the cuts overlap, with one Supervision per cut, or to have multiple Supervisions per cut with the same time boundaries? Bear in mind that this might involve quite large data, and we might want to process and randomize it mostly with jsonl files, probably treating the different Supervision objects separately.

pzelasko commented 1 year ago

How about using the custom field to hold all of the style variants? You can see how it could work with the following snippet:

import random

import torch.utils.data

from lhotse import SupervisionSegment, MonoCut, CutSet
from lhotse.dataset import DynamicCutSampler
from lhotse.testing.dummies import dummy_recording

supervision = SupervisionSegment(
    id="id",
    recording_id="rec_id",
    start=0,
    duration=5,
    # you can actually leave 'text' field empty, Dataset will read directly from the custom field
    # text=None,
    custom={
        "styled_texts": [
            "style variant one",
            "Style variant #1.",
            "STYLE VARIANT ONE",
            "Style VARIANT 1",
        ]
    },
)

cut = MonoCut(
    "cut_id",
    start=0,
    duration=5,
    channel=0,
    supervisions=[supervision],
    recording=dummy_recording(unique_id=0, duration=5, with_data=True),
)

class MultiTextASRDataset(torch.utils.data.Dataset):
    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def __getitem__(self, cuts: CutSet) -> dict:
        # load audio/features, augment data, etc.

        texts = [self.rng.choice(c.supervisions[0].styled_texts) for c in cuts]

        return {
            # ... features, etc.
            "texts": texts
        }

cuts = CutSet.from_cuts([cut]).repeat()
sampler = DynamicCutSampler(cuts, max_cuts=4)
dloader = torch.utils.data.DataLoader(
    MultiTextASRDataset(), sampler=sampler, batch_size=None
)

for batch in dloader:
    print(batch)
    break

Output:

{'texts': ['Style VARIANT 1', 'Style VARIANT 1', 'style variant one', 'STYLE VARIANT ONE']}

danpovey commented 1 year ago

@pzelasko I am a little confused about the example output above: {'texts': ['Style VARIANT 1', 'Style VARIANT 1', 'style variant one', 'STYLE VARIANT ONE']}. Shouldn't it have just one string for the 'texts' value? [Oh, I think I got it: __getitem__ is given the whole batch of items and prepares them as a minibatch, and we've done repeat()...]
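
To make the batching behaviour concrete, here is a minimal plain-Python sketch of the same flow. It does not use Lhotse or PyTorch: the dict-based `cut`, the `batches` helper, and `MultiTextDataset` are hypothetical stand-ins for the cut, the sampler, and the dataset in the snippet above, assuming the sampler hands `__getitem__` a whole batch of (repeated) cuts at once.

```python
import itertools
import random

# Hypothetical plain-dict stand-in for a cut whose supervision
# carries several alternative text styles.
cut = {
    "id": "cut_id",
    "styled_texts": [
        "style variant one",
        "Style variant #1.",
        "STYLE VARIANT ONE",
        "Style VARIANT 1",
    ],
}


def batches(cuts, max_cuts):
    """Yield lists of up to `max_cuts` cuts, mimicking a batch sampler."""
    it = iter(cuts)
    while True:
        batch = list(itertools.islice(it, max_cuts))
        if not batch:
            return
        yield batch


class MultiTextDataset:
    """Maps a whole batch of cuts to one dict, picking one style per cut."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def __getitem__(self, batch):
        # One randomly chosen variant per cut in the batch.
        return {"texts": [self.rng.choice(c["styled_texts"]) for c in batch]}


dataset = MultiTextDataset(seed=0)
repeated = itertools.repeat(cut)  # stand-in for repeating the single cut forever
first_batch = next(batches(repeated, max_cuts=4))
result = dataset[first_batch]
print(result)
```

Because the single cut is repeated and the batch size is capped at four, the returned 'texts' list holds four strings, one independently sampled variant per copy of the cut, rather than a single string.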