Extending Lhotse dataloading to text/multimodal data

This PR adds very basic support for incorporating text-only data into Lhotse samplers, enabling text and multimodal dataloading. Highlights:

new ABC SamplingConstraint that generalizes TimeConstraint and makes it possible to define other constraints that decide when to stop sampling a mini-batch and how to measure the "size" of an example (e.g. duration for audio, number of tokens for text)
dynamic samplers have a new argument, constraint, that accepts SamplingConstraint instances directly
TokenConstraint, which is almost identical to TimeConstraint but uses num_tokens / max_tokens
very basic dataclass TextExample that wraps text/tokens; CutSet can yield those if you pass it a text iterator, e.g. CutSet(text_example_iter) (it's not super clean but it works; I'm trying to figure out if we can make this cleaner)
unit tests illustrating how to use this for text dataloading and even for mixed-modality dataloading (text and audio data together in a mini-batch)
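
To make the idea behind the constraint abstraction concrete, here is a rough standalone sketch. It does not import Lhotse and is not the code in this PR: the class names mirror the ones listed above, but the method names (measure_length, add, exceeded, reset), the whitespace fallback for counting tokens, and the batches helper are assumptions made for illustration only.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, List, Optional


class SamplingConstraint(ABC):
    """Illustrative stand-in: measures example 'size' and decides when a batch is full."""

    @abstractmethod
    def measure_length(self, example: Any) -> float:
        """Return the size of a single example (seconds, tokens, ...)."""

    @abstractmethod
    def add(self, example: Any) -> None:
        """Account for one more example in the current mini-batch."""

    @abstractmethod
    def exceeded(self) -> bool:
        """Return True once the current mini-batch is over the limit."""

    @abstractmethod
    def reset(self) -> None:
        """Start accounting for a new mini-batch."""


@dataclass
class TextExample:
    """Hypothetical minimal wrapper around text and optional token IDs."""

    text: str
    tokens: Optional[List[int]] = None

    @property
    def num_tokens(self) -> int:
        # Assumption: fall back to whitespace splitting when no tokens are set.
        if self.tokens is not None:
            return len(self.tokens)
        return len(self.text.split())


@dataclass
class TokenConstraint(SamplingConstraint):
    """Token-count analogue of a duration limit (sketch only)."""

    max_tokens: int
    current: int = 0

    def measure_length(self, example: Any) -> float:
        return example.num_tokens

    def add(self, example: Any) -> None:
        self.current += self.measure_length(example)

    def exceeded(self) -> bool:
        return self.current > self.max_tokens

    def reset(self) -> None:
        self.current = 0


def batches(examples: Iterable[Any], constraint: SamplingConstraint) -> Iterator[List[Any]]:
    """Hypothetical helper: greedily group examples until the constraint is exceeded."""
    constraint.reset()
    batch: List[Any] = []
    for example in examples:
        constraint.add(example)
        if constraint.exceeded() and batch:
            # The new example does not fit: emit the batch and start a new one with it.
            yield batch
            batch = []
            constraint.reset()
            constraint.add(example)
        batch.append(example)
    if batch:
        yield batch
```

With this shape, a duration-based constraint would differ only in measure_length, which is roughly what makes it possible to reuse the same sampling logic for audio, text, or a mix of both.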

This is stretching the original scope of Lhotse a bit, but I feel like it's worth it: we've accumulated a bunch of solid techniques here, and it'd be a pity to have to use something completely different for multimodal modeling, especially when so few changes are required to make it work here. Would love to know your thoughts @danpovey @csukuangfj @desh2608 @m-wiesner

lhotse-speech / lhotse

Extending Lhotse dataloading to text/multimodal data #1295