**Stefanwuu** opened 5 days ago
A couple of related thoughts (and more) are in #292. (This is for TF, but most of the discussion is generic and can be applied in the same way to PT, or even be backend-independent.)
In case of PT, I think an easy way right now is to implement that as another `IterDataPipe`, e.g. like `ChunkingIterDataPipe`. Currently, in our PT `Engine._create_data_loader`, the pipe transformations we construct are somewhat hardcoded, but we could make this more configurable/flexible, so that the user could put arbitrary own things in between.

Note that there are specific assumptions on what kind of data flows through the pipe (e.g. how seq lens are stored), and this is currently not well defined; we planned to replace that by the well-defined `TensorDict` at some point (#1302). So if you depend on the current behavior, this might break in the future.
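To make the pipe idea concrete, here is a minimal, backend-independent sketch of the core concatenation logic such an `IterDataPipe` could apply. All names here (`concat_seqs`, `max_len`) are illustrative, not the actual RETURNN/PyTorch API; a real implementation would wrap this in an `IterDataPipe.__iter__` and operate on the data dicts flowing through the pipe.

```python
# Illustrative sketch only: greedy concatenation of consecutive sequences.
# Not the RETURNN API; a real IterDataPipe would work on the pipe's data
# dicts and handle seq lens etc.

def concat_seqs(seqs, max_len):
    """Greedily concatenate consecutive sequences as long as the combined
    length stays within max_len; yield each finished concatenation."""
    buffer = []
    for seq in seqs:
        # Flush the buffer once adding the next sequence would exceed max_len.
        if buffer and len(buffer) + len(seq) > max_len:
            yield buffer
            buffer = []
        buffer = buffer + list(seq)
    if buffer:
        yield buffer
```

E.g. with `max_len=4`, the inputs `[1, 2]`, `[3, 4]`, `[5]` would come out as `[1, 2, 3, 4]` and `[5]`.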
Another approach would be to just extend `ConcatSeqsDataset`, or implement another, more flexible dataset; then the logic applies at the dataset level, independent of the backend. I personally would maybe leave `ConcatSeqsDataset` untouched and make some generic `DynConcatSeqsDataset` or so, where the user gives a sub-dataset (just like for `ConcatSeqsDataset`), plus a generic function which decides what sequences to concatenate, and another function where the user can do the concatenation in whatever way they want.
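A rough sketch of how such a `DynConcatSeqsDataset` interface could look (the class name comes from this discussion; it is not an existing RETURNN class, and `group_fn`/`concat_fn` are hypothetical parameter names):

```python
# Hypothetical sketch of a "DynConcatSeqsDataset"-style wrapper.
# The user supplies:
#  - group_fn(group, seq): decide whether seq should join the current group
#  - concat_fn(group): perform the actual concatenation of a finished group

class DynConcatSeqs:
    def __init__(self, sub_seqs, group_fn, concat_fn):
        self.sub_seqs = sub_seqs  # the wrapped sub-dataset's sequences
        self.group_fn = group_fn
        self.concat_fn = concat_fn

    def __iter__(self):
        group = []
        for seq in self.sub_seqs:
            # Start a new group when the user's predicate rejects this seq.
            if group and not self.group_fn(group, seq):
                yield self.concat_fn(group)
                group = []
            group.append(seq)
        if group:
            yield self.concat_fn(group)
```

For example, grouping by a total-length budget of 3 and concatenating by flattening:

```python
ds = DynConcatSeqs(
    sub_seqs=[[1], [2, 3], [4, 5, 6]],
    group_fn=lambda group, seq: sum(len(s) for s in group) + len(seq) <= 3,
    concat_fn=lambda group: sum(group, []),
)
# iterating ds yields [1, 2, 3], then [4, 5, 6]
```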
I personally also prefer the idea of a `DynConcatSeqsDataset`; maybe I can do something about this.
Independent of that, here is a link to my implementation of a forced-alignment restriction that allows training with raw-wav alignments: #1574.
Since concatenating sequences during training turns out to be helpful for some datasets, such new functionality might be desirable for future training use.