Complementary sampler update following Lhotse's changes

pzelasko commented 2 years ago

This PR removes all mentions of arrow/pyarrow and switches the manifests to lazy JSONL loading. It also adjusts the arguments to K2SpeechRecognitionDataset. As a bonus, there's a TorchScript conversion in GigaSpeech decoding, which I find handy. I verified that I can start and run the GigaSpeech L training with both the Lhotse and Snowfall PRs applied. I also adjusted the data-loading code for LibriSpeech/Aishell in the same way, but haven't run those recipes.
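
To illustrate what lazy JSONL loading means here, below is a minimal sketch in plain Python. It is not Lhotse's implementation and does not use the Lhotse API; the helper names (`iter_jsonl`, `write_demo_manifest`) and the manifest fields (`id`, `duration`) are hypothetical and chosen only for illustration. The idea is that each manifest entry is one JSON object per line, so entries can be parsed one at a time instead of loading the whole manifest into memory.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def write_demo_manifest(path: Path, num_entries: int) -> None:
    """Write a toy JSONL manifest: one JSON object per line (hypothetical fields)."""
    with open(path, "w", encoding="utf-8") as f:
        for i in range(num_entries):
            entry = {"id": f"cut-{i}", "duration": 1.0 + i}
            f.write(json.dumps(entry) + "\n")


def iter_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield manifest entries one by one; only the current line is parsed."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because `iter_jsonl` is a generator, a consumer such as a sampler can start iterating immediately and memory use stays roughly constant regardless of manifest size; the trade-off is that random access and the total length are not available without a full pass over the file.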

Sorry for the reformatting of asr_datamodule.py; hopefully the diff is not too messed up.

The corresponding Lhotse PR is https://github.com/lhotse-speech/lhotse/pull/345. These two PRs should be merged together; I'll wait for either @danpovey or @csukuangfj to ack before merging both.

pzelasko commented 2 years ago

This is ready to merge now that I've confirmed decoding works; @danpovey @csukuangfj do you want to take a look?

danpovey commented 2 years ago

Great, thanks! I had a quick look and it looks fine. I'll merge.

k2-fsa / snowfall

Complementary sampler update following Lhotse's changes #238