opensource-spraakherkenning-nl / Kaldi_NL

Code related to the Dutch instance and user groups of the KALDI speech recognition toolkit
http://www.opensource-spraakherkenning.nl
Apache License 2.0

Re-containerize Kaldi_NL without LaMachine #22

Closed: proycon closed this issue 2 years ago

proycon commented 2 years ago

In light of proycon/LaMachine#214, a new standalone container build (one that no longer uses LaMachine) needs to be implemented.

Kaldi poses two extra challenges:

proycon commented 2 years ago

We'll see if we can increase the granularity of the models so that only the models that are needed can be selected, and we'll move those into separate container mounts that can be orchestrated together. This should make things more scalable and less bulky.

(discussed with @roelandordelman and @jblom)
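
As a rough illustration of the idea above (not the actual implementation), selecting only the needed models and mounting each one as a separate read-only volume could be sketched as follows. The model names, host paths, mount root and image name are all hypothetical placeholders; only the `docker run -v host:container:ro` syntax is standard Docker.

```python
import shlex

# Hypothetical catalogue: model name -> host directory holding that model.
MODEL_DIRS = {
    "model_a": "/data/models/model_a",
    "model_b": "/data/models/model_b",
    "model_c": "/data/models/model_c",
}


def docker_run_args(selected, image="example/kaldi_nl", mount_root="/opt/models"):
    """Build a `docker run` argument list that mounts only the selected models.

    The image name and mount root are assumptions made for this sketch.
    """
    args = ["docker", "run", "--rm"]
    for name in selected:
        if name not in MODEL_DIRS:
            raise KeyError(f"unknown model: {name}")
        # One read-only bind mount per model; unselected models are not mounted.
        args += ["-v", f"{MODEL_DIRS[name]}:{mount_root}/{name}:ro"]
    args.append(image)
    return args


if __name__ == "__main__":
    # Print the command for a setup that needs only two of the three models.
    print(shlex.join(docker_run_args(["model_a", "model_c"])))
```

Keeping each model in its own mount means the base image can stay small, and an orchestrator only has to attach the volumes a given deployment actually uses.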

proycon commented 2 years ago

Change of plan: I'm not going to use Alpine Linux, as I expect too many issues with non-GNU versions of tools. I'll just build on the image Kaldi itself provides (Debian).

sirifarif commented 2 years ago

@proycon Have you had a chance to look at this Dockerfile? It might help.

proycon commented 2 years ago

Thanks Arif! I've seen that one indeed; it provides a good reference (though it adds more than I want, like all the gstreamer stuff).

jblom commented 2 years ago

Ah yes, the gstreamer stuff is something I also wanted to get rid of in my old attempt at a leaner kaldi_NL image. I found this diagram I made back then (note: two colleagues, Willem and Jan-Willem, also used a kaldi_NL image in their workflows):

[Image: screenshot, 2022-02-18 08:52:48]

proycon commented 2 years ago

I've completed the re-containerization now, hopefully addressing all concerns we had and resulting in leaner images. Please see the release notes for all details: https://github.com/opensource-spraakherkenning-nl/Kaldi_NL/releases/tag/v0.4.0

Experiments using the new containers would be much appreciated, so we can fix any issues that may arise! The idea is that any further containers, e.g. with gstreamer, DANE, the asr_nl webservice or whatever else, can easily be based on the ones we provide now.

PS: @jblom nice diagram; perhaps we should make one for the new situation?

jblom commented 2 years ago

@proycon good to hear! I hope to test the new lean image this week. About the graphic: yes, it might be nice to provide such an image (internal image structure plus a part showing known FROM users of the image) at some point in the Kaldi_NL repo, but first I'd say a nice README with examples of how to use the image would be more useful (I can e.g. provide examples for k8s and refer to a fully fledged DANE ASR setup, e.g. a Helm chart).

But first let's test it out well.