lingua-libre / operations

⚙️ Configuration files and deployment procedures for LinguaLibre wiki.
MIT License

Fileformat in ogg? #6

Open WikiLucas00 opened 3 years ago

WikiLucas00 commented 3 years ago

In lingua-libre/operations/create_datasets.sh, lines 37, 43 and 52 request the ogg file format, whereas the format chosen for Lingua Libre files on Commons is wav. Is there a reason for using ogg in the datasets, and is anything preventing us from switching to wav?

All the best

Poslovitch commented 3 years ago

As far as I know, ogg files tend to be far smaller than wav files (ogg audio is typically lossy-compressed, while wav usually stores uncompressed PCM). The backend server has limited disk space and resources (incl. bandwidth), since most of them must go to an ever more resource-intensive BlazeGraph, so we can't afford to download and store hundreds of GB of recordings.

IMHO, the current way datasets work has already become unsustainable. Switching from ogg to wav is not going to help :confused: .
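To put rough numbers on the size gap between wav and ogg, here is a minimal back-of-the-envelope sketch. All parameters are assumptions for illustration, not values taken from the Lingua Libre pipeline: a 2-second mono recording at 48 kHz / 16-bit PCM, a 44-byte canonical WAV header, and a hypothetical average bitrate of 64 kbps for the compressed file (container overhead ignored).

```python
def pcm_wav_bytes(duration_s, sample_rate=48_000, bit_depth=16,
                  channels=1, header_bytes=44):
    """Approximate size of an uncompressed PCM WAV file, in bytes.

    header_bytes=44 is the canonical minimal WAV header; real files
    may carry extra metadata chunks.
    """
    bytes_per_sample = bit_depth // 8
    return header_bytes + int(duration_s * sample_rate * channels * bytes_per_sample)


def compressed_bytes(duration_s, avg_bitrate_kbps):
    """Approximate size of a lossy-compressed file from its average bitrate.

    Ignores container overhead, so short clips will be somewhat larger
    in practice.
    """
    return int(duration_s * avg_bitrate_kbps * 1000 / 8)


if __name__ == "__main__":
    duration = 2.0  # seconds, hypothetical single-word recording
    wav = pcm_wav_bytes(duration)
    ogg = compressed_bytes(duration, avg_bitrate_kbps=64)  # assumed bitrate
    print(f"wav: {wav} bytes, compressed: {ogg} bytes, ratio: {wav / ogg:.1f}x")
```

Under these assumptions the wav file is roughly an order of magnitude larger per recording; the actual ratio depends on the encoder settings used.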

hugolpz commented 2 years ago

Context

CBR stands for constant bitrate: an encoding method that keeps the bitrate the same throughout the file. VBR, by contrast, stands for variable bitrate: the bitrate changes over the course of the file.
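The CBR/VBR distinction can be sketched with a simplified model. The assumption here is that a file is a sequence of frames of equal duration, each with its own size in bytes; real codecs are more involved, and the frame sizes below are made up for illustration.

```python
def frame_bitrates_kbps(frame_bytes, frame_duration_s):
    """Bitrate of each frame in kbps, given its size in bytes."""
    return [size * 8 / frame_duration_s / 1000 for size in frame_bytes]


def is_constant_bitrate(frame_bytes):
    """True if every frame has the same size (CBR in this simplified model)."""
    return len(set(frame_bytes)) <= 1


def average_bitrate_kbps(frame_bytes, frame_duration_s):
    """Average bitrate over the whole file, in kbps."""
    total_bits = 8 * sum(frame_bytes)
    total_seconds = len(frame_bytes) * frame_duration_s
    return total_bits / total_seconds / 1000


if __name__ == "__main__":
    frame_duration = 0.02             # 20 ms frames, hypothetical
    cbr_frames = [160, 160, 160, 160]  # same size everywhere
    vbr_frames = [40, 200, 240, 160]   # small for silence, larger for speech

    for name, frames in (("CBR", cbr_frames), ("VBR", vbr_frames)):
        print(name,
              "constant:", is_constant_bitrate(frames),
              "per-frame kbps:", frame_bitrates_kbps(frames, frame_duration),
              "average kbps:", average_bitrate_kbps(frames, frame_duration))
```

Both example files have the same total size and therefore the same average bitrate; only the distribution of bits across time differs.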

Issues