Open sbrl opened 23 hours ago
Hi @sbrl, we already provide the exported datasets under the latest release, which is updated automatically every month through GitHub Actions (https://github.com/tldr-pages/tldr-translation-pairs-gen/releases/latest).
Additionally, I publish the CSV dataset officially under our org on Kaggle at https://www.kaggle.com/datasets/tldr-pages/tldr-pages-translation-pairs-dataset.
Regarding the OPUS Corpus, they seem to update the dataset based on releases made to the repo here, so I guess I will create quarterly releases to keep the dataset up to date upstream. (Will do one now.)
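For anyone who would rather pull the monthly export from a script than from the release page, here is a minimal sketch using only the Python standard library. It assumes the GitHub REST API's "latest release" endpoint for this repo and the `assets` / `browser_download_url` fields of its JSON response. The helper names (`asset_urls`, `fetch_latest_release`) are made up for illustration, and the asset file names are not assumed, so the script simply lists whatever is attached.

```python
import json
import urllib.request

# Repo taken from the release link above; the endpoint is GitHub's REST API
# route for the most recent published release.
REPO = "tldr-pages/tldr-translation-pairs-gen"
API_URL = f"https://api.github.com/repos/{REPO}/releases/latest"


def asset_urls(release: dict) -> dict:
    """Map each release asset's file name to its download URL (hypothetical helper)."""
    return {
        asset["name"]: asset["browser_download_url"]
        for asset in release.get("assets", [])
    }


def fetch_latest_release(url: str = API_URL) -> dict:
    """Fetch the latest-release metadata as a dict (needs network access)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print every file attached to the latest release with its download URL.
    for name, download_url in asset_urls(fetch_latest_release()).items():
        print(name, download_url)
```

Unauthenticated API requests are rate-limited, so a scheduled job would probably want to send a token.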
Sounds good!
Yeah, I did see the Kaggle dataset there, but hadn't explored it yet.
As I understand it, this repo is about exporting translation pairs from tldr-pages to OPUS to build a high-quality translation pairs dataset. Given that OPUS claims the last export was ~August 2023, is it possible to automate the export, e.g. via GitHub Actions?
Then a) we don't have to worry about it, and b) OPUS gets a nice updated dataset regularly.