tldr-pages / tldr-translation-pairs-gen

Generates a structured dataset in various formats derived from tldr-pages.
https://opus.nlpl.eu/tldr-pages/corpus/version/tldr-pages
MIT License

Automated export? #81

Open sbrl opened 23 hours ago

sbrl commented 23 hours ago

As I understand it, this repo exports translation pairs from tldr-pages to OPUS as a high-quality translation pairs dataset. Given that OPUS says the last export was around August 2023, would it be possible to automate the export, e.g. via GitHub Actions?

Then a) we don't have to worry about it, and b) OPUS gets a nice, regularly updated dataset.

kbdharun commented 5 hours ago

Hi @sbrl, we already provide exported datasets under the latest release, which is automatically updated every month through GitHub Actions (i.e. https://github.com/tldr-pages/tldr-translation-pairs-gen/releases/latest).

Additionally, I publish the CSV dataset officially under our org on Kaggle at https://www.kaggle.com/datasets/tldr-pages/tldr-pages-translation-pairs-dataset.

Regarding the OPUS corpus, they seem to update the dataset based on releases made to this repo, so I guess I will create quarterly releases to keep the dataset up to date upstream. (Will do one now.)

sbrl commented 2 hours ago

Sounds good!

Yeah, I did see the Kaggle dataset, but hadn't explored it yet.