tldr-pages / tldr-translation-pairs-gen

Generates a structured dataset in various formats derived from tldr-pages.
https://opus.nlpl.eu/tldr-pages/corpus/version/tldr-pages
MIT License

Make one of the natively supported outputs ready for OPUS #2

Closed: SethFalco closed this 1 year ago

SethFalco commented 1 year ago

The primary motivation for this project is to produce a dataset for OPUS, yet an OPUS-ready format is not one of the supported outputs.

Rather than handing them a CSV, XML, or JSON file to process further, it'd be nice if we understood their format and supported it out of the box.

Here is the format we should export to: https://opus.nlpl.eu/trac/wiki/DataFormats.html

SethFalco commented 1 year ago

We export to TMX now, which should be good. :+1:
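
For reference, here is a rough sketch of what a TMX export of translation pairs can look like. This is not the project's actual exporter: the function name `build_tmx`, the header values, and the sample sentences are all assumptions made up for illustration, and it only uses Python's standard library. It follows the general TMX 1.4 layout of one `<tu>` per aligned pair with one `<tuv>` per language.

```python
import xml.etree.ElementTree as ET

# Clark-notation name for the predefined xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en"):
    """Build a TMX 1.4 string from a list of {language: text} dicts.

    Hypothetical helper for illustration only; the header values
    below are placeholders, not what the real project emits.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",   # placeholder
        "creationtoolversion": "0.0.0",     # placeholder
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for pair in pairs:
        # One translation unit per aligned set of sentences.
        tu = ET.SubElement(body, "tu")
        for lang, text in pair.items():
            # One variant per language, with the text in <seg>.
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    # Made-up sample pair, not taken from the real dataset.
    print(build_tmx([{"en": "List files.", "fr": "Lister les fichiers."}]))
```

Each dict in `pairs` becomes one `<tu>`, so a page translated into several languages would yield units with more than two `<tuv>` children.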