tldr-pages / tldr-translation-pairs-gen

Generates a structured dataset in various formats derived from tldr-pages.
https://opus.nlpl.eu/tldr-pages/corpus/version/tldr-pages
MIT License

Duplicate translation units in output #11

Open SethFalco opened 1 year ago

SethFalco commented 1 year ago

In the project, certain lines are repeated frequently, namely template text such as the text used on alias pages.

We should do something to avoid writing out duplicates. Cases where lines are similar but not identical, like "More Info…", are perfectly fine since the link changes, but cases like the alias pages are 100% identical.
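
One possible approach, as a rough sketch only: track every (source, target) pair already written and skip exact repeats. The names `TranslationUnit` and `unique_units` and the sample strings are hypothetical, for illustration only. They are not taken from this project's code or its real output.

```python
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical representation: one translation unit is a (source, target) pair.
TranslationUnit = Tuple[str, str]


def unique_units(units: Iterable[TranslationUnit]) -> Iterator[TranslationUnit]:
    """Yield each exact (source, target) pair once, keeping first-seen order."""
    seen: Set[TranslationUnit] = set()
    for unit in units:
        if unit in seen:
            # Exact duplicate, e.g. the same template line from another alias page.
            continue
        seen.add(unit)
        yield unit


# Made-up sample data: the template line repeats verbatim, while the
# "more information" lines differ because their links differ.
sample = [
    ("View documentation for the original command:",
     "Voir la documentation de la commande originale :"),
    ("More information: <https://example.com/a>.",
     "Plus d'informations : <https://example.com/a>."),
    ("View documentation for the original command:",
     "Voir la documentation de la commande originale :"),
    ("More information: <https://example.com/b>.",
     "Plus d'informations : <https://example.com/b>."),
]

deduped = list(unique_units(sample))
```

Because only exact matches are dropped, lines that differ in their link are kept, which matches the behaviour described above. The set holds every pair in memory, which may or may not be acceptable depending on the size of the dataset.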