tdwg / dwc

Darwin Core standard for sharing of information about biological diversity.
https://dwc.tdwg.org
Creative Commons Attribution 4.0 International

Work on translations to substitute into the quick reference guide #242

Open baskaufs opened 4 years ago

baskaufs commented 4 years ago

We should create tables for the basic DwC metadata in various languages that could be substituted into the Quick Reference Guide. It would probably require some changes to the build script, or a different build script.

kcopas commented 4 years ago

Might take some digging to find them, but we do have some earlier translations in the files at the @gbif secretariat. Plus all the terms included in the eight languages now on GBIF.org, if it’s helpful.


baskaufs commented 4 years ago

Yes, I have a copy of translations somewhere, but I think they are outdated. They would need to be checked against the current term list, and their definitions checked to make sure they are up to date. But I think first I need to work out how to get the DwC infrastructure to make use of them. I'll make a more serious effort to track down existing work after that. Thanks!

baskaufs commented 4 years ago

OK, the table I have is here. I don't remember where they came from. Definitions are in en, es, zh_hans, and ja. Labels are in the same languages except for ja.

MattBlissett commented 3 years ago

> OK, the table I have is here. I don't remember where they came from. Definitions are in en, es, zh_hans, and ja. Labels are in the same languages except for ja.

I think that might have come from http://rs.gbif.org/terms/dwc/dwc_translations.rdf .

The GBIF IPT would benefit from translations. Should we investigate setting up something to generate these translations (terms + definitions) as part of our translation system? (i.e. Crowdin). If it's given some XML with English attribute values, it would produce similar XML with translated attribute values (one file per language).

tucotuco commented 3 years ago

I think that would be awesome.

baskaufs commented 3 years ago

@MattBlissett Two questions:

  1. We have scheduled a workshop at the TDWG conference to get volunteers to work on translations of controlled vocabularies. Would it be possible to run the existing labels and definitions through your system and have the translators correct them rather than having to start from scratch?
  2. Are we wedded to XML? When representing multilingual labels and definitions, I have been trying to use a consistent JSON-LD format as demonstrated here: https://heardlibrary.github.io/digital-scholarship/lod/json_ld_test/establishmentMeans.json This form allows the data to be both born Linked Data-ready (it ingests into a triple store as RDF consistent with the Standards Documentation Specification) and easily read by JavaScript and scripting languages like Python (for example, it was used to generate this: https://heardlibrary.github.io/digital-scholarship/lod/json_ld_test/display-cv.html?nl). As you know, reading XML isn't impossible, but I've found it to be a lot more clunky than JSON.
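A minimal sketch of how a script might read such multilingual values, assuming they are stored as standard JSON-LD language-tagged objects (`@language` / `@value`). The helper name `pick_language`, the sample strings, and the fallback to English are hypothetical choices for illustration, not the structure of the linked file.

```python
def pick_language(values, lang, fallback="en"):
    """Return the @value tagged with lang, else the fallback language, else None."""
    by_lang = {v.get("@language"): v.get("@value") for v in values}
    return by_lang.get(lang, by_lang.get(fallback))


# Hypothetical label list for one term (placeholder strings).
labels = [
    {"@language": "en", "@value": "Example label"},
    {"@language": "es", "@value": "Etiqueta de ejemplo"},
]

print(pick_language(labels, "es"))  # Spanish label is present
print(pick_language(labels, "ja"))  # no Japanese label, so English is used
```

A lookup like this is what a translated Quick Reference Guide build would need for each label and definition, with untranslated terms falling back to English.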

MattBlissett commented 3 years ago

  1. Existing translations could be imported. About 80% of the labels exist, but not many of the definitions match — the translations are very old, and pre-date the splitting out of comments/examples, and lots of smaller changes.

  2. No; in fact CSV might be easiest. If we're to use the Crowdin system (under GBIF's account or a new one) then we'll need some scripting to structure the data in a way Crowdin can make best use of it, i.e. presenting it in a reasonable way for the translators. The benefit of the system is in tracking changes — an English definition can be changed, and translators are then prompted to check/retranslate.

Here's a rough example. I've taken the DwC term labels, definitions, comments and examples and split them into 4 separate CSV files, keyed on the short name. These are the English labels: https://github.com/gbif/crowdin-asciidoctor-testing/blob/translation_master/dwc_labels.en.csv (the columns are key, translation context and translation source string). Crowdin picks that up from GitHub, and the translators input the translations in this interface: [image]

(I imported the existing Spanish translations.)

Crowdin then generates this file, which has the English strings replaced by Spanish: https://github.com/gbif/crowdin-asciidoctor-testing/blob/translation_master/dwc_labels.es.csv
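A minimal sketch of the split described above, assuming a source table with hypothetical columns `key`, `label`, `definition`, `comments` and `examples`. The sample rows are placeholders and the wording of the context column is an assumption; neither is taken from the real files.

```python
import csv
import io

# Hypothetical stand-in for the terms table; column names are assumptions.
TERMS_CSV = """key,label,definition,comments,examples
termA,Label A,Definition of term A.,Comment on term A.,Example A
termB,Label B,Definition of term B.,,Example B
"""

FIELDS = ("label", "definition", "comments", "examples")


def split_for_translation(terms_csv: str) -> dict[str, str]:
    """Return one CSV text per field, each with columns key, context, source."""
    rows = list(csv.DictReader(io.StringIO(terms_csv)))
    outputs = {}
    for field in FIELDS:
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        for row in rows:
            if not row[field]:
                continue  # nothing to translate for this term
            context = f"{field} of {row['key']}"
            writer.writerow([row["key"], context, row[field]])
        outputs[field] = buffer.getvalue()
    return outputs


files = split_for_translation(TERMS_CSV)
print(files["label"])
```

Keeping one file per field means a translator working on labels never sees, or accidentally edits, the longer definition strings.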

baskaufs commented 3 years ago

Oh, I hadn't picked up on the fact that Crowdin is crowd-sourcing and not AI. Then it's probably not the best for the controlled vocab translating since the idea was to have humans who were content experts do the translating.

CSV is awesome. That's what I've been using to generate the JSON-LD anyway.

What I've been drawing on to generate the JSON is something like this: https://github.com/tdwg/rs.tdwg.org/blob/master/dwc-translations/dwcTranslations.csv

Aside from the first three columns, the rest of the columns are all just label/definition pairs for each language, with column mappings to properties and language attributes here: https://github.com/tdwg/rs.tdwg.org/blob/master/dwc-translations/dwcTranslations-column-mappings.csv But separate files would also be fine and I hadn't been handling the examples and comments since they don't exist for controlled vocabularies at this time.

Anything similar that's on GitHub could be used as a common source for GBIF (or anyone else) as well as by the DwC team to generate translations of the various term lists/guides.
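A minimal sketch of the wide-CSV-plus-column-mappings approach described above, assuming hypothetical column names (`label_en`, `definition_es`, ...) and assuming the labels and definitions map to `skos:prefLabel` and `skos:definition`. The actual headers and properties in the linked files may differ, and the sample strings are placeholders.

```python
import csv
import io
import json

# Hypothetical wide table: one label/definition pair of columns per language.
WIDE_CSV = """term_localName,label_en,definition_en,label_es,definition_es
exampleTerm,Example label,Example definition.,Etiqueta de ejemplo,Definición de ejemplo.
"""

# Hypothetical mapping of each column to a property and a language tag.
MAPPINGS_CSV = """column,property,language
label_en,skos:prefLabel,en
definition_en,skos:definition,en
label_es,skos:prefLabel,es
definition_es,skos:definition,es
"""


def to_json_ld(wide_csv: str, mappings_csv: str) -> dict:
    """Build a JSON-LD document with language-tagged values for each term."""
    mappings = list(csv.DictReader(io.StringIO(mappings_csv)))
    graph = []
    for row in csv.DictReader(io.StringIO(wide_csv)):
        node = {"@id": row["term_localName"]}
        for mapping in mappings:
            value = row.get(mapping["column"], "")
            if value:  # skip languages with no translation yet
                node.setdefault(mapping["property"], []).append(
                    {"@language": mapping["language"], "@value": value}
                )
        graph.append(node)
    context = {"skos": "http://www.w3.org/2004/02/skos/core#"}
    return {"@context": context, "@graph": graph}


doc = to_json_ld(WIDE_CSV, MAPPINGS_CSV)
print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Because the mapping table drives the conversion, adding a language only requires new columns and two new mapping rows, not a change to the script.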

MattBlissett commented 3 years ago

Just quickly: we can easily choose the crowd for Crowdin, and allocate only experts to the project.

I'll absorb the rest of what you wrote tomorrow.

MattBlissett commented 3 years ago

The main advantage of Crowdin (or a similar tool, like Weblate) is — with some scripts — it automates the process of distributing files to translators, and integrating the resulting translations.

Translators would be chosen however we'd like, e.g. invited experts, paid translators, volunteers. Translators can discuss or comment on any translation string. New or changed translations are fed back into Git as a pull request from Crowdin.

> CSV is awesome. That's what I've been using to generate the JSON-LD anyway.
>
> What I've been drawing on to generate the JSON is something like this: https://github.com/tdwg/rs.tdwg.org/blob/master/dwc-translations/dwcTranslations.csv

This is also what I used to import the Spanish translations in the demo. I used https://github.com/tdwg/rs.tdwg.org/blob/master/terms/terms.csv for the source strings.

> But separate files would also be fine and I hadn't been handling the examples and comments since they don't exist for controlled vocabularies at this time.

Crowdin supports either, but I chose separate files since I think they're easier for people to review – a changed line in one file clearly affects only a single language.

MattBlissett commented 2 years ago

Hi @tucotuco, @baskaufs, @debpaul, @pzermoglio

I've improved the Crowdin integration for translation of Darwin Core term labels, definitions etc.

I've enabled it only for Darwin Core terms, and although the examples, comments etc are also available for translation, so far only the labels and definitions are used -- the same as Steve did for establishmentMeans, degreeOfEstablishment etc in the conference workshop last year.

The process runs on GBIF's Jenkins server. It runs whenever prompted by a change on GitHub, and after a ~10-30 minute delay from a change on Crowdin. It

  1. finds any changes in terms/terms.csv, and sends these changes to Crowdin
  2. takes any new or updated translations from Crowdin and adds them to terms/terms-translations.csv.
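The two steps above can be sketched roughly as follows. This is only an illustration of the idea, not the Jenkins job itself: the function names, the in-memory dictionaries, and the rule that an incoming translation overwrites an existing one are all assumptions.

```python
def changed_keys(old: dict[str, str], new: dict[str, str]) -> set[str]:
    """Keys whose English source string is new or differs from the previous run."""
    return {key for key, text in new.items() if old.get(key) != text}


def merge_translations(
    existing: dict[str, dict[str, str]],
    incoming: dict[str, dict[str, str]],
) -> dict[str, dict[str, str]]:
    """Merge {key: {language: text}} maps; incoming translations win."""
    merged = {key: dict(by_lang) for key, by_lang in existing.items()}
    for key, by_lang in incoming.items():
        merged.setdefault(key, {}).update(by_lang)
    return merged


# Step 1: find source strings that need to be sent for (re)translation.
previous = {"termA": "Old definition.", "termB": "Definition B."}
current = {"termA": "New definition.", "termB": "Definition B.", "termC": "Definition C."}
to_send = changed_keys(previous, current)

# Step 2: fold new or updated translations into the stored ones.
stored = {"termA": {"es": "Definición antigua."}}
updates = {"termA": {"es": "Definición nueva."}, "termC": {"fr": "Définition C."}}
result = merge_translations(stored, updates)
print(sorted(to_send), result)
```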

Steve's script generates files like establishmentMeans.json from establishmentMeans/establishmentMeans-translations.csv, and I'm using exactly the same structure for terms/terms-translations.csv. I haven't tested if Steve's script works on terms-translations.csv, as the script is still on a branch and I've not looked at how it works.

If you go to https://crowdin.com/translate/darwin-core/58/en-fr?filter=basic&value=3 (or your preferred language) and add another translation (copy one from the old translations or the GBIF portal French translation) and wait about 10-30 minutes, you should see the update on terms/terms-translations.csv.

There are settings within Crowdin to add a review step before the new/changed translation is exported, but for simplicity it's not enabled at present.

baskaufs commented 2 years ago

This is so cool, Matt! I've put it on my todo list to try running my script with the file you generated.

The reason the script is on a branch is that I set up a branch (gh-pages) for GitHub Pages so that the generated JSON files would get served with the correct Content-Type headers. So that branch won't ever get merged. That may not be the best setup in the long run -- I didn't use the usual "docs" folder option because I already had a docs folder that I was using for something else.

MattBlissett commented 10 months ago

I have made improvements to the process, and (I think) imported all the results from the TDWG workshop a couple of years ago. I haven't tried to import any other definitions — I think it's best if translators do that themselves, one at a time, as many of the definitions have changed.

Translations done in Crowdin will end up in these CSV files, around 1-2 hours after the change is made in Crowdin:

Parts of this repository used for the website (e.g. terms.tmpl) can also be translated; this is easily set up in Crowdin. That could mean that files such as terms.es.tmpl appear.

peterdesmet commented 10 months ago

Ping @ben-norton (in case he is not subscribed to this issue).