[Machine Translation] Modify scripts to Download translations

roadlittledawn commented 3 years ago

Summary

<!> Please use the https://github.com/newrelic/docs-website/tree/feature/machine-translation feature branch

Similar to #2536, the interaction with the Smartling API is identical for both Machine and Human Translation. The deserializing method should also not change, apart from the additional considerations described below.

The workflow currently calls check-job-progress.js, which picks up the projectId from an environment variable. Since we are creating a separate workflow for Machine Translation, we can set that env variable there. That way, we should only need to add a flag when running the script to tell it how to deserialize the data. Something like:

node ./scripts/actions/check-job-progress.js --machine
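A minimal sketch of how the script could read that flag, assuming plain `process.argv` parsing with no CLI library. `parseTranslationType` is a hypothetical name used for illustration, not an existing function in the repo:

```javascript
// Hypothetical helper: maps the presence of `--machine` to a translation type.
// Without the flag, the script would fall back to human translation.
const parseTranslationType = (argv) =>
  argv.includes('--machine') ? 'machine' : 'human';

// process.argv.slice(2) drops the node binary and the script path.
const translationType = parseTranslationType(process.argv.slice(2));
console.log(`Deserializing as: ${translationType}`);
```

Defaulting to `human` when the flag is absent would keep the existing Human Translation workflow unchanged.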

This will need to be passed through to...

Fetching

We currently fetch translations using the $PROJECT_ID, so this shouldn't need to change in fetch-and-deserialize.js. We will just need to add logic for when we pass through the --machine flag from check-job-progress.js when...

Deserializing

When we fetch the completed translation doc, we will need to add frontmatter to each translated doc to signify that it has been Machine/Human Translated. Using the --machine flag we will be able to tell which frontmatter to add. (This will be needed to determine whether to show the Disclaimer as part of https://github.com/newrelic/docs-website/issues/2537)

E.G.

translationType: machine/human

Currently, once we download a translated doc, we strip the translate key:value out of the frontmatter, so the method for editing the frontmatter would be largely the same (see here).
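As a rough sketch of that frontmatter edit, the following strips the `translate` key and adds `translationType` in one pass. `setTranslationType` is a hypothetical helper, and it assumes the frontmatter is delimited by `---` lines and that `translate` is written on a single line. The real script may use a proper YAML parser instead:

```javascript
// Hypothetical helper, not the repo's actual implementation.
// Assumes single-line `key: value` entries for the keys it touches.
const setTranslationType = (mdx, isMachine) => {
  const match = mdx.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!match) return mdx; // no frontmatter block found: leave the file untouched

  const kept = match[1]
    .split('\n')
    // Drop `translate:` and any stale `translationType:` entry.
    .filter((line) => !/^(translate|translationType):/.test(line));

  kept.push(`translationType: ${isMachine ? 'machine' : 'human'}`);

  // Rebuild the file: new frontmatter followed by the original body.
  return `---\n${kept.join('\n')}\n---\n${mdx.slice(match[0].length)}`;
};

const doc = '---\ntitle: Hello\ntranslate: [jp]\n---\n\nBody\n';
console.log(setTranslationType(doc, true));
```

Removing any existing `translationType` before appending means a page that moves between Machine and Human Translation ends up with exactly one value.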

Accounting for `project_id`

The functionality for updating the tables with completed job status should not change: the project_id should already be in the tables and we are just updating the status column. But it would be good to double/triple check that.

For Downloading and Deserializing, the relevant scripts are:

🛑 Testing Scripts

You can use the MT Project ID for this, but only test with a 1 word change, [see here](https://github.com/newrelic/docs-website/blob/develop/scripts/actions/translation_workflow/testing/README.md#make-a-change-to-translate). The reason for this is that we have a `2 million` word limit per year specifically for Machine Translation:

> Average is about 850 words per document. Total is about 1.6 million. For MT, a total of 2 million words can be translated over a year. We have approximately 1800 pages x 850 (avg word count per page) = 1.6 million

Acceptance criteria

[x] Should fetch translated docs from the Machine Translation Project and deserialize them
[x] Should add frontmatter translated: machine/human to all pages after fetching to signify how it has been translated
[x] Scripts should account for new project_id column in tables

roadlittledawn commented 2 years ago

@rudouglas

Sorry for the churn on this. Thinking through this is tricky stuff!

TL;DR: Perhaps adding that info via frontmatter is the way to go. I can't think of other big hang-ups to not do it.

Thoughts on adding frontmatter to indicate machine/human translated

When the site builds and a template renders a page, I _think_ we could still infer if a locale namespaced page was human/machine translated by looking at the English counterpart MDX file frontmatter (via graphql). Buuut that comes with a downside: if an English file's frontmatter is changed to add it to human translation for a given language and it's deployed, on build we'll still check that and erroneously think the machine translated counterpart for that language is human translated. We would no longer display the "disclaimer" for the given page, though that would be wrong until we got the resulting human translation back in a PR and merged. And the same issue applies vice versa if an English file's frontmatter is changed to remove it from human translation for a language.

rudouglas commented 2 years ago

@roadlittledawn Not at all, tis a wild beast of an epic, we must be sure to tame it right. I'll detail what I was thinking here just for full context.

My 🪙 🪙

We are translating to both `jp` and `kr`. The English version of a specific page has `translate: ['jp']`, so we know to send it to human translation for `jp` and machine translation for `kr`. When they come back, they get marked:

- `jp` file with `translated: human`
- `kr` file with `translated: machine`
- We now have 3 separate files

We change the English version to `translate: ['kr']`. We now send to HT for `kr` and MT for `jp`. When they come back we replace the frontmatter again:

- `jp` file now has `translated: machine`
- `kr` file now has `translated: human`

Regardless of what happens in the English version, the file will always be sent for translation to the correct project, and be marked with the correct frontmatter when it comes back.
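The routing rule in this flow can be sketched as a small pure function. `translationTypeFor` is a hypothetical name, and the sketch assumes `translate` is always an array of locale codes:

```javascript
// Hypothetical sketch: locales listed in the English page's `translate`
// frontmatter go to human translation; every other locale goes to machine.
const translationTypeFor = (translate, locales) =>
  Object.fromEntries(
    locales.map((locale) => [
      locale,
      translate.includes(locale) ? 'human' : 'machine',
    ])
  );

// First pass: the English page has translate: ['jp']
console.log(translationTypeFor(['jp'], ['jp', 'kr']));
// After the English page changes to translate: ['kr']
console.log(translationTypeFor(['kr'], ['jp', 'kr']));
```

Because the result depends only on the English frontmatter at send time, re-running it after a change swaps the markings without needing any stored state.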

One other separate thing we have to determine actually is what we do in the following scenario:

We send a file for translation
That file is added to the exclusion list
We download the file and see that it is on the exclusion list
What happens next? Do we just disregard it?

roadlittledawn commented 2 years ago

@rudouglas yep, that flow makes sense to me. one [nit] though if i may. can we change the name of the frontmatter field to something like translatedBy or translationType?

Re: the scenario you lay out

When we download it from translation, why would we need to check if it's excluded? Wouldn't we have done that when adding to the queue to determine if it should be HT / MT? And when downloading, we will know it's MT because of the project it's associated with, right?

I suppose a similar scenario would be that we add a file to the translation queue for MT, and before we send it off (because it will likely run on an interval like HT, which is ~ every two weeks), someone adds an exclusion rule that would exclude it. But if we don't check again, we would send it off. I think this case would probably be rare, because we'll have a good sense of what we want to exclude from day 1. Is there anything we could put in place to warn / remind someone to ensure any relevant files that are in the queue / uploaded aren't excluded by the rule they are adding?

rudouglas commented 2 years ago

Loving these collapsers 🎂 ye that makes it more obvious i like translationType

That's exactly the scenario I was thinking of yeah, it's probably rare enough to not need to worry about it right now and just have some kind of warning. The simplest solution would be to just have a comment in the exclusions.yml file to remind people to check, but is there an easy way for them to check? If we need to write code to implement a warning I would argue that we might as well code the check into the automation anyway, it should be very similar to the code adding it to the queue anyway and we can write it with that use case in mind.

Either way translating a couple of files that should have been excluded isn't really a big deal, the bulk of the translations will be done in the initial run so as long as the exclusions are correct at the start this might not be something we need to consider right now
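If the check does get coded into the automation, it could look roughly like the sketch below. It assumes exclusion rules can be modelled as path prefixes. The real exclusions.yml shape is not shown in this thread, and `findExcludedInQueue` and the file paths are hypothetical:

```javascript
// Hypothetical sketch: returns queued files that match any exclusion rule.
// Exclusion rules are modelled as path prefixes for illustration only.
const findExcludedInQueue = (queuedPaths, excludedPrefixes) =>
  queuedPaths.filter((path) =>
    excludedPrefixes.some((prefix) => path.startsWith(prefix))
  );

// Illustrative paths, not real entries from the queue.
const queued = [
  'src/content/docs/apm/intro.mdx',
  'src/content/docs/release-notes/agent-1.mdx',
];
const conflicts = findExcludedInQueue(queued, ['src/content/docs/release-notes/']);

// Warn rather than fail, since translating a few extra files is low risk.
conflicts.forEach((path) =>
  console.warn(`Queued file is now excluded: ${path}`)
);
```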
