openedx / openedx-translations

Open edX Translation files in sync with Transifex
Creative Commons Attribution Share Alike 4.0 International
3 stars 39 forks source link

Ensure all strings are valid and don't break builds #549

Open brian-smith-tcril opened 1 year ago

brian-smith-tcril commented 1 year ago

@OmarIthawi mentioned an example where invalid strings could break a build here.

what do you think should happen if a po file fails to compile due to Translator writing invalid strings Hi {user} --> Merhaba {kullanci}?

When the strings were checked into the repos directly, this would fail the individual repo CI. With the switch to openedx-translations/openedx-atlas, strings updated in transifex and synced to openedx-translations will not trigger a CI build on any given repo.

We'll want to ensure we don't have strings that break builds checked into openedx-translations, so we should add validation somewhere.

In the same comment, @OmarIthawi mentioned

I'm thinking we should add validation to the https://github.com/openedx/openedx-translations

That seems like a good plan to me

### Tasks
- [ ] Validate JavaScript (React) translation to avoid breaking the build.
- [x] Validate Python (pofile) translations: https://github.com/openedx/openedx-translations/pull/564
- [x] Implement full validation for Python strings like `i18n_tool validate` in addition to `msgfmt` of https://github.com/openedx/openedx-translations/pull/564
OmarIthawi commented 1 year ago

@brian-smith-tcril I looked up for tools to validate JSON translation files, but found none online.

Does it make sense to convert json to po-files and validate them? I'm not sure if there's a two-way conversion between the two format.

I know that Apache superset has a po2json utility, but I haven't tested it.

brian-smith-tcril commented 1 year ago

@OmarIthawi it looks like translate-toolkit could work, it supports converting to and from json.

I was hoping there would be a way to validate without converting out of json by just using the formatjs cli directly, but that doesn't seem to be possible.

I'm not sure if there are any strange edge cases that will trip up the validation with the conversion back and forth, but I think trying out validation with translate-toolkit's json2po and po2json is a reasonable path forward.

timmc-edx commented 3 months ago

It seems that there has been a regression in validation. For example, see this translation where the slot/param name was translated:

https://github.com/openedx/openedx-translations/blob/171142ac6ca8eaabcfea7940d2be935f2264d074/translations/edx-platform/conf/locale/ar/LC_MESSAGES/django.po#L23603-L23605

We have tooling in https://github.com/openedx/i18n-tools that's supposed to catch this sort of thing. It currently picks up about 300 errors in edx-platform alone. Is there some reason this is no longer in use?

OmarIthawi commented 3 months ago

@timmc-edx We have tests in place that is supposed to match i18n-tools, however it sounds like it needs more work.

https://github.com/openedx/openedx-translations/blob/e9a7fe1e67b3228fd8ba2afc51c11e2552a319ff/scripts/validate_translation_files.py#L32-L36

I couldn't make i18n-tools work with non-standard file structure such as translations/edx-platform/... therefore I used msgfmt directly.

What do you suggest?

timmc-edx commented 3 months ago

I don't know exactly what msgfmt checks, but I think it's just basic file structure. i18n_tool checks a variety of other things, including whether there are the same parameter slots in the inputs and outputs. You can see all the things it checks here: https://github.com/openedx/i18n-tools/blob/master/i18n/validate.py

I think it would be appropriate to modify that tool to allow different directory structures (or even just expect a different one).

OmarIthawi commented 3 months ago

Yes, it seems like msgfmt is only a part of i18n-tools/blob/master/i18n/validate.py.

I'll keep this issue open for now. Unless someone works on it, I plan to fit within my Open Source contributions late August / early September.

timmc-edx commented 3 months ago

For reference, here are the strings that i18n_tool currently picks out as broken in edx-platform: https://gist.github.com/timmc-edx/eeaa4c1a40ee07871049cba53b30b4fa

OmarIthawi commented 3 months ago

Thanks @timmc-edx. That's mostly machine translation issues. I recall the Arabic strings to be of high quality and haven't had such issues in the past.

The solution I have in mind is to:

Fixing them by hand all at once might be prohibitively costly.

What do you think?

timmc-edx commented 3 months ago

I think we could do it by hand -- I bet there's a way to bulk find-and-unapprove translations in Transifex, and we could use search strings like < and notranslate to find them. I'd be fine doing that work in between other tasks, and I'd really like to avoid having the site untranslated in those languages for the duration.

But yeah, I think the first step would be adding i18n_tool to the CI step so that translation updates are blocked until they get cleaned up.

OmarIthawi commented 3 months ago

@timmc-edx there's a way to bulk review/unreview/delete transaltions. But it'll be one language at a time.

I have capacity issues at the moment regarding my contributions time i.e. I've done much more than planned this month and I need to focus on other projects at hand.

But yeah, I think the first step would be adding i18n_tool to the CI step so that translation updates are blocked until they get cleaned up.

Yes.

If help is there, I'd really appreciate it. Otherwise I'd have to clear my plate a bit.