r-devel / translations

subsite for translations
https://contributor.r-project.org/translations/
Creative Commons Attribution 4.0 International
1 stars 3 forks source link

batch fix leading and trailing whitespace #12

Closed daroczig closed 7 months ago

daroczig commented 1 year ago

Thanks a ton to @reikookamoto for raising this:

Example of "legacy" translation in Japanese that has starting and trailing whitespace: https://translate.rx.studio/translate/r-project/grdevices-r/ja/?q=check:begin_space Example of a translation created during the Sprint via accepting a suggestion - the automated checks remove the starting and trailing whitespaces: https://translate.rx.studio/translate/r-project/parallel-r/ja/?q=state:%3E=translated This creates an inconsistency in Japanese (and potentially other languages) where formatting might slightly differ between "legacy" translations and those added using the Weblate web app

It's annoying to fix all the strings manually in the UI, so let's look for a way to fix these stings in an automated/batch process.

MichaelChirico commented 1 year ago

Side note: we might disable this check:

image

It doesn't understand that different languages can use different "full stop" characters:

image

MichaelChirico commented 1 year ago

My thought for this is to leverage {potools}:

pot = potools:::get_po_messages("src/library/base/po/R-base.pot")
ja = potools:::get_po_messages("src/library/base/po/R-ja.po")

pot[, msgid_plural_str := sapply(msgid_plural, paste, collapse = "||")]
ja[, msgid_plural_str := sapply(msgid_plural, paste, collapse = "||")]

merge(
  pot, ja,
  by = c("message_source", "type", "msgid", "msgid_plural_str")
)[
  grepl('^\\s+', msgstr.x) != grepl('^\\s+', msgstr.y),
  .N,
  by = message_source
]
#    message_source   N
# 1:              R 502

Not sure why the number doesn't exactly match what Weblate sees, but we should be able to use this as a first pass & see what else shakes out.

image

@daroczig, what's the best way to proceed from here? Strip the whitespace in that data.frame, write out a fresh po file, copy that into the weblate mirror somehow?

daroczig commented 1 year ago

we might disable this check [mismatched full stop]

I think it was only due to the extra trailing space, no? I've dropped the leading and trailing space and it resolved all 3 errors at https://translate.rx.studio/translate/r-project/base-r/ja/?q=%22%27%25s%27+is+defunct.%22&sort_by=-priority%2Cposition&offset=1#history

image

daroczig commented 1 year ago

what's the best way to proceed from here? Strip the whitespace in that data.frame, write out a fresh po file, copy that into the weblate mirror somehow?

I'm not sure if we need R for this - couldn't we just sed -i the ja.po files to drop the leading and trailing spaces in the "msgstr" lines? That might be the least intrusive operation I think .. although we might face issues with multi-line strings. Thoughts?

MichaelChirico commented 1 year ago

I think it was only due to the extra trailing space, no?

Oh, great. Seems odd the check duplicates the logic, but glad we can keep this otherwise useful consistency check.

we might face issues with multi-line strings

Yes, exactly why I wanted heavier machinery. In principle we could use awk for this. What I had in mind is a one-time fix, though, so the exact tooling is not so important. After that, the Weblate UI will show a warning, I hope that is enough, especially combined with checkPoTools() being used.

daroczig commented 1 year ago

Yeah, agreed it should be a one-time fix. And potools seems like a perfect tool for this :)

Regarding how to get this into weblate, we have a few options:

But we need to be cautious with the fuzzy tags, and also to keep track of what is to be reviewed and what was approved. Maybe better to test in a playground/sandbox component first? :zipper_mouth_face:

MichaelChirico commented 7 months ago

Did the simple sed approach here:

https://github.com/MichaelChirico/weblate-translations/commit/a9c6b6c30b4aaccea868afc61d1f9788e028f7a6

# leading ws
for FILE in $(grep -Elr '^msgstr " ' --include=*.po); do sed -Ei 's/^msgstr " /msgstr "/g' $FILE; done
# trailing ws
for FILE in $(grep -Elr '^msgstr.* "$' --include=*.po); do sed -Ei 's/^(msgstr.*) "$/\1"/g' $FILE; done

@daroczig PTAL.

We can address multiline issues later; looks like there's about 117 hits for the multiline case:

grep -Pzor '\n[^\n#]* "\n\n' --include=*.po
daroczig commented 7 months ago

that looks great, thanks!! :bow:

MichaelChirico commented 7 months ago

Hmm, not sure I've done the right thing here to proceed.

  1. I tried git push weblate master from my local copy with this commit applied; push got rejected
  2. I tried logging in to the weblate server and applying the changes at /var/lib/docker/volumes/weblate-docker_weblate-data/_data/vcs/r-project/base-r-gui (needed root access first). I don't see any pending sync needed at https://translate.rx.studio/projects/r-project/#repository, and I still see some of the changed strings on Weblate with leading/trailing whitespace in the UI.
  3. I tried logging in to the docker instance and looking at the repo at app/data/vcs/r-project/base-r-gui. Actually the commit from step 2 is already there.

I also glanced at the approach you mentioned above (manually uploading the edited files through the UI) but there are way too many individual files for that to be sustainable.

Any idea how to proceed (maybe we just need to wait for my manual edit to get picked up by the UI?)

MichaelChirico commented 7 months ago

Oh! I see the changes on Weblate now :tada:

image

@daroczig you didn't do anything yet I'm assuming?

MichaelChirico commented 7 months ago

OK, I took a pass at handling the multi-line case manually now too. I think there are a few more types of case missed, but this gets us in a much better state / shouldn't be as much of nuisance as a translator anymore. Closing.