git-l10n / git-po-helper

Helper program for checking l10n contributions
Other
15 stars 6 forks source link

RFC: Removing po/git.pot from in-repo, dynamically adding the <file:line> comments #12

Closed avar closed 2 years ago

avar commented 2 years ago

I gave implementing $subject a try, after trying to clean up some old issues in the Makefile integration that date all the way back to when I originally added the i18n support.

I tried creating an RFC PR, but it kept redirecting me to git/git when trying to compare against my fork of it: https://github.com/git/git/compare/master...avar:avar/Makefile-incremental-po-git-pot-rule; Also they're changes outside of po/, so should go through the ML.

@jiangxin Some of the early stuff should definitely be uncontroversial, e.g. making make pot suck less and not do reset --hard.

But what do you think about things closer to the tip of the branch in general, i.e. getting rid of checking-in po/git.pot. Elaborated on in https://github.com/git/git/commit/ba3150650d3dd7f4c8d9744846e65f09f3cf0db0

I think getting rid of the "diff churn" for po/ would be great, and AFAICT the missing bits are:

  1. You wouldn't be able to update po/git.pot in your repo, instead we could either just commit a git rev-parse HEAD >po/git.pot.rev, or you could publish a ref/tag with the current OID from which to make po/git.pot.
  2. We'd need a tiny bit of tool juggling so that when you work on a po/LANG.po you'd do so with the version that has file:line comments, and then for submission we'd scrape those out before pushing to git.git.
    1. I hadn't looked over git-po-helper in any detail before. A lot of it is things that really seem to belong in an extended validation tool, e.g. the typo checking, but for e.g. checkGettextIncompatibleIssues() perhaps it's something we should add to the make check-po I added in the above series.
    2. Ditto for util/init.po. I.e. I hacked up a basic replacement in https://github.com/git/git/commit/a1bd6ae1f9be6cd42666c29b7e65a6434320a624; once that's extended a bit to give a new po/* file a canonical header.
    3. Anyway, I looked into it because for #2 above we'd need to integrate the msgmerge somehow with my new make rules. I.e. for the whole header juggling
    4. I haven't checked, but textconv filters should be able to make all of that transparent. I.e. the repo would store the version without the file:line, but we'd restore it on-demand, and on commit scrub it out.
avar commented 2 years ago

Update: For #4 I finished that up in : https://github.com/avar/git/commit/6d18427529031d0332b620eb4def3f28517d4a43

avar commented 2 years ago

For the make core-pot in particular, did you consider something like:

diff --git a/gettext.h b/gettext.h
index d209911ebb8..f5c6310173e 100644
--- a/gettext.h
+++ b/gettext.h
@@ -27,6 +27,7 @@
 #endif

 #define FORMAT_PRESERVING(n) __attribute__((format_arg(n)))
+#include "trace2.h"

 #ifndef NO_GETTEXT
 void git_setup_gettext(void);
@@ -45,6 +46,7 @@ static inline FORMAT_PRESERVING(1) const char *_(const char *msgid)
 {
        if (!*msgid)
                return "";
+       trace2_data_string("gettext", NULL, "msgid", msgid);
        return gettext(msgid);
 }

Which we could hide behind some ifdef, and then:

GIT_TRACE2_EVENT=/tmp/tr2.json make test

Running that I get e.g.:

vm ~ (master $=) 0 $ grep msgid /tmp/tr2.json |grep -o value.*|sort|uniq -c|sort -nr|grep -v ' [123] '|wc -l
1293
vm ~ (master $=) 0 $ grep msgid /tmp/tr2.json |grep -o value.*|sort|uniq -c|sort -nr|grep -v ' [12] '|wc -l
1769
vm ~ (master $=) 0 $ grep msgid /tmp/tr2.json |grep -o value.*|sort|uniq -c|sort -nr|grep -v ' [1] '|wc -l
2235
vm ~ (master $=) 0 $ grep msgid /tmp/tr2.json |grep -o value.*|sort|uniq -c|sort -nr|wc -l
3261

I.e. we have ~3k messages the test suite will emit in total, and if you cut that down to the one we emit at least 4 times it's down to ~1k. Make that >=10 and it's ~700.

That seems to me to be a much better way to ask translators to translate the top N messages, i.e. right there at the top are things like:

  81984 value":"done"}
  11129 value":"alias of --%s"}
   8731 value":"Updating the following directories would lose untracked files in them:\n%s"}
   8731 value":"The following paths were already present and thus not updated despite sparse patterns:\n%s"}
   8731 value":"The following paths are unmerged and were left despite sparse patterns:\n%s"}
   8731 value":"The following paths are not up to date and were left despite sparse patterns:\n%s"}
   8731 value":"Refusing to remove the current working directory:\n%s"}
   8731 value":"Entry '%s' overlaps with '%s'.  Cannot bind."}
   8731 value":"Cannot update submodule:\n%s"}
   7945 value":"gone"}
   7944 value":"behind %d"}
   7944 value":"ahead %d, behind %d"}
   7944 value":"ahead %d"}
   7244 value":"no upstream configured for branch '%s'"}
   6725 value":"Filtering content"}
   6724 value":"Updating files"}
   5071 value":"Your local changes to the following files would be overwritten by checkout:\n%%sPlease commit your changes or stash them before you switch branches."}
   5071 value":"The following untracked working tree files would be removed by checkout:\n%%sPlease move or remove them before you switch branches."}
   5071 value":"The following untracked working tree files would be overwritten by checkout:\n%%sPlease move or remove them before you switch branches."}
   4029 value":"%shint: %.*s%s\n"}
   3641 value":"expected response end packet after ref listing"}
   3014 value":"Your local changes to the following files would be overwritten by merge:\n%%sPlease commit your changes or stash them before you merge."}
   3014 value":"The following untracked working tree files would be removed by merge:\n%%sPlease move or remove them before you merge."}
   3014 value":"The following untracked working tree files would be overwritten by merge:\n%%sPlease move or remove them before you merge."}
   2775 value":"HEAD is now at %s"}
   2694 value":"Initialized empty Git repository in %s%s\n"}
   2442 value":"Needed a single revision"}
   2149 value":"Cloning into '%s'...\n"}
   2146 value":"Marking %s as complete"}
   2135 value":"Switched to a new branch '%s'\n"}
   1956 value":"'git help -a' and 'git help -g' list available subcommands and some\nconcept guides. See 'git help <command>' or 'git help <concept>'\nto read about a specific subcommand or con>
   1954 value":"Use binary search to find the commit that introduced a bug"}
   1954 value":"Update remote refs along with associated objects"}
   1954 value":"Switch branches"}
   1954 value":"Show various types of objects"}
   1954 value":"Show the working tree status"}
   1954 value":"Show commit logs"}
   1954 value":"Show changes between commits, commit and working tree, etc"}
   1954 value":"Restore working tree files"}
   1954 value":"Reset current HEAD to the specified state"}
   1954 value":"Remove files from the working tree and from the index"}
   1954 value":"Record changes to the repository"}
   1954 value":"Reapply commits on top of another base tip"}
   1954 value":"Print lines matching a pattern"}
   1954 value":"Move or rename a file, a directory, or a symlink"}
   1954 value":"List, create, or delete branches"}
   1954 value":"Join two or more development histories together"}
   1954 value":"Fetch from and integrate with another repository or a local branch"}
   1954 value":"Download objects and refs from another repository"}
   1954 value":"Create, list, delete or verify a tag object signed with GPG"}
   1954 value":"Create an empty Git repository or reinitialize an existing one"}
   1954 value":"Clone a repository into a new directory"}
   1954 value":"Add file contents to the index"}
   1951 value":"work on the current change (see also: git help everyday)"}
   1951 value":"These are common Git commands used in various situations:"}
   1951 value":"start a working area (see also: git help tutorial)"}
   1951 value":"grow, mark and tweak your common history"}
   1951 value":"examine the history and state (see also: git help revisions)"}
   1951 value":"collaborate (see also: git help workflows)"}
   1945 value":"usage: %s\n\n"}
   1852 value":"new file:"}
   1818 value":"Switched to branch '%s'\n"}
   1740 value":" (root-commit)"}
   1739 value":"done.\n"}
   1731 value":"[new branch]"}
   1473 value":"\nCommands:\np, pick <commit> = use commit\nr, reword <commit> = use commit, but edit the commit message\ne, edit <commit> = use commit, but stop for amending\ns, squash <commit>
   1455 value":"\nIf you remove a line here THAT COMMIT WILL BE LOST.\n"}
   1442 value":"\nHowever, if you remove everything, the rebase will be aborted.\n\n"}
   1419 value":"Rebasing (%d/%d)%s"}
   1339 value":"Removing %s\n"}
   1216 value":"modified:"}
   1145 value":"branch '%s' set up to track '%s'."}
   1040 value":"From %.*s\n"}
   1018 value":"git fetch-pack: expected response end packet"}
jiangxin commented 2 years ago

I gave implementing $subject a try, after trying to clean up some old issues in the Makefile integration that date all the way back to when I originally added the i18n support.

I tried creating an RFC PR, but it kept redirecting me to git/git when trying to compare against my fork of it: git/git@master...avar:avar/Makefile-incremental-po-git-pot-rule; Also they're changes outside of po/, so should go through the ML.

@jiangxin Some of the early stuff should definitely be uncontroversial, e.g. making make pot suck less and not do reset --hard.

It's definitely a great improvement to remove the dirty part of changing files on the spot in git worktree.

But what do you think about things closer to the tip of the branch in general, i.e. getting rid of checking-in po/git.pot. Elaborated on in git/git@ba31506

I think getting rid of the "diff churn" for po/ would be great, and AFAICT the missing bits are:

  1. You wouldn't be able to update po/git.pot in your repo, instead we could either just commit a git rev-parse HEAD >po/git.pot.rev, or you could publish a ref/tag with the current OID from which to make po/git.pot.

It's ok to remove "po/git.pot" and recreate it in runtime. By introducing new rules, such as make po/zh_CN.pot, l10n team leaders can update their "po/XX.po" from the runtime version "po/git.pot" after they received "L10N" window open notifications. I can check manually or using CI pipeline to check the base version of "po/git.pot" the l10n leader used in their pull request.

  1. We'd need a tiny bit of tool juggling so that when you work on a po/LANG.po you'd do so with the version that has file:line comments, and then for submission we'd scrape those out before pushing to git.git.

It's not good to remove location lines. When there are location meta info in "po/XX.po", I can locate the original source code for reference by a single hotkey "s" when editing "po/XX.po" using emacs+po-mode. I will try to make some improvements based your branch: "Makefile-incremental-po-git-pot-rule", and send pull request to you later.

  1. I hadn't looked over git-po-helper in any detail before. A lot of it is things that really seem to belong in an extended validation tool, e.g. the typo checking, but for e.g. checkGettextIncompatibleIssues() perhaps it's something we should add to the make check-po I added in the above series.

It also checks the commit message for:

  1. Ditto for util/init.po. I.e. I hacked up a basic replacement in git/git@a1bd6ae; once that's extended a bit to give a new po/* file a canonical header.
  2. Anyway, I looked into it because for #2 above we'd need to integrate the msgmerge somehow with my new make rules. I.e. for the whole header juggling
  3. I haven't checked, but textconv filters should be able to make all of that transparent. I.e. the repo would store the version without the file:line, but we'd restore it on-demand, and on commit scrub it out.

I am afraid it not easy to restore locations without msgcat all the generate po template files with location lines. Keeping the location lines in "po/XX.po" in repo seems the only proper way. I may change my mind by reviewing/hacking your commit one by one later.

jiangxin commented 2 years ago

@avar , How about this simpler commit, which can insert before patch 2/10.

jiangxin commented 2 years ago

That seems to me to be a much better way to ask translators to translate the top N messages, i.e. right there at the top are things like:


  81984 value":"done"}
  11129 value":"alias of --%s"}
   8731 value":"Updating the following directories would lose untracked files in them:\n%s"}
   8731 value":"The following paths were already present and thus not updated despite sparse patterns:\n%s"}

It's amazing and cool to track the most used i18n message. 👍

jiangxin commented 2 years ago

The l10n CI workflow just discovered an issue for git-l10n/git-po#613 by git-po-helper. We can add "make check-po" in this workflow.

avar commented 2 years ago

It's definitely a great improvement to remove the dirty part of changing files on the spot in git worktree.

Great, will pursue that part

How about this simpler commit, which can insert before patch 2/10.

Yeah that's a good idea, I considered that & we can definitely save the space there.

One drawback of that is that we'll always be shell-ing out for generating it the 2nd time, which is slower in the Makefile than just stat-ing stuff, which it mostly has to do anyway.

But it's also just that I've been trying to get rid of shell-ing out in general in the Makefile, it's usually a bad pattern.

But for this we could also make an in-between list of files that do or don't contain PRItime.

But I thought all of that just wasn't worth it, it's around 10MB of disk space, give-or-take, and a one-off for the "pot" generation. Do you think it's OK to just keep it simpler & dumber as it is?:)

It's not good to remove location lines [for e.g. using in Emacs].

Yes I absolutely agree that would suck big time. To clarify I'm committed to giving the translators the same thing.

I'm just wondering if it's OK to not commit the lines to the repo.

That does mean that the translators need to run the equivalent of "make pot" on a checkout, but either a make target or other one-shot tool would do it for them.

Do you think that would be OK, or a no-go?

Most of what I'm trying to resolve here is "git log -p" and "git log -p -G" on on the repo before I remember to exclude the po/ dir, I.e. having the churn of lines shows up in lots of other places.

I think it also makes life easier for translators when doing git log -p on their translated file. I.e. the lines are really only needed for context for when you're working on the file in your editor.

But everything afterwards is less useful, e.g. you always have to remember to do "git log -p -G'[^#].*stash'" instead of 'git log -p -Gstash' on your XX.po if you want to find a translation with "stash" in it.

Anyway, below you say:

I am afraid it not easy to restore locations without msgcat all the generate po template files with location lines.

Hrm, maybe I'm missing something but I've been able to round-trip it so far.

I.e. the msgmerge code doesn't seem to rely on them from merging, and even if it did in the future it would be for obscure edge cases, since it de-dupes on msgid anyway (I really can't imagine how it would use it).

But simply generating a new git.pot and msgmerge-ing it in seems to work (exactly as happens now), then just msgcat-ing at the end to get rid of the line numbers with the --no-location option.

It also checks the commit message for:

Oh yes, I didn't mean to suggest to replace it all.

Just for some of this build process stuff (e.g. make check-po) it's handy to bring /some/ of it in, because we'll want subsets of it in git.git's CI and the like, without having to depend on the external tool.

It's amazing and cool to track the most used i18n message. +1

Yeah, handy. I was hoping that the gettext tooling would haves some way to define a sort order for messages so we could arrange the files like that (at least new ones), but alas it seems if we wanted that it's entirely manual...

avar commented 2 years ago

But it's also just that I've been trying to get rid of shell-ing out in general in the Makefile, it's usually a bad pattern.

To elaborate a bit make semantics also aren't ideal here, i.e. you can do:

VAR := $(shell ...)

Which will be unconditionally expanded, so all rules pay the cost, or:

VAR = $(shell ...)

Which is only needed "on demand", but then if you use make -jN you'll get N processes all running the same command.

avar commented 2 years ago

Hrm, maybe I'm missing something but I've been able to round-trip it so far.

I mean (on master):

msgcat --no-location po/de.po >tmp && mv tmp po/de.po
git add -u
git commit -m"de without locations"
perl -pi -e 's/hinzuge/blahzuge/g' po/de.po
git add -u
git commit -m"changed some translations"

Then:

make pot
cd po
msgmerge --add-location --backup=off -U de.po git.pot
git add -u
git commit -m"bump from pot"

You'll get the version with all the line numbers added, and from looking at it it merged everything correctly, likewise with --no-location. In that case (because I changed the length of the message) it'll even re-wrap some of them:

diff --git a/po/de.po b/po/de.po
index 2280b2f7370..7e419b83b04 100644
--- a/po/de.po
+++ b/po/de.po
@@ -4015,8 +4015,9 @@ msgid ""
 "CONFLICT (file location): %s added in %s inside a directory that was renamed "
 "in %s, suggesting it should perhaps be moved to %s."
 msgstr ""
-"KONFLIKT (Speicherort): %s blahzugefügt in %s innerhalb eines Verzeichnisses, "
-"das umbenannt wurde in %s, es sollte vielleicht nach %s verschoben werden."
+"KONFLIKT (Speicherort): %s blahzugefügt in %s innerhalb eines "
+"Verzeichnisses, das umbenannt wurde in %s, es sollte vielleicht nach %s "
+"verschoben werden."

 #, c-format
 msgid ""
@@ -4260,8 +4261,8 @@ msgstr ""
 #, c-format
 msgid "CONFLICT (rename/add): Rename %s->%s in %s.  Added %s in %s"
 msgstr ""
-"KONFLIKT (umbenennen/hinzufügen): Benenne um %s->%s in %s. %s blahzugefügt in "
-"%s"
+"KONFLIKT (umbenennen/hinzufügen): Benenne um %s->%s in %s. %s blahzugefügt "
+"in %s"

 #, c-format
 msgid "%s is a directory in %s adding as %s instead"

So it really seems to me like we can just remove these from being checked in with some tiny bit of tooling which does the msgmerge [--add-location|--no-location] / msgcat --no-location dance at the right times.

Which would all be nice and easy-to-use make targets.

jiangxin commented 2 years ago

But it's also just that I've been trying to get rid of shell-ing out in general in the Makefile, it's usually a bad pattern.

To elaborate a bit make semantics also aren't ideal here, i.e. you can do:

VAR := $(shell ...)

Yes, for this case the simpler expended variable is better than recursively expanded variable.

Update: fixed in commit: jiangxin/git@ec62ec9f5eb4f4e88442669c8878b71ebe4eede1

jiangxin commented 2 years ago

Add another commit^1 in "review/avar" branch^2 which add new pattern rule to help l10n team leader update their po/XX.po.

jiangxin commented 2 years ago

I installed a custom diff driver to generate pretty diffs for po files. You can find the driver in contrib dir:

So the file locations in the po file never bothered me.

Update: we can add this tips in po/README.md

avar commented 2 years ago

Yes, for this case the simpler expended variable is better than recursively expanded variable.

Not really, as noted upthread about := v.s. '='. I.e. now the cost of the := will be paid by any invocation on make. Even from cache it takes around ~40ms for me.

What problem are you trying to solve there? The disk usage? If so a much smaller change is to pull the same "ln or cp" trick as can be seen in the QUIET_LNCP callers.

If the marginal speed-up to avoid the cp that doesn't seem worth it (and to the extent that we care an ln would make it faster).

avar commented 2 years ago

I should also say: The rest of what you mentioned in your updates sounds really interesting, much more interesting than me nitpicking about make invocations :)

I just didn't have time to look at it all yet, and might not in the coming days...

I'd be really happy if you wanted to just pick this up & send whatever you think is appropriate to the mailing list in due time, even if that's just part of this (e.g. the not-reset-hard make, but not the don't-add-line-numbers). We can always pursue the rest sometime later.

Right now git's in the rc period, so I'm likely to get distracted by something else, and even entirely forget about this until months in the future unless I'm reminded of it :)

jiangxin commented 2 years ago

Will pick up this thread after this release cycle is complete.

avar commented 2 years ago

Will pick up this thread after this release cycle is complete.

I gave it a try to make that shell rule with your optimization, but without shelling out. The result is a bit crazy, but works :): https://github.com/avar/git/commit/4be659ffbf9ba1e96e11dff32805a3e62374ba5e

jiangxin commented 2 years ago

Very busy these days, will pick up this issue next week.

jiangxin commented 2 years ago

Follow up discussion at this link: https://lore.kernel.org/git/20220526145035.18958-1-worldhello.net@gmail.com/