Add support for translation comments

mgeisler commented 1 year ago

We should teach mdbook-xgettext to react to comments in the Markdown. Comments could be used for:

Marking the next paragraph or code block as non-translatable. This could save translators a lot of work if there are many code blocks which shouldn't be translated because they don't have any strings or comments.
Add translator comments. Gettext supports comments on each message and xgettext extracts them using a special syntax (which we should try to replicate).
Add a message context. This is again s Gettext feature: the same message can appear multiple times, but with s different context each time.

dyoo commented 1 year ago

I'm interested in working on this one.

dyoo commented 1 year ago

Questions:

I'm assuming we'll be looking at HTML comments?

Do you already have some expectations on what the comment structure should look like for this operations?

To mark the next paragraph as non-translatable, for example, maybe: 
To add a translator comment, maybe <!-- TRANSLATOR: ... ->
I don't know what a message context is quite yet: I will need to read more about what "s Getttext feature" means. :)

I'm expecting that we'll want to expand the return type for extract_messages(), so that we can handle the translator comments or message contexts. Does that match your expectations as well?

mgeisler commented 1 year ago

Questions:

I'm assuming we'll be looking at HTML comments?

Yes, precisely! They are our universal back-channel :slightly_smiling_face:

Do you already have some expectations on what the comment structure should look like for this operations?

To mark the next paragraph as non-translatable, for example, maybe: 

Yeah, but I would prefix it with mdbook-xgettext: to make it clear that this applies to this particular program:

<!-- mdbook-xgettext: skip -->

```rust,editable
fn do_not_translate_me() {}


> * To add a translator comment, maybe `<!-- TRANSLATOR: ... ->`

This actually matches [this SO answer](https://stackoverflow.com/a/75947520/110204) which suggests calling `xgettext` as

xgettext --add-comments=TRANSLATORS


to extract C comments like this:

```c
// TRANSLATORS: A translator comment
printf(_("Hello world"));

As you might have guessed, our mdbook-xgettext plugin plays the role of xgettext from the normal Gettext suite. Since xgettext is configurable, we might want to make our keywords configurable as well. The configuration would be read from the book.toml file, similar to how we read pot-file today.

I don't know what a message context is quite yet: I will need to read more about what "s Getttext feature" means. :)

The GNU project has documentation here, which also mentions the message context: https://www.gnu.org/software/gettext/manual/html_node/PO-Files.html

I'm expecting that we'll want to expand the return type for extract_messages(), so that we can handle the translator comments or message contexts. Does that match your expectations as well?

That sounds like a good idea — I've not really thought about how this will affect the code. It might be time to expand the return type to a bigger type, perhaps our own or perhaps the polib Message type.

dyoo commented 1 year ago

Okay, I think I understand. I'm starting to hack to see if I can get a proof of concept going. Here's what I'm thinking of for handling skip, to do it at the place where we're gathering groups for translation:

https://github.com/google/mdbook-i18n-helpers/compare/main...dyoo:mdbook-i18n-helpers:skip

This is not heavily tested or quite right yet. But I hope it's in a good direction!

dyoo commented 1 year ago

Okay, I did a little bit more work on cleaning up the skip-handling code. https://github.com/google/mdbook-i18n-helpers/pull/69. I hope this looks good to you!

mgeisler commented 1 year ago

Your #69 looks great! I wanted to copy a comment from there to here since I think it can be useful for future PRs. Basically, we can take inspiration from what mdpo does.

I think we could use i18n as a shorthand for mdbook-i18n-helpers and implement these comments:

i18n-skip: this PR, skips next event,
i18n-disable: starts skipping events,
i18n-enable: stops skipping,
i18n-comment: translator comment,
i18n-context and i18n-ctx: sets message context for next (or following?) event.

The i18n- prefix is hopefully special enough that people realizes that these comments are special.

An immediate use case of this feature is skipping large code blocks in Comprehensive Rust. A related feature would be skipping code blocks automatically by default. A twist on this would be skipping code blocks that don't contain certain (configurable) patterns: if we don't see // or " in a code block, then we should not copy it to the PO file — at least for Comprehensive Rust where we only translate strings and comments.

dyoo commented 1 year ago

Overall next plan sketch:

Expand Group enum from tuple to struct, so that Translate can carry an optional translator comment along with the markdown events.
- I'm assuming that if there are multiple translator comment HTML comment directives, back-to-back, that we should concatenate, since polib::Message has room only for a single one.
Adjust group_events to watch for translator comments and to stuff them into the grouping context context.
Adjust extract messages to emit polib::messages.
- Perhaps lifting mdbook-xgetext::add_message into the lib to make that part easier to reuse, and adjust callers to extract_messages accordingly.

mgeisler commented 1 year ago

Overall next plan sketch:

The changes in your plan look great to me!

However, may I suggest a small detour: could we add a command line option to automatically skip code blocks which don't contain " or //? That would allow us to remove a lot of code blocks from the PO files today (thanks to the skip functionality already built).

I'm assuming that if there are multiple translator comment HTML comment directives, back-to-back, that we should concatenate, since polib::Message has room only for a single one.

That sounds very reasonable! There can be normal # comments anywhere in a PO file, but we're not trying to model those. We're only trying to capture special #. extracted comments as I just learnt they're called in the Gettext documentation. So it makes sense to concatenate multiple such extracted comments into a single comment (probably concatenate with \n\n?).

dyoo commented 1 year ago

However, may I suggest a small detour: could we add a command line option to automatically skip code blocks which don't contain " or //? That would allow us to remove a lot of code blocks from the PO files today (thanks to the skip functionality already built).

Your wish is my command. Draft: https://github.com/google/mdbook-i18n-helpers/pull/75

I need to revise it since it's being applied unconditionally here instead of as a command line option. Do you have suggestions on what that command line option should look like?

mgeisler commented 1 year ago

I need to revise it since it's being applied unconditionally here instead of as a command line option. Do you have suggestions on what that command line option should look like?

I added a bit about that to #75 — we can pass in options via book.toml and there we have free hands to create our own config keys. Something like

[output.xgettext]
skip-all-codeblocks = true
skip-codeblocks-except-contains = ["\"", "//"]

Perhaps the logic should be reversed? We could instead speak about "keeping code blocks" and use settings like

[output.xgettext]
keep-codeblocks = true
keep-codeblocks-matching = ["\"", "//"]

That way we have fewer double-negations in the mental model :smile:

The two options overlap, but I guess people will appreciate the clarity of writing keep-codeblocks = false over keep-codeblocks-matching []... Internally, one could translate keep-codeblocks = true into keep-codeblocks-matching = ["."].

dyoo commented 1 year ago

Okay, notes to self on implementation strategy here.

We want to read from the configuration object. I'm assuming that we want to extract from the Value section of the Config in an MDBook somehow.

It's a little unclear to me how the data flow works here, how we get that value threaded to the message extractor. Reading... okay, the main function of mdbook-xgettext.rs gets the RenderContext value.

I'm assuming that the places where we are calling extractMessages are all places where we can pass along a ctx: RenderContext. Let me check.

There are two places where we call extractMessages:

mdbook-i18n-normalize, in SourceMap.extract_messages. Not quite obvious where RenderContext is supposed to come from for that binary.
mdbook-xgettext, where it is obvious where RenderContext can be found.

So I think I may need a little more guidance here since I'm unfamiliar with the usage of mdbook-i18n-normalize; is it meaningful to have the RenderContext available when using that binary?

mgeisler commented 1 year ago

So I think I may need a little more guidance here since I'm unfamiliar with the usage of mdbook-i18n-normalize; is it meaningful to have the RenderContext available when using that binary?

Good question... on one hand, mdbook-i18n-normalize probably doesn't have to know about translation comments: it's purpose it to do a PO -> PO transformation, in particular to pick up new breaking changes such as the one I made in version 0.2 (#25, #27). It is not needed to implement skipping messages: they will automatically be removed from the PO files when people regenerate the POT file and then run msgmerge.

On the other hand, it might be good to do the filtering after extract_messages. Let extract_messages be the canonical low-level function which turns Markdown into smaller pieces of Markdown and let another function filter and post-process these messages. That would avoid putting too much responsibility into this one function.

dyoo commented 1 year ago

Yeah, I was also thinking that I'm doing too much at the extract_message level there. Let me see if I can refactor things a bit so that we cleanly separate the filtering as a separate step.

dyoo commented 1 year ago

Apologies for not giving a status report for a few weeks; I caught a nasty bug a few weeks ago that has been sapping a lot of energy from me.

I've begun the work on refactoring the directive detection so that we can

conditionally use a different prefix
support translation comments

https://github.com/google/mdbook-i18n-helpers/compare/main...dyoo:mdbook-i18n-helpers:refactor-messages?expand=1

Still in progress and being cleaned up, but hopefully I'll get this ready for review shortly. I Just wanted to give you a heads up @mgeisler and to let you know I'm still alive. :)

dyoo commented 1 year ago

https://github.com/google/mdbook-i18n-helpers/pull/107 is the second part to thread translation comments up to extract_messages. I need to teach mdbook-xgettext about these comments; doing this next.

mgeisler commented 1 year ago

Still in progress and being cleaned up, but hopefully I'll get this ready for review shortly. I Just wanted to give you a heads up @mgeisler and to let you know I'm still alive. :)

Yay, good to see you're still here :smile: Thanks for putting up the PRs!

google / mdbook-i18n-helpers

Add support for translation comments #63