Extract only strings and comments from codeblock

dalance commented 10 months ago

Fixes #95

This PR adds codeblock parsing support to extract only strings and comments.

google-cla[bot] commented 10 months ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

mgeisler commented 10 months ago

Hi @dalance,

This PR adds codeblock parsing support to extract only strings and comments.

Wow, thank you so much! This looks really good from a quick look!

Could you test how https://google.github.io/comprehensive-rust/zh-CN/hello-world/small-example.html looks after this? In particular, does mdbook-i18n-normalize do the right thing here? It should turn this message into 5 individual messages, one for each line comment.

@dyoo has written a lot of this code and I hope he can review this. Also, I think this will collide with #107 and the design he started there to support things like translation comments. So we should make sure we can get everything working nicely together here :slightly_smiling_face:

I think we would want an option in book.toml for this? That could certainly be added later down the road if someone asks for it.

dalance commented 10 months ago

Hi @mgeisler, Thank you for your review. I fixed some codes which are suggested.

The small-example.md becomes like below. I think it is an expected result.

#: src/hello-world/small-example.md:6
msgid "// Program entry point\n"
msgstr ""

#: src/hello-world/small-example.md:7
msgid "// Mutable variable binding\n"
msgstr ""

#: src/hello-world/small-example.md:8 src/traits/impl-trait.md:15
msgid "\"{x}\""
msgstr ""

#: src/hello-world/small-example.md:8
msgid "// Macro for printing, like printf\n"
msgstr ""

#: src/hello-world/small-example.md:9
msgid "// No parenthesis around expression\n"
msgstr ""

#: src/hello-world/small-example.md:10
msgid "// Math like in other languages\n"
msgstr ""

#: src/hello-world/small-example.md:15
msgid "\" -> {x}\""
msgstr ""

The conflicting point with #107 seems to be the modifitation to Group enum. It may be better that splitting this PR to changing to Group which owns Event and adding parse_codeblock function?

I think we would want an option in book.toml for this? That could certainly be added later down the road if someone asks for it.

This PR changes messages.pot significantly. So an option to disable this feature may be useful for existing users.

dalance commented 10 months ago

I changed some logics to treat continuous line comments. This is because the previous code can't handle indented continuous line comments.

mgeisler commented 10 months ago

Thanks for all the updates!

I think we would want an option in book.toml for this? That could certainly be added later down the road if someone asks for it.

This PR changes messages.pot significantly. So an option to disable this feature may be useful for existing users.

Yeah, I can imagine it will change the files a lot — but for the better!

Please consider adding a line about this new behavior to the documentation. I think it could just be a little paragraph in Marking Sections to be Skipped for Translation which says something like

Note that we don't extract the full text of code blocks. Only text that is recognized as comments and literal strings is extracted.

Something like that, feel free to rephrase it how you like.

The small-example.md becomes like below. I think it is an expected result.

Yes, this looks great! Please also try running

mdbook-i18n-normalize po/zh-CN.po po/zh-CN-normalized.po

and verify that the existing translations are correctly paired up with the new messages. The normalization tool is what is supposed to let us change how we extract messages while also keeping existing translations intact. If I did things correctly, it should "just work", but it'll be nice to have confirmation from you :slightly_smiling_face:

dalance commented 10 months ago

The result of mdbook-i18n-normalize is below. It works fine too.

#: src/hello-world/small-example.md:6
msgid "// Program entry point\n"
msgstr "// 程序入口\n"

#: src/hello-world/small-example.md:7
msgid "// Mutable variable binding\n"
msgstr "// 可变变量绑定\n"

#: src/hello-world/small-example.md:8
#: src/traits/impl-trait.md:15
msgid "\"{x}\""
msgstr "\"{x}\""

#: src/hello-world/small-example.md:8
msgid "// Macro for printing, like printf\n"
msgstr "// 与 printf 类似的输出宏\n"

#: src/hello-world/small-example.md:9
msgid "// No parenthesis around expression\n"
msgstr "// 表达式周围没有括号\n"

#: src/hello-world/small-example.md:10
msgid "// Math like in other languages\n"
msgstr "// 与其他语言类似的数值计算\n"

#: src/hello-world/small-example.md:15
msgid "\" -> {x}\""
msgstr "\" -> {x}\""

mgeisler commented 10 months ago

@henrif75, when this is merged, we will do a 0.3 release. This is a breaking change: we need to run mdbook-i18n-normalize on our translations to keep any translated comments intact (as tested by @dalance in https://github.com/google/mdbook-i18n-helpers/pull/109#issuecomment-1784795489). I'll make sure to explain this again in the release notes.

dalance commented 10 months ago

I found a normalization issue. The result of re-executed nomalization to po/zh-CN-normalized.po is here.

#: src/hello-world/small-example.md:6
msgid "// Program entry point"
msgstr "// 程序入口"

The trailing linebreak is missing. This is because msgid/msgstr don't have a context which is in codeblock or not. I can add a workaround like below, but hand-written detection is required because language is not clear.

pub fn extract_messages(document: &str) -> Vec<(usize, String)> {
    if document.starts_with("//") && document.ends_with("\n") {
        return vec![(1, document.to_string())];
    }

How do you think the workaround? Or any information which shows msgid/msgstr is originated from codeblock should be added to po file?

dalance commented 10 months ago

I fixed the reviewed issues. Additionally, I found my mistake about the usage of syntect stack through fuzz test, and fixed it too.

dalance commented 10 months ago

I think adding codeblock flag to po file like below to resolve the normalization issue may be better. It seems to work fine at my codebase.

#: src/hello-world/small-example.md:6
#, codeblock
msgid "// Program entry point"
msgstr "// 程序入口"

henrif75 commented 10 months ago

@henrif75, when this is merged, we will do a 0.3 release. This is a breaking change: we need to run mdbook-i18n-normalize on our translations to keep any translated comments intact (as tested by @dalance in #109 (comment)). I'll make sure to explain this again in the release notes.

Sounds good. I assume it's going to break ongoing PRs?

dyoo commented 10 months ago

I am so sorry for late review comments; I thought I pressed Submit, and it turns out that I did not. :(

dalance commented 10 months ago

I refrected the review, introduced Flags struct to resolve the normarilzation issue, and rebased to the latest main.

dalance commented 10 months ago

My solution seems to be broken... msginit can't handle flags in messages.pot. Another solution may be required.

dalance commented 10 months ago

Built-in flags seems to be passed through msginit. So using c-format flag instead of codeblock flag may resolve the issue. But it may be confusable.

mgeisler commented 10 months ago

I refrected the review, introduced Flags struct to resolve the normarilzation issue, and rebased to the latest main.

If this is only for normalization, then please don't add extra code to support it — normalization is something translators (or @henrif75) will run only once after a new breaking release. I don't think the main code should be expanded to support it.

In this case, I believe the problem are some newlines which are incorrect after running normalize? As long as the normal flow works, then these messages will be corrected by msgmerge and that's good enough.

@dyoo, are you happy with the logic here? Please finalize the review so we can merge it.

dalance commented 10 months ago

If there is no need to resolve the normalization issue, I'll remove the logic. I think migration from v0.2 to v0.3 will works fine without it.

dalance commented 10 months ago

I removed the logic related to Flags.

mgeisler commented 10 months ago

If there is no need to resolve the normalization issue, I'll remove the logic.

Thanks @dalance!

@henrif75, when this is merged, we will do a 0.3 release. This is a breaking change: we need to run mdbook-i18n-normalize on our translations to keep any translated comments intact (as tested by @dalance in #109 (comment)). I'll make sure to explain this again in the release notes.

Sounds good. I assume it's going to break ongoing PRs?

Hey @henrif75, yes, it will affect PRs that touches code blocks. Most PRs don't, I believe so the impact should be less than when the tooling learnt to ignore most Markdown formatting in version 0.2.

Note that there will only be conflicts if we run the normalization tool on a translation — until we do, the translation will work just fine in all places except for the code blocks.

So I suggest that we land the current in-flight PRs, run the normalization tool, and then people can continue translating like normal (but with much smaller PO files: a quick test shows that we go from 19k to 16k lines!).

dalance commented 10 months ago

I restored heuristic logic for codeblock which doesn't have lang-specifier, and refactored stack operation logic.

mgeisler commented 10 months ago

Amazing, thanks @dyoo for the review and thanks @dalance for all the work here! I'm very excited to merge and release this to the world :smile:

mgeisler commented 10 months ago

Sorry for the tiny merge conflict in lib.rs, @dalance :slightly_smiling_face: We should be able to merge this right after you resolve it.

dalance commented 10 months ago

I resolved the conflict.

google / mdbook-i18n-helpers

Extract only strings and comments from codeblock #109