HTML tag abbr seems to confuse the parser

HidenoriKobayashi commented 11 months ago

I'm redirected to here from https://github.com/google/comprehensive-rust/issues/1284.

The source: https://github.com/google/comprehensive-rust/blob/a38a33c8fba58678d0a9127d9644242bce41ed94/src/bare-metal/microcontrollers/probe-rs.md

The po file (bit outdated but updating it does not resolve the issue): https://github.com/google/comprehensive-rust/blob/a38a33c8fba58678d0a9127d9644242bce41ed94/po/ja.po#L13420

Now, if I make this change (just for testing purpose)

diff --git a/po/ja.po b/po/ja.po
index b2baaf215743..2a9eb0c872d4 100644
--- a/po/ja.po
+++ b/po/ja.po
@@ -13418,7 +13418,7 @@ msgstr ""

 #: src/bare-metal/microcontrollers/probe-rs.md:10
 msgid "`cargo-embed` is a cargo subcommand to build and flash binaries, log "
-msgstr ""
+msgstr "`cargo-embed`"

 #: src/bare-metal/microcontrollers/probe-rs.md:11
 msgid "RTT"

I get a result like this: localhost_3000_bare-metal_microcontrollers_probe-rs html

The strange thing about this is that string cargo-embed got merged into the last item in the list.

mgeisler commented 11 months ago

Thanks @HidenoriKobayashi!

I think there might be two problems here. First, we have this test, which demonstrates that we chop up the text when we encounter a HTML tag (including a HTML comment):

https://github.com/google/mdbook-i18n-helpers/blob/941a188219c5b0492850ad54ea41303dcee49c82/i18n-helpers/src/lib.rs#L745-L754

There is also this test, which demonstrates that we lose text immediately following a (block-level?) HTML tag:

https://github.com/google/mdbook-i18n-helpers/blob/941a188219c5b0492850ad54ea41303dcee49c82/i18n-helpers/src/lib.rs#L936-L950

I think it's an open question if we could or should extract the HTML into the POT file. One concern would be if we should just extract foo for something like <abbr title="foo">bar</abbr>? That would probably require us to know that the title attribute has translatable text (which is doable). We would still be chopping up the text — which I guess is nicer than asking everybody to deal with HTML in the translation files?

This is only about what gets extracted and put into the POT file. The other half of this is how we weave things together when we translate the Markdown later.

I just tried adding a little unit test for mdbook-gettext:

    #[test]
    fn test_translate_html() {
        let catalog = create_catalog(&[("foo ", "FOO "), ("bar", "BAR"), (" baz", " BAZ")]);
        assert_eq!(
            translate("foo <b>bar</b> baz", &catalog),
            "FOO <b>BAR</b> BAZ"
        );
    }

it fails with

---- tests::test_translate_html stdout ----
thread 'tests::test_translate_html' panicked at i18n-helpers/src/bin/mdbook-gettext.rs:239:9:
assertion failed: `(left == right)`

Diff < left / right > :
<FOO <b>BAR</b>BAZ
>FOO <b>BAR</b> BAZ

which shows that we lose a ' ' immediately following the </b>!

mgeisler commented 11 months ago

Further investigation shows that extract_events(" BAZ", state) give a Text(Borrowed("BAZ")) back! It should really give " BAZ" back so this seems to be where the space is lost.

mgeisler commented 11 months ago

This is actually fair: Markdown does not distinguish between a line of text starting with " foo" vs "foo". So when we extract the events, we lose the leading whitespace.

I'm not entirely sure where where we should account for this, but perhaps extract_events could add it back in? Or perhaps the caller should account for being in an inline context and then add it? Note that we have a pulldown_cmark_to_cmark::State object available and it knows that last_was_html... so that might be helpful here?

mgeisler commented 11 months ago

So I think these two tests should work:

diff --git a/i18n-helpers/src/bin/mdbook-gettext.rs b/i18n-helpers/src/bin/mdbook-gettext.rs
index b686be4..291bb2c 100644
--- a/i18n-helpers/src/bin/mdbook-gettext.rs
+++ b/i18n-helpers/src/bin/mdbook-gettext.rs
@@ -233,6 +233,15 @@ mod tests {
         );
     }

+    #[test]
+    fn test_translate_html() {
+        let catalog = create_catalog(&[("foo ", "FOO "), ("bar", "BAR"), (" baz", " BAZ")]);
+        assert_eq!(
+            translate("foo <b>bar</b> baz", &catalog),
+            "FOO <b>BAR</b> BAZ"
+        );
+    }
+
     #[test]
     fn test_translate_table() {
         let catalog = create_catalog(&[
diff --git a/i18n-helpers/src/lib.rs b/i18n-helpers/src/lib.rs
index 3bf2a5f..484d348 100644
--- a/i18n-helpers/src/lib.rs
+++ b/i18n-helpers/src/lib.rs
@@ -590,6 +590,30 @@ mod tests {
         assert_eq!(extract_events("", None), vec![]);
     }

+    #[test]
+    fn extract_events_leading_whitespace() {
+        assert_eq!(
+            extract_events("  foo", None),
+            vec![
+                (1, Start(Paragraph)),
+                (1, Text("  foo".into())),
+                (1, End(Paragraph)),
+            ]
+        );
+    }
+
+    #[test]
+    fn extract_events_trailing_whitespace() {
+        assert_eq!(
+            extract_events("foo  ", None),
+            vec![
+                (1, Start(Paragraph)),
+                (1, Text("foo  ".into())),
+                (1, End(Paragraph)),
+            ]
+        );
+    }
+
     #[test]
     fn extract_events_paragraph() {
         assert_eq!(

mgeisler commented 8 months ago

@kdarkhan, the problem here is about how we lose a ' ' when the Markdown AST is a mix of Text and Html nodes. I believe #125 is related: it's about how breaking apart the text on HTML elements causes problems. I think we could solve both problems if we would let Text and Html nodes be grouped together.

kdarkhan commented 3 months ago

I believe this should be resolved now with https://github.com/google/mdbook-i18n-helpers/pull/195 merged.

mgeisler commented 3 months ago

I think so too, let's close it!

google / mdbook-i18n-helpers

HTML tag abbr seems to confuse the parser #97