google / mdbook-i18n-helpers

Translation support for mdbook. The plugins here give you a structured way to maintain a translated book.
Apache License 2.0

Extract only strings and comments from code blocks #95

Closed mgeisler closed 10 months ago

mgeisler commented 11 months ago

As a further step after #75, we should offer an option to only extract literal strings and comments from the code.

For this example:

fn pick_one<T>(a: T, b: T) -> T {
    if std::process::id() % 2 == 0 { a } else { b }
}

fn main() {
    println!("coin toss: {}", pick_one("heads", "tails"));
    println!("cash prize: {}", pick_one(500, 1000));
}

we would end up with just four small strings ("coin toss: {}", "heads", "tails", and "cash prize: {}") in the POT file.

This would require us to process Tag::CodeBlock in a more fine-grained way, but I think it could be worth it.

The fun part would be to find a cross-language solution. I suspect our best bet would be to use a syntax highlighting library: such libraries normally detect strings and comments, so one of them should have the necessary machinery.
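To make the idea concrete, here is a minimal sketch of what "only extract strings and comments" could mean at the byte level. It is not based on a syntax highlighting library: it is a naive hand-written scanner that only understands double-quoted string literals and `//` line comments, so it is not cross-language and it would be confused by char literals such as `'"'`, raw strings, and block comments. The names `Kind` and `extract_fragments` are hypothetical and appear nowhere in the project.

```rust
use std::ops::Range;

/// Kind of translatable fragment found inside a code block (hypothetical type).
#[derive(Debug, PartialEq)]
enum Kind {
    Str,
    Comment,
}

/// Hypothetical helper: returns the byte ranges of double-quoted string
/// literals (without the quotes) and of `//` line comments (with the slashes).
/// Assumption: the input looks like Rust or C; nothing else is recognized.
fn extract_fragments(code: &str) -> Vec<(Kind, Range<usize>)> {
    let bytes = code.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'"' => {
                let start = i + 1;
                i = start;
                while i < bytes.len() && bytes[i] != b'"' {
                    // Skip the byte after a backslash so `\"` does not end the literal.
                    i += if bytes[i] == b'\\' { 2 } else { 1 };
                }
                // An unterminated literal simply runs to the end of the input.
                let end = i.min(bytes.len());
                out.push((Kind::Str, start..end));
                i = end + 1;
            }
            b'/' if bytes.get(i + 1) == Some(&b'/') => {
                let start = i;
                while i < bytes.len() && bytes[i] != b'\n' {
                    i += 1;
                }
                out.push((Kind::Comment, start..i));
            }
            _ => i += 1,
        }
    }
    out
}
```

Slicing the original text with each returned range (`&code[range]`) gives the fragment that would go into the POT file. Keeping byte ranges rather than copied strings also leaves room to splice translated fragments back into the otherwise untouched code later. A real implementation would presumably swap this scanner for the scopes reported by a syntax highlighting library.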

0scvr commented 11 months ago

A crate like syntect would probably work nicely. For an example, see SyntaxSet::find_syntax_by_token: https://docs.rs/syntect/latest/syntect/parsing/struct.SyntaxSet.html#method.find_syntax_by_token

mgeisler commented 11 months ago

Yeah, exactly! I'm hoping that we can use that library to get the byte position of string literals and comments. I don't have any experience with syntect, but I hope you can look at it.

dalance commented 10 months ago

I have tried to use syntect at #109. It seems to work fine.

0scvr commented 10 months ago

> I have tried to use syntect at #109. It seems to work fine.

That's awesome! It will probably reduce the line count in messages.po by a good margin.