rust-lang / mdBook

Create book from markdown files. Like Gitbook but implemented in Rust
https://rust-lang.github.io/mdBook/
Mozilla Public License 2.0

Add multilingual support #5

Open azerupi opened 9 years ago

azerupi commented 9 years ago

Add support for multiple languages.

FuGangqiang commented 9 years ago

multiple languages for document?

azerupi commented 9 years ago

Yes, I think Gitbook does support something like that.

Instead of having the markdown files directly in the source folder you would have some sub folders like this:

src/
├── de
├── en
└── fr

And there would be an easy way to change the language in the rendered book.

It's definitely something I would like to add, but it's not the highest priority at the moment.

azerupi commented 8 years ago

Multiple designs possible:

mkpankov commented 8 years ago

I don't think one SUMMARY.md for everything is a good idea. I consider consistency within translated version more important than consistency with original. Otherwise, we can easily start having broken links because upstream renamed some chapter and translation didn't, yet. I believe a book that has no broken links is the minimum standard.

Also, I don't support the idea of "pushing" to be up-to-date. AFAIK, translations (not only ours) are done by enthusiasts and it's not always possible to keep up at all times.

Moreover, 1 to 1 mapping of pages doesn't look straightforward to me, even in case there's single SUMMARY. Words have different length in different languages, and in Russian translation we consistently have sentences that are noticeably longer than original. But I'd love to have it so that one click can show the same point in text in original language.

I think this can be handled by tracking 1-to-1 mapping of paragraphs - sections aka markdown files are too big. Paragraphs also seem a good candidate because sentences get paraphrased and reordered sometimes, but the paragraphs stay in same order and have same gist.

azerupi commented 8 years ago

Thanks for the input! I really appreciate the feedback :)

Otherwise, we can easily start having broken links because upstream renamed some chapter and translation didn't, yet. I believe a book that has no broken links is the minimum standard.

Moreover, 1 to 1 mapping of pages doesn't look straightforward to me

When I am talking about 1 to 1 mapping I am talking about page to page mapping, not sentence to sentence (that would be insane :wink:).

Let's take a hypothetical situation with the Rust book. Let's say I am reading a blog post and it references some chapter in the Rust book, for example the chapter about ownership. But English is not my main language and it would be a lot easier to understand the chapter in my native language. If we have 1 to 1 mapping on page / chapter level the user could then select his language (if it is supported) from a dropdown menu and he would land on the exact same page in his chosen language.

However for this to work correctly we need a guarantee that every page in one language has an equivalent page in the other language. If you allow a different SUMMARY.md per language there is no way to know what pages are equivalent if any equivalent page even exists at all.

Also, I don't support the idea of "pushing" to be up-to-date. AFAIK, translations (not only ours) are done by enthusiasts and it's not always possible to keep up at all times.

Of course, I totally agree with you. But the SUMMARY.md is only about structure, so what order the chapters come in, not the content.

If there is one SUMMARY.md for all languages I think it will only cause trouble if:

  1. New chapters get added, as equivalent chapter in other languages will just be blank until they are translated
  2. The markdown files get renamed. This should not happen often, and when it does it is not difficult to rename the files accordingly for every language
  3. The chapters get reordered in a way that breaks the continuity of the content. This too should not happen often, but it's more challenging to fix as it requires the translators to translate the text that changed

To be honest, once a book has its definitive structure the SUMMARY.md is not likely to change often unless there is a major rewrite being done.

I think both designs have advantages and drawbacks, we need to figure out which one we want / need the most.


Idea for Rust book workflow when translations are in tree

When / if translations are moved into the official repository we could create a more elaborate pull request process. This is only an idea, it may be flawed :wink:

When a pull request is made that contains changes that need translation (i.e. more than typo fixes) we could wait to merge the pull request until translations have been made for all officially supported languages.

The pull request could track what translations have been made using a check list like this:

Once all the translations are ready the pull request is merged in. Officially supported languages could be languages with a minimum number of "official" maintainers.

This would add a little / lot of overhead for the English version but it would solve the two big issues with translations.

  1. Translations would always be up to date!
  2. This is probably the easiest way to track changes

There may be organizational problems I haven't considered though. @steveklabnik

steveklabnik commented 8 years ago

The biggest problem with blocking English changes to non-English changes is that I am paid for my work, but others are not. This places a big burden on them; I'm gonna want to land changes ASAP, and that's not fair to people who can't do this as a day job.

azerupi commented 8 years ago

That's true, didn't think of that. It could still be applied without blocking the English changes? Just for tracking. Not sure if it's worth the overhead though.

Anyways, do you have a preference for any of the two design choices (one vs. multiple SUMMARY.md)?

steveklabnik commented 8 years ago

I think I prefer a single for the reasons you've stated, but since I'm not doing the translations themselves, I don't think my opinions matter much :)

And yeah, tracking might be different/better than actually blocking on them landing.

mkpankov commented 8 years ago

When I am talking about 1 to 1 mapping I am talking about page to page mapping, not sentence to sentence (that would be insane :wink:).

Ok, I think what I was trying to say but couldn't get across is this: page-to-page mapping isn't enough for printed versions, as same pages will have different content. And if by page you meant a web page, that is not enough either. Some sections (pages) are tens of screens long, and to provide smooth transition from one version to another we should track smaller units than entire files (web pages).

I originally thought you were talking about printed pages and wrote the following, but I'm not sure now. For printed versions, depending on the length of the section and the sentence-length difference with the original, this can vary from "I see not the beginning of the paragraph that talks about Foo feature, but the end" to "I don't see the paragraph that talks about Foo feature on screen at all", when linked to "page 83 of PDF".

So let's clarify the terms before continuing as apparently I misunderstood something :smile:

azerupi commented 8 years ago

Ok yes, I will try to do my best to explain what I envision:

So in this issue I am not at all talking about tracking any changes for translations, only about how to support multiple languages in the same folder / book.

Before I continue, let's explain what the SUMMARY.md does exactly.

When you render the book (mdbook build) it is going to search for the SUMMARY.md and parse it. The SUMMARY dictates

That is the "only" information we get from the SUMMARY.md

If we want to support multiple languages for one book, there are two possible designs (that I thought of):

Let's see both in more details.

One SUMMARY.md for all languages

Consider this SUMMARY.md for a book:

# Summary

- [hello world](hello-world.md)
- [second chapter](second-chapter.md)

and this directory structure:

├── book
└── src
    ├── en
    │   ├── hello-world.md
    │   └── second-chapter.md
    ├── fr
    │   ├── hello-world.md
    │   └── second-chapter.md
    ├── ru
    │   ├── hello-world.md
    │   └── second-chapter.md
    └── SUMMARY.md

As you can see here, every language has the same markdown files defined in the global SUMMARY.md. This means that the "hello world" chapter has a corresponding page in every language! (1 to 1 mapping)

Advantages

Having a guarantee that every chapter in one language has a corresponding chapter in another language gives us the possibility to change the language from any chapter and land on that same chapter in the other language.

Example: I am reading the "borrowing" chapter of the Rust book. I want to see that same chapter in French. I just select "French" from the dropdown button in the menu-bar and I will land on the French version of the chapter.
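
The language switch described here can be sketched in a few lines. This is only an illustration under assumptions: the helper name `switch_language` and the `<src_root>/<lang>/<chapter>` path handling are hypothetical and not part of mdBook. Because every language directory holds the same file names, finding the equivalent chapter is just a matter of swapping the language component of the path.

```rust
use std::path::{Path, PathBuf};

/// Given a chapter path such as `src/en/hello-world.md`, return the path of
/// the same chapter in `target_lang`. Hypothetical helper, not mdBook API.
/// Returns `None` if the path is not of the form `<src_root>/<lang>/<rest>`.
fn switch_language(src_root: &Path, chapter: &Path, target_lang: &str) -> Option<PathBuf> {
    // Strip the source root, leaving `<lang>/<rest of the path>`.
    let relative = chapter.strip_prefix(src_root).ok()?;
    let mut components = relative.components();
    // Drop the current language directory.
    components.next()?;
    let rest = components.as_path();
    if rest.as_os_str().is_empty() {
        return None;
    }
    Some(src_root.join(target_lang).join(rest))
}
```

With the layout above, switching `src/en/hello-world.md` to `fr` would yield `src/fr/hello-world.md`. No lookup is needed because the shared SUMMARY.md guarantees the file exists.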

Drawbacks

When the SUMMARY.md is modified it can cause some consistency problems in the translations because changes in the SUMMARY.md will be reflected immediately in all languages. However, changes in the SUMMARY.md should be relatively rare once the book has found its "final" structure.

Problems that could occur:

Content is not modified by the SUMMARY.md so any of the designs here is not going to cause any trouble with the content if the SUMMARY.md is modified.

Another drawback is that I am not sure yet how translations would provide translated chapter titles for the sidebar (SUMMARY.md). Maybe just take the first heading from the corresponding markdown file?

One SUMMARY.md for EVERY language

Let's consider this directory structure:

├── book
└── src
    ├── en
    │   ├── hello-world.md
    │   ├── second-chapter.md
    │   └── SUMMARY.md
    ├── fr
    │   ├── hello-world.md
    │   └── SUMMARY.md
    └── ru
        ├── hello-world.md
        ├── second-chapter.md
        └── SUMMARY.md

As you can see here, every language has its own SUMMARY.md and thus can define the order of its chapters and the markdown files as it wishes.

There is absolutely no more guarantee that the French version contains the same chapters as the English version. No 1 to 1 mapping. Essentially every language is its own separate book, they could have exactly the same structure or they could have totally different chapters. There is no way for the program to know that.

It is thus impossible to change the language from a chapter. You would have to navigate to the French version manually and search for the chapter you were reading, if it exists in the French version at all!

Advantages

Translations have a lot more freedom, but this can also be seen as a drawback. Translations do not need to have the same structure, so when the SUMMARY.md is changed in the English version, absolutely nothing is going to change in the other languages. Every change in the translations has to be done manually.

Drawbacks

There is no guarantee that a chapter in one language has an equivalent in another language (no 1 to 1 mapping). The program cannot know what chapters are equivalent in the different languages and it would thus be impossible to change the language from a chapter to land on the same chapter in the other language.


I hope this made it more clear, if there is still something you don't understand I can elaborate more on some specific area. :wink:

EDIT: A little quote from a response I made on Rust's internals forum:

And to be honest, if you have different TOCs you essentially have different books. There is little gain to support that, other than being able to group all the translations in one directory and build them in one go.

You can already group the multiple translations in one directory as different books, each with its own SUMMARY.md and book.json, and if you configure the source and destination directories correctly there should be minimum trouble to integrate with automatic deployment scripts etc.

defuz commented 8 years ago

There is no guarantee that a chapter in one language has an equivalent in another language.

Regarding the Rust Book translation process, this is not a disadvantage of a particular solution, but simply a fact. I think that the other projects that will use mdBook with multiple languages will have the same problem.

The program can not know what chapters are equivalent in the different languages and it would thus be impossible to change the language from a chapter to land on the same chapter in the other language.

Can we make it simple and assume that the files with the same name in different languages are the same chapter? Then we can give the opportunity to switch to another language. I think this approach will satisfy both cases:

  1. When there is complete consistency between all languages.
  2. When consistency between languages is not complete.

defuz commented 8 years ago

Also, I don't like the idea that when I read the book in Russian, I'll see TOC in English. I think we should not assume that the reader is familiar enough with the language of original to understand the chapter titles.

azerupi commented 8 years ago

When consistency between languages is not complete.

How would you handle that? On some pages you can change the language and on others not? That would be really confusing for users I think.

Also, I don't like the idea that when I read the book in Russian, I'll see TOC in English.

Of course that was not the plan, I just hadn't found a good solution for it yet so I didn't discuss it too much.

defuz commented 8 years ago

How would you handle that? On some pages you can change the language and on others not? That would be really confusing for users I think.

Why not? We can clearly indicate that the translation for this chapter is not available yet. Another possible situation is that translation for some languages is available, but for other languages it's not.

defuz commented 8 years ago

Another example that I care about.

Let's compare the structure of the section "Getting started" in the nightly and stable books. As you can see, Steve joined 4 chapters into one. Imagine that not all the language versions supported this change yet. If we have common TOC, this means that there is no possibility to open "Installing Rust", "Hello World" and "Hello Cargo" chapters in non-English version of book, because they do not exist in the original TOC anymore.

azerupi commented 8 years ago

Yes I totally agree with you! This would be a big problem. However I am not sure I want to settle with the solution Gitbook proposes either. Maybe we can come up with something better that combines all the advantages and none of the drawbacks? (even if it's a little more complex)

Gitbook uses the "one SUMMARY.md per language" method and to be honest I don't think it is real multilingual support. They essentially have one book per language, with no cross-linking between the different languages except on a landing page...

I think you could already achieve something very similar with mdBook with multiple books and configuring the source and output directories according to what you want. The only difference is that Gitbook makes it just a little bit easier to setup.

defuz commented 8 years ago

My suggestion is to have "one SUMMARY.md per language", but support page-to-page cross-linking between the different languages. The easiest way to do this is to consider that the files with the same name are the same chapters. In 99% this should work. A more complex way to do this is to add some kind of identifier to each file (something like UUID). If the identifiers of the files are identical, we can cross-link them.
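
A minimal sketch of the simpler variant (same file name means same chapter) could look as follows. The function name `cross_links` and its input shape, a list of language codes paired with the chapter files from that language's own SUMMARY.md, are assumptions made for illustration only.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Map each chapter file name to the set of languages that provide it.
/// `books` pairs a language code with the chapter files listed in that
/// language's own SUMMARY.md (hypothetical input shape).
fn cross_links<'a>(books: &[(&'a str, Vec<&'a str>)]) -> BTreeMap<&'a str, BTreeSet<&'a str>> {
    let mut links: BTreeMap<&str, BTreeSet<&str>> = BTreeMap::new();
    for (lang, chapters) in books {
        for chapter in chapters {
            // Files with the same name are treated as the same chapter.
            links.entry(*chapter).or_default().insert(*lang);
        }
    }
    links
}
```

For the per-language layout shown earlier, `hello-world.md` would map to `en`, `fr` and `ru`, while `second-chapter.md` would map only to `en` and `ru`, so the language menu on that page would simply not offer French.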

azerupi commented 8 years ago

Hmm yes that might be a good compromise. At least if the translations don't diverge too much from the original. I will try to think about this a little more and see if I can come up with other ideas.

Thanks for the valuable input! :)

mdinger commented 7 years ago

FWIW, there are tools to handle translations which I didn't see mentioned here yet. For example, crowdin is used (or was when I was involved) over at freecad for document translation of their wiki. It was noteworthy that when an update was made to an English file, the plugin would notify you that the other translations need to be updated for that specific section or they would be out of date. The page linked above actually lists how complete each language translation is and maintains that information.

It is possible a tool like crowdin could just be added to the build process as a plugin which has been notified of which files require translating. Then it will maintain the database itself somewhere and you could tell mdbook where the translated files are located.

A solution like this seems worth the time exploring before spending effort creating a new ground up approach to solve the same problem.


EDIT: Also note they offer free support to open source projects

tyoc213 commented 7 years ago

For your information, what about a single file for the source?

like

[es]
Esto es un ejemplo
[en]
This is an example
[fr]
Ceci est un exemple

[es]
Esto no
[fr]
Ce n'est pas

Well, just saying :) (I mean, for example, for making a book/tutorial with code examples it would be better to have only one copy of the source code but the explanation in different languages.)

And sure, switching between languages could be possible, and if there is no paragraph, show the default language of the document.
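
To make the idea concrete, here is a rough sketch of a parser for such a file, with the fallback to a default language. The exact rules are assumptions, since the format above is only an informal proposal: a `[lang]` line opens the text for that language and a blank line ends the paragraph. The names `parse` and `render` are hypothetical.

```rust
use std::collections::HashMap;

/// One paragraph of the book: language code -> text in that language.
type Paragraph = HashMap<String, String>;

/// Parse the single-file format: a `[lang]` line opens the text for that
/// language and a blank line ends the paragraph (assumed rules).
fn parse(source: &str) -> Vec<Paragraph> {
    let mut paragraphs = Vec::new();
    let mut current = Paragraph::new();
    let mut lang: Option<String> = None;
    for line in source.lines() {
        let trimmed = line.trim();
        if trimmed.is_empty() {
            // Blank line: close the current paragraph, if any.
            if !current.is_empty() {
                paragraphs.push(std::mem::take(&mut current));
            }
            lang = None;
        } else if trimmed.starts_with('[') && trimmed.ends_with(']') {
            // Language marker such as `[es]`.
            lang = Some(trimmed[1..trimmed.len() - 1].to_string());
        } else if let Some(code) = &lang {
            // Text line: append it to the entry of the active language.
            let text = current.entry(code.clone()).or_default();
            if !text.is_empty() {
                text.push('\n');
            }
            text.push_str(trimmed);
        }
    }
    if !current.is_empty() {
        paragraphs.push(current);
    }
    paragraphs
}

/// Text of a paragraph in `lang`, falling back to `default_lang`.
fn render<'a>(paragraph: &'a Paragraph, lang: &str, default_lang: &str) -> Option<&'a str> {
    paragraph
        .get(lang)
        .or_else(|| paragraph.get(default_lang))
        .map(String::as_str)
}
```

One limitation of this sketch is that any text line of the form `[something]` is read as a language marker, so a real format would need an escaping rule.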

sebras commented 7 years ago

How about a src/SUMMARY.md specifying the default chapter structure expected for all languages that are up to date and forcing specialized src/*/SUMMARY.md for the languages that have not yet made similar changes? This puts the penalty on the translations, which have to keep a separate SUMMARY.md around for some time and do work to be up to date. The con is that the person updating the English original does a minor amount of work when, in essence, causing the translation to fork.

So the rule would be: src/*/SUMMARY.md has higher precedence than src/SUMMARY.md

├── book
└── src
    ├── SUMMARY.md
    ├── en
    │   ├── hello-world.md
    │   └── second-and-third-chapter-combined.md
    ├── fr
    │   ├── SUMMARY.md
    │   ├── hello-world.md
    │   ├── second-chapter.md
    │   └── third-chapter.md
    └── ru
        ├── hello-world.md
        └── second-and-third-chapter-combined.md

Consider e.g. the case you mentioned above where the original English book combined several chapters into one (or conversely split one into many). In this case the English author would need to update src/SUMMARY.md; before doing so, the author copies the old src/SUMMARY.md into each translation not yet updated. Hopefully these src/*/SUMMARY.md only stay around for a short period of time until the translations are updated accordingly.

In the example above before the English original text combined its chapters, src/SUMMARY.md is copied into src/fr/SUMMARY.md and src/ru/SUMMARY.md, next the English original text combines src/en/second-chapter.md and src/en/third-chapter.md into src/en/second-and-third-chapter-combined.md and updates src/SUMMARY.md to refer to the new second-and-third-chapter-combined.md (which at this point only exists in en). Some time later perhaps src/ru/second-and-third-chapter-combined.md is created at which point src/ru/SUMMARY.md may be deleted. src/fr might not yet have been updated so its src/fr/SUMMARY.md stays around a bit longer. Once all languages are updated their specialized src/*/SUMMARY.md can all be deleted and all languages can again rely on the default src/SUMMARY.md.
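
The precedence rule itself is small enough to sketch. The function name `resolve_summary` is hypothetical, and the `exists` parameter is injected only so the rule can be tried without touching the file system (for real use one could pass `|p| p.exists()`).

```rust
use std::path::{Path, PathBuf};

/// Pick the SUMMARY.md for `lang`: a language-specific file, when present,
/// takes precedence over the shared default in the source root.
fn resolve_summary(src: &Path, lang: &str, exists: impl Fn(&Path) -> bool) -> PathBuf {
    let specialized = src.join(lang).join("SUMMARY.md");
    if exists(specialized.as_path()) {
        specialized
    } else {
        // No override for this language: use the default structure.
        src.join("SUMMARY.md")
    }
}
```

In the tree above, `fr` would resolve to src/fr/SUMMARY.md, while `en` and `ru` would fall back to src/SUMMARY.md.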

Do you think an approach like this is feasible and desirable?

I'm eager to do a translation of the Rust book, so I'd like for mdbook to resolve this bug and support translations, hence I'm trying to help you make progress. :)

azerupi commented 7 years ago

Thank you for your input!

Do you think an approach like this is feasible and desirable?

Unfortunately, I don't think this will work well in practice because there is a lot of overhead for the author of the original text. Every time the original texts diverge, the burden is on the author to copy over the old summary to the translations before making a change. If he forgets, things will break, so this seems very error prone.

I am more in favour of having one summary per language and cross-linking files with the same name. This approach is, in my opinion, simpler to understand and doesn't require any extra work when the original text and the translations diverge.

I hope to make progress on this issue in the "near" future, we are slowly reworking parts of the internals to make it possible.

sebras commented 7 years ago

I am more in favour of having one summary per language, cross-link files with the same name.

If there is one SUMMARY.md per language, what forces the files containing chapters to be named the same way in every language? I do agree about this design being less work for the original author of course. :)

azerupi commented 7 years ago

If there is one SUMMARY.md per language, what forces the files containing chapters to be named the same way in every language?

Nothing, it would be a convention. A translation would keep the same file structure and just modify the content of the files. If the translations diverge, you lose cross-linking but everything still works.

I am open to alternative ideas, but I think we should go with something that has minimal friction. :)

sebras commented 7 years ago

If the translations diverge, you lose cross-linking but everything still works.

That's a good point. Maybe mdBook can warn if this is the case?
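
Such a warning could be driven by a simple comparison of chapter lists. The sketch below is an assumption about how that might look, not existing mdBook behaviour; `missing_chapters` is a hypothetical name.

```rust
use std::collections::BTreeSet;

/// Chapters of the original book that have no file of the same name in the
/// translation, i.e. pages on which cross-linking would be lost.
fn missing_chapters<'a>(original: &[&'a str], translation: &[&str]) -> Vec<&'a str> {
    let translated: BTreeSet<&str> = translation.iter().copied().collect();
    original
        .iter()
        .copied()
        .filter(|chapter| !translated.contains(*chapter))
        .collect()
}
```

A build could then print one warning line per returned chapter and carry on, which keeps the "everything still works" property.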

I am open to alternative ideas, but I think we should go with something that has minimal friction. :)

Yes, absolutely. I was worried there was no progress because of a lack of design discussion, hence my suggestion to try to help you decide. I don't know the mdBook code base (or Rust) yet. :)

sebasmagri commented 7 years ago

Hi!

I'm probably going to reiterate on some already discussed topics but I'd still like to describe this case hoping it's useful to define the best mechanism for book translations in mdbook.

So I've been trying to define a process we could recommend for a localisation team to tackle tasks such as The Rust Programming Language book translation.

One of the things is how to integrate translated contents with the build output. For this specific case, and after having asked the docs team for feedback, it should be easier to handle all of the book contents independently in its own directory, including SUMMARY.md. This would allow the book translators to work in a completely independent way by forking the book repository and probably integrating it back as git submodules in the original repo. There would not be any kind of enforcement on the document's internal structure nor on the phrase-level content of translations.

Another thing is how to link translated content in the output. It could be linked on a per document fashion by mapping translations using the exact file name, in which case we'd have folder structure enforcement, or it could be linked only on the front page, in which case translation would have complete freedom on the folder structure, and even the Tree/Table of Contents. In the latter case, the contents tree guidelines could be defined by maintainers but not enforced at all by the tooling.

These two features or mechanisms, however, might not work for people wanting to use tools such as crowdin, transifex or weblate to manage their translations, which are probably more adequate for software translation than for book translations. To support this case mdbook might need to generate a paragraph level mapping of translations and probably support output to any standard internationalization format such as gettext's PO files or L20N.

I'm absolutely willing to dedicate some time to this feature since this could be one of the primary goals of the localisation team. So of course I'm completely open to any kind of feedback and collaboration so we can lay out a plan to implement this.

Regards,

azerupi commented 7 years ago

Hi @sebasmagri

Thank you for the input! I would love to work together with the concerned parties to end up with a strong design that is both useful for simple and more complex requirements.

Currently, the design we are considering is the following:

To make a book multi-lingual, you would have to add some information to the configuration file:

[languages]
en = { name = "English", default = true }
fr = { name = "Français" }
# OR alternatively
# [languages.en]
# name = "English"
# default = true
#
# [languages.fr]
# name = "Français"

For the example above, we would expect to have sub-folders in the src directory, matching the keycodes en and fr used in the config, containing the source files for each language.

We could imagine having an optional source = "path" key in the language tables for more flexibility. This would then allow the submodule scenario you described.
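
As a sketch of how this configuration might look once loaded, the struct below mirrors the keys proposed above. Everything here is an assumption for illustration: the type and function names are hypothetical, the fallback to the first listed language when none is marked as default is my own choice, and TOML parsing is left out entirely.

```rust
use std::path::{Path, PathBuf};

/// In-memory form of one entry of the proposed `[languages]` table.
struct Language {
    /// Key of the entry, e.g. `en` or `fr`.
    code: String,
    /// Display name, e.g. "Français".
    name: String,
    /// Whether the entry has `default = true`.
    default: bool,
    /// Optional `source = "path"` override.
    source: Option<PathBuf>,
}

impl Language {
    /// Directory holding this language's Markdown files: the explicit
    /// `source` if given, otherwise `<src_root>/<code>`.
    fn source_dir(&self, src_root: &Path) -> PathBuf {
        match &self.source {
            Some(path) => path.clone(),
            None => src_root.join(&self.code),
        }
    }
}

/// The language marked as default, or the first one listed (assumed rule).
fn default_language(languages: &[Language]) -> Option<&Language> {
    languages
        .iter()
        .find(|language| language.default)
        .or_else(|| languages.first())
}
```

With the example configuration, English would resolve to src/en, and a French entry carrying a `source` key would point wherever the submodule is checked out.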

We also think it is better to have a SUMMARY.md file for each translation. This allows translations to diverge without breaking the build.

For the HTML output, we consider cross-linking chapters from different languages based on the file structure. An English chapter called src/en/chapter_2/lifetimes-in-a-nutshell.md would be mapped to all the same chapters in different languages src/*/chapter_2/lifetimes-in-a-nutshell.md. This has the advantage of being simple and degrading gracefully when translations diverge. So if authors want cross-language linking they would have to keep the same structure, but if they don't or the structure diverges, the books will still build fine with non-matching chapters pointing to the index when changing languages.
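
The graceful fallback could be sketched like this. The names `SwitchTarget` and `switch_target` are hypothetical, and using `<lang>/index.html` as the index location is an assumption; the set of existing files would come from scanning each language's source directory.

```rust
use std::collections::HashSet;

/// Where the language switcher should send the reader (illustrative type).
#[derive(Debug, PartialEq)]
enum SwitchTarget {
    /// The same chapter exists in the target language.
    Chapter(String),
    /// No matching chapter: fall back to the translation's index page.
    Index(String),
}

/// `chapter` is the path relative to the language directory, for example
/// `chapter_2/lifetimes-in-a-nutshell.md`; `existing` holds every known
/// `<lang>/<relative path>` in the book.
fn switch_target(chapter: &str, target_lang: &str, existing: &HashSet<String>) -> SwitchTarget {
    let candidate = format!("{}/{}", target_lang, chapter);
    if existing.contains(&candidate) {
        SwitchTarget::Chapter(candidate)
    } else {
        SwitchTarget::Index(format!("{}/index.html", target_lang))
    }
}
```

The point of the design is that a diverged translation never produces a broken link: the worst case is landing on the index of the other language.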

To support this case mdbook might need to generate a paragraph level mapping of translations and probably support output to any standard internationalization format such as gettext's PO files or L20N.

This seems very complex? I am not very familiar with this issue but it seems to me that it would either require a lot of manual annotations for correct paragraph mapping or some heuristics. I would think this is (currently) out of scope for mdBook. Let's first focus on having basic but strong multi-lingual facilities and eventually expand from there. :)

Does that correspond to the requirements of the localisation team? If there is anything I missed or there are additional requirements that haven't been considered, please feel free to post 😉

I'm absolutely willing to dedicate some time to this feature since this could be one of the primary goals of the localisation team.

That would be wonderful, I am particularly interested in the perspective of the Rust project on this issue because I think they will be the ones using this feature the most.

cauebs commented 6 years ago

Just to resurface what @mattico said at #687

This should be fairly straightforward:

  1. Add a config option to set the default language.
  2. Determine & document the folder structure used for the translations.
  3. Change index generation to ignore translations.
  4. Set the lang template parameter in hbs_renderer based on the page path + default config.
  5. Add a menu to the html template + stylus.
  6. In hbs_renderer, look for different versions of the current page and add them to the template parameter.
  7. Set the language used to generate the search index.
  8. Add a cargo feature to disable this functionality, since rustc can't have the search language support due to licensing issues.

Edit: Links might also need to be adjusted so they point at the page for the current language. This might not be necessary if the correct relative links are used, I'd have to check.

MrFaul commented 5 years ago

Wouldn't it be much easier to simply abuse the branches for translations and a small bot to throw a "needs review" for each language if the English version got updated? This isn't meant to be a permanent solution, more as a stopgap for now since this issue blocks all enthusiasm for any pending translations.

IMO now incoming:

huangjj27 commented 5 years ago

How is the progress on this feature? Is it usable now?

himself65 commented 5 years ago

+1

hugmouse commented 4 years ago

Bump.

bells17 commented 4 years ago

+1

trosel commented 4 years ago

Is anyone working on this? I know there was a PR a long time ago that looked great, but it was closed.

laubblaeser commented 4 years ago

More information on the state of this feature would be welcome. :)

netlander commented 4 years ago

Any traction on this feature?

XAMPPRocky commented 4 years ago

If anyone would like to try a version of mdbook with localisation support, you can try my fork in #1201

Ruin0x11 commented 4 years ago

I tried addressing this issue in https://github.com/rust-lang/mdBook/pull/1306. I would appreciate feedback.

I'm still wondering if you should be able to build the book with all languages bundled in at once, and have a drop-down for switching the language of the current page.

mgeisler commented 2 years ago

FWIW, there are tools to handle translations which I didn't see mentioned here yet. For example, crowdin is used (or was when I was involved) over at freecad for document translation of their wiki. It was noteworthy that when an update was made to an English file, the plugin would notify you that the other translations need to be updated for that specific section or they would be out of date. The page linked above actually lists how complete each language translation is and maintains that information.

Thanks for bringing this up, @mdinger. This is a hugely important point and something @sebasmagri also touched upon:

To support this case mdbook might need to generate a paragraph level mapping of translations and probably support output to any standard internationalization format such as gettext's PO files or L20N.

[...] I would think this is (currently) out of scope for mdBook. Let's first focus on having basic but strong multi-lingual facilities and eventually expand from there. :)

Hey @azerupi, I've worked with translations in the past as part of the Mercurial project. We used the standard Gettext format for this: we had the command line tool itself translated, including the help texts. I used very similar infrastructure to translate a Mercurial guide I wrote.

My experience is that it's the tooling for the translators that is important. That is, it's not helpful to simply create multiple independent Markdown files in separate directories. The result of that is that the translations drift apart: an update to the original text may or may not make it into the translations and there is no system in place to track this.

Translating the software is easy enough: you integrate with Gettext and use its tools to extract the strings to .po files. You let translators translate these; there is tooling for this, such as https://poedit.net/ and various online tools. These PO editors can show the translators what changed since they touched the files last, which is invaluable when updating translations.

What we did to translate bigger pieces of text was to simply split it into paragraphs. So the fr.po file with the French translation found here looks like this:

#: src/index.txt:12
msgid ""
"====================\n"
"Mercurial Kick Start\n"
"===================="
msgstr ""
"====================\n"
"Tutoriels Mercurial\n"
"===================="

#: src/index.txt:19
msgid ""
"Welcome to the `aragost Trifork`_ Mercurial Kick Start. We have\n"
"prepared several different sets of exercises for you:"
msgstr ""
"Bienvenue dans les tutoriels Mercurial de `aragost Trifork`_. Nous avons "
"préparé des exercices pour vous :"

#: src/index.txt:22
msgid ""
"`Basic Mercurial`__:\n"
"  Install Mercurial and get started right away. We will show you the\n"
"  basic commands and show you how to work with others as a team."
msgstr ""
"`Premiers pas avec Mercurial`__:\n"
" Installer et faire ses premiers pas avec Mercurial. Nous vous montrerons "
"les premières commandes et comment travailler avec les autres membres d'une "
"équipe."

The source text for this was formatted in reStructuredText (a Markdown-like format popular with Python projects back then):

====================
Mercurial Kick Start
====================

.. image:: mercurial.png
   :align: right

Welcome to the `aragost Trifork`_ Mercurial Kick Start. We have
prepared several different sets of exercises for you:

`Basic Mercurial`__:
  Install Mercurial and get started right away. We will show you the
  basic commands and show you how to work with others as a team.

This would be the equivalent of the source Markdown file in mdBook.

This seems very complex? I am not very familiar with this issue but it seems to me that it would either require a lot of manual annotations for correct paragraph mapping or some heuristics.

As you can see, there are no markers in the source text: they stay as they are. The Gettext tooling is used to extract the text, and I wrote some extra Python scripts to split the files into individual paragraphs.
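The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the Python scripts mentioned in the comment: the function names `split_paragraphs`, `po_escape` and `to_pot` are hypothetical, paragraphs are assumed to be separated by blank lines, and the PO header entry that real Gettext tools emit is omitted.

```python
def split_paragraphs(text):
    """Return (line_number, paragraph) pairs for blank-line separated paragraphs."""
    paragraphs = []
    current = []
    start = 0
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            if not current:
                start = lineno  # remember where this paragraph begins
            current.append(line)
        elif current:
            paragraphs.append((start, "\n".join(current)))
            current = []
    if current:
        paragraphs.append((start, "\n".join(current)))
    return paragraphs


def po_escape(s):
    """Escape backslashes, double quotes and newlines for a PO string."""
    return s.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_pot(path, text):
    """Render one untranslated PO entry per paragraph of the source text."""
    entries = []
    for lineno, para in split_paragraphs(text):
        entries.append(
            f"#: {path}:{lineno}\n"
            f'msgid "{po_escape(para)}"\n'
            f'msgstr ""\n'
        )
    return "\n".join(entries)
```

Note that the source file needs no markers: the `#: path:line` reference is computed from the paragraph's position, and the paragraph text itself serves as the lookup key.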

The mapping is done based on the content: if the source text is updated, the translation gradually becomes out of date. This is why I split the text into paragraphs: when I rephrased something in my guide, that individual paragraph would turn into English again until someone contributes an updated French translation. The same approach is typically used for software: if a menu item hasn't been translated, you show the original text instead.
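A rough sketch of this content-based mapping, assuming the translations are available as a plain dictionary from source paragraph to translated paragraph (the `reconstruct` function here is a hypothetical illustration, not the actual tool):

```python
import re


def reconstruct(text, catalog):
    """Replace each paragraph with its translation, if the catalog has one.

    Paragraphs whose exact text is not a key in `catalog` (for example
    because the source was rephrased) are kept in the original language.
    """
    # Split on blank lines; the capturing group keeps the separators so the
    # original spacing survives the round trip.
    parts = re.split(r"(\n\s*\n)", text)
    out = []
    for part in parts:
        body = part.strip("\n")
        translated = catalog.get(body)
        if body and translated:
            part = part.replace(body, translated, 1)
        out.append(part)
    return "".join(out)


catalog = {"Hello world.": "Bonjour le monde."}

# The first paragraph matches the catalog and is translated.
print(reconstruct("Hello world.\n\nSecond paragraph.\n", catalog))

# After rephrasing the source, the lookup misses and the English text is kept.
print(reconstruct("Hello, world!\n\nSecond paragraph.\n", catalog))
```

Because the lookup key is the paragraph text, no bookkeeping is needed to detect stale translations: an edited paragraph simply stops matching.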

In short: yes, this kind of infrastructure comes with a startup cost — and it's a ton of work for the translators to keep a translation up to date. However, with this approach they actually have a fighting chance: the tooling will flag outdated paragraphs and the editors can do fuzzy matching to find the previous translation. Also, note how this infrastructure is trivial for the authors of the source text: they just write it like normal.
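The fuzzy matching idea can be illustrated with Python's standard `difflib` module. This is only a sketch of the concept: Gettext's `msgmerge` uses its own matching and marks such entries with a `#, fuzzy` flag, and the `suggest_previous` helper and the 0.6 cutoff are assumptions made for this example.

```python
import difflib


def suggest_previous(new_msgid, old_catalog, cutoff=0.6):
    """Find the old source string closest to `new_msgid`.

    Returns (old_msgid, old_translation), or None if nothing is similar
    enough. A translator would review the suggestion, not accept it blindly.
    """
    matches = difflib.get_close_matches(
        new_msgid, list(old_catalog), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    return matches[0], old_catalog[matches[0]]


# Old catalog entry taken from the fr.po excerpt above.
old_catalog = {"Mercurial Kick Start": "Tutoriels Mercurial"}

# The author slightly rephrased the heading; the old translation is offered
# as a starting point for the translator.
print(suggest_previous("Mercurial Kick-Start", old_catalog))
```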

I would recommend implementing such a system for mdBook instead of letting people write what is essentially multiple books split across different directories.

mgeisler commented 2 years ago

I would recommend implementing such a system for mdBook instead of letting people write what is essentially multiple books split across different directories.

Just a small update: I've started working on this and have written two small tools:

With those two tools, you can translate the Markdown files which make up your book.

While writing this, I realized that the translations can be done "outside" of mdBook: the input is a source Markdown file and the output is a translated Markdown file. All that's left to do is to glue the translation together with something like a language-picker on each page. @Ruin0x11, this means that translating with Gettext and these scripts is fully compatible with the approach in #1306!

I'll put up the scripts when I've tested them a bit more.

sebras commented 2 years ago

@mgeisler This sounds very interesting. Mainly because it converts the process of maintaining a translation into the traditional approach. :) So then what would be checked into the git repo? The original English language Markdown files and the translated .po files? I'm guessing that whenever the original book is updated, the corresponding .po files either have existing translation strings marked fuzzy or have entirely new strings added, and then translators would find those half-finished .po files and send pull requests to the golden git repo. Who do you envision running extract and reconstruct? In other projects where I am a participating translator, tools like that are normally run by the maintainer, so that's what I'd assume.

mgeisler commented 2 years ago

Mainly because it converts the process of maintaining a translation into the traditional approach. :)

Yes, precisely! Glad you like it.

So then what would be checked into the git repo? The original English language Markdown files and the translated .po files?

Exactly, those would be the source files, the rest are derived and can be left out. The flow would be something like

$ extract src/*.md --output-file messages.pot      # extract all strings from the source Markdown files
$ msgmerge --update po/xx.po messages.pot          # merge new and changed strings into the XX translation
$ msgfmt po/xx.po --output-file xx.mo              # convert xx.po into a xx.mo file
$ reconstruct src/*.md xx.mo --output-dir src/xx/  # write translated Markdown files to src/xx/

Honestly, we probably don't need to compile the .po files into a .mo file — the whole workflow is offline so there's no real benefit in a precompiled catalog.
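To illustrate skipping the .mo step, here is a minimal sketch that reads simple `msgid`/`msgstr` pairs straight from .po text. It is deliberately incomplete: `parse_po` is a hypothetical helper, it assumes the PO strings only use escapes that Python string literals share (such as `\n`, `\"` and `\\`), and it ignores plural forms, message contexts and `fuzzy` flags. A real implementation would more likely use an existing PO parsing library.

```python
import ast


def parse_po(text):
    """Parse simple msgid/msgstr pairs from PO text into a dict.

    Entries with an empty msgid (the PO header) or an empty msgstr
    (not yet translated) are left out of the result.
    """
    catalog = {}
    msgid, msgstr, field = [], [], None

    def flush():
        key, value = "".join(msgid), "".join(msgstr)
        if key and value:
            catalog[key] = value

    for line in text.splitlines():
        line = line.strip()
        if line.startswith("msgid "):
            flush()  # a new entry starts, so store the previous one
            msgid, msgstr, field = [ast.literal_eval(line[6:])], [], "msgid"
        elif line.startswith("msgstr "):
            msgstr, field = [ast.literal_eval(line[7:])], "msgstr"
        elif line.startswith('"') and field:
            # Continuation line: append to whichever field is being read.
            target = msgid if field == "msgid" else msgstr
            target.append(ast.literal_eval(line))
    flush()
    return catalog
```

The resulting dictionary maps each source paragraph to its translation, which is all an offline workflow needs when writing the translated Markdown files.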

The translated Markdown files are derived from the .po and the source Markdown files. As such, they don't have to be checked in. However, as the source Markdown files change, the translations immediately become outdated. Checking in the translated Markdown files every time the translation is complete would allow you to still deploy all languages for a book and know that each language is the last complete version.

I'm guessing that whenever the original book is updated, the corresponding .po files either have existing translation strings marked fuzzy or have entirely new strings added, and then translators would find those half-finished .po files and send pull requests to the golden git repo. Who do you envision running extract and reconstruct? In other projects where I am a participating translator, tools like that are normally run by the maintainer, so that's what I'd assume.

I think it could be either the maintainer (if you want to commit a messages.pot file to the repository) or the translators themselves. In the Mercurial project we had a small Makefile which made it easy to do make update-po xx to update the xx.po file with the latest strings. I would guess similar helpers could be built for mdBook (either as a script or as a new mdbook command).

sebras commented 2 years ago

Mainly because it converts the process of maintaining a translation into the traditional approach. :) Yes, precisely! Glad you like it.

Honestly, we probably don't need to compile the .po files into a .mo file — the whole workflow is offline so there's no real benefit in a precompiled catalog.

I agree, since even the software that I contribute translations for does not appear to deal with .mo files until it is built and installed on a system.

However, as the source Markdown files change, the translations immediately become outdated. Checking in the translated Markdown files every time the translation is complete would allow you to still deploy all languages for a book and know that each language is the last complete version.

Right, so then you'd have a few paragraphs of the translated language, then a section in English and then trailing sections in the translated language? Until the translators do their magic, that is. Sounds like a good approach, and yeah, if the intent is to parse the Markdown files directly from the git repo onto a webserver, perhaps it would be beneficial to actually check in the translated Markdown.

I think it could be either the maintainer (if you want to commit a messages.pot file to the repository) or the translators themselves. In the Mercurial project we had a small Makefile which made it easy to do make update-po xx to update the xx.po file with the latest strings. I would guess similar helpers could be built for mdBook (either as a script or as a new mdbook command).

If the maintainer updates the .po files whenever strings are changed/added then they'd know that all translations are up-to-date, albeit maybe not perfectly translated. And then the maintainers just poke the translators and wait.

This all makes sense to me, but I'm not affiliated with the project. Have you heard anything from them? Are they as eager to get extract and reconstruct as I am? If you direct me to your scripts I can take them for a spin and see if they work for me (too). :)

mgeisler commented 2 years ago

This all makes sense to me, but I'm not affiliated with the project. Have you heard anything from them? Are they as eager to get extract and reconstruct as I am?

Not sure :smile: I simply implemented the infrastructure which I expect to need myself in a few months. I hope it'll be useful for the project maintainers too since it can help solve a very long-standing issue for the community.

If you direct me to your scripts I can take them for a spin and see if they work for me (too). :)

I've put up a PR which adds new commands to mdbook to drive the translation process: #1864. Please give it a spin and send me feedback there! I might be slow to respond since I'm traveling to the US next week for RustConf, but I'll get to it eventually.

lukehinds commented 2 years ago

How close is this to shipping? I am currently assessing which docs generator I can use for an OSS project and really like mdbook, but I will need translations available. I know this is a volunteer effort (and thank you), but this issue has been open for seven years now, so I'm hoping it's close, right? :)

aellwein commented 2 years ago

Please, please make it happen 👏🏻

mgeisler commented 2 years ago

@lukehinds and @aellwein, could you please try the code I put up in https://github.com/rust-lang/mdBook/pull/1864 ? Please give me feedback on the PR, that way I can learn if that approach makes sense to more people than just me :-)

fzyzcjy commented 2 years ago

Hi, are there any updates? Thanks

mgeisler commented 2 years ago

Hi, are there any updates? Thanks

Hi @fzyzcjy, you could try out the code I put up in https://github.com/rust-lang/mdBook/pull/1864 and let me know how it works for you.

Ruin0x11 commented 1 year ago

To @ehuss or anyone else who maintains mdBook, I would like to know what I can do to help add support for this feature. There is some code in #1864 that adds support for translator tooling to my original PR in #1306. However, the last time I rebased my original code I didn't seem to get much feedback. I would like to make sure that any work I contribute this time won't be in vain. Is there a roadmap for having some iteration of these features looked at in the near future?