Make MarkdownFields translatable

jeriox commented 2 years ago

Currently, when using wagtail-localize, a MarkdownField cannot be translated in an easy way, as the whole content of the field is put into one translation segment. For a long page with a markdown body, this is not feasible. I'd like to have the MarkdownField split up in several translation segments (like with StreamFields), so I can translate them separately. I wrote a hacky solution for that some time ago, but it breaks with the current version. I'd be happy if we could find a way to support that properly.

My old code for reference:

import html2text
from django.db.models import TextField
from wagtail_localize.segments import (
    OverridableSegmentValue,
    StringSegmentValue,
    TemplateSegmentValue,
)
from wagtail_localize.segments.extract import quote_path_component
from wagtail_localize.segments.ingest import organise_template_segments
from wagtail_localize.strings import extract_strings, restore_strings

from wagtailmarkdown.utils import render_markdown
from wagtailmarkdown.widgets import MarkdownTextarea

class MarkdownField(TextField):
    def formfield(self, **kwargs):
        defaults = {"widget": MarkdownTextarea}
        defaults.update(kwargs)
        return super().formfield(**defaults)

    def get_translatable_segments(self, value):
        # Render the Markdown to HTML so wagtail-localize's HTML string extraction can be reused.
        template, strings = extract_strings(render_markdown(value))

        # Find all unique href values
        hrefs = set()
        for string, attrs in strings:
            for tag_attrs in attrs.values():
                if "href" in tag_attrs:
                    hrefs.add(tag_attrs["href"])

        return (
            [TemplateSegmentValue("", "html", template, len(strings))]
            + [StringSegmentValue("", string, attrs=attrs) for string, attrs in strings]
            + [OverridableSegmentValue(quote_path_component(href), href) for href in sorted(hrefs)]
        )

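    def restore_translated_segments(self, value, field_segments):
        # The format is unused here; the underscore also avoids shadowing the built-in format().
        _format, template, strings = organise_template_segments(field_segments)
        # Put the translated strings back into the HTML template, then convert it to Markdown.
        return html2text.html2text(restore_strings(template, strings))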

zerolab commented 2 years ago

Hey @jeriox,

thank you for sharing this. Had a few requests for making this localize-compatible, so the code snippet is very handy!

jeriox commented 2 years ago

I got it working again with the code above, so we will use that for now. It still feels a bit hacky to me, so we'd be happy if there was a better alternative built in :)

zerolab commented 2 years ago

This would need a bit of thinking. e.g.

I'd like to have the MarkdownField split up in several translation segments (like with StreamFields), so I can translate them separately.

Where do you draw the line and split things? is it at every link? every paragraph? every heading? given we can allow raw html in there too, how should we handle that?

jeriox commented 2 years ago

This would need a bit of thinking. e.g.

I'd like to have the MarkdownField split up in several translation segments (like with StreamFields), so I can translate them separately.

Where do you draw the line and split things? is it at every link? every paragraph? every heading? given we can allow raw html in there too, how should we handle that?

Currently, my approach works as follows: as a lot of thought has already gone into how to split up StreamFields, I tried to reuse that as much as possible. Therefore, I render the markdown to HTML and use the existing extract_strings() function. This also ensures that links are treated appropriately. For the other direction, using html2text works quite well. I didn't test with raw HTML though. I think that every paragraph and every heading is a good split, as it ensures that a paragraph doesn't need to be re-translated if its text didn't change.
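To make the re-translation argument concrete, here is a minimal sketch in plain Python. It is not the code from this thread: it splits the Markdown source at blank lines instead of rendering to HTML, and `needs_translation` and the `memory` dict are hypothetical names standing in for whatever store maps source strings to translations.

```python
def needs_translation(markdown_text, memory):
    """Return the blank-line-separated blocks that have no stored translation."""
    blocks = [block.strip() for block in markdown_text.split("\n\n")]
    return [block for block in blocks if block and block not in memory]


page_v1 = "# About\n\nFirst paragraph.\n\nSecond paragraph."
page_v2 = "# About\n\nFirst paragraph.\n\nSecond paragraph, now edited."

# Assumption: every block of the first version has already been translated.
memory = {block: "[translated] " + block for block in page_v1.split("\n\n")}

# After the edit, only the changed paragraph is left to translate.
todo = needs_translation(page_v2, memory)
```

With the whole field as a single segment, the same one-sentence edit would invalidate the translation of the entire body.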

jeriox commented 1 week ago

Hey @zerolab, did you have a chance to look at this any further?

zerolab commented 1 week ago

@jeriox to be honest this completely flew under my radar 🙈

I think your version is better than what we currently have (i.e. nothing). Do you have the capacity to submit a PR? We'd want the logic in get_translatable_segments and restore_translated_segments to live in its own module (say wagtail_localize.py) and be conditionally loaded if localize is installed.
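One way the conditional loading could look, as a rough sketch only: check whether the package can be found before importing anything from it. `is_installed` and `LOCALIZE_SUPPORT` are hypothetical names, and nothing here reflects how wagtail-markdown actually wires this up.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Hypothetical wiring: only pull in the translation helpers when
# wagtail-localize is importable in the current environment.
if is_installed("wagtail_localize"):
    LOCALIZE_SUPPORT = True  # the optional helper module would be imported here
else:
    LOCALIZE_SUPPORT = False
```

This only tests importability. Whether the app is also listed in INSTALLED_APPS is a separate question, which Django's `apps.is_installed()` can answer once the app registry is ready.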

jeriox commented 1 week ago

I think your version is better than what we currently have (i.e. nothing).

While this is true, I'm not sure if it is good enough to include it in the library. We have been using this solution in our project for two years now, and there are several problems:

- references to other headings on the same page (e.g. #about) get lost during translation
- as the content gets broken down into very small parts (e.g. single entries in a list), we struggle a lot with https://github.com/wagtail/wagtail-localize/issues/624
- images suffer from https://github.com/wagtail/wagtail-localize/discussions/378
- inline formatting sometimes produces additional spaces during translation

If those are okay for you, I can open a PR. We'd like to do the splitting on our own instead of relying on converting to HTML, but we haven't had the capacity to do so yet.
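For reference, splitting on the Markdown source directly could start from something like the sketch below. It is an untested idea rather than what we run: `split_markdown` and `join_markdown` are hypothetical helpers, blocks are plain strings instead of wagtail-localize segment values, and only blank lines and fenced code blocks are handled.

```python
FENCE = "`" * 3  # Markdown code fence marker


def split_markdown(text):
    """Split Markdown source at blank lines, keeping fenced code blocks whole."""
    blocks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if line.strip() == "" and not in_fence:
            # A blank line outside a fence ends the current block.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def join_markdown(blocks):
    """Reassemble (translated) blocks into one Markdown document."""
    return "\n\n".join(blocks)


source = "See [About](#about).\n\n" + FENCE + "\na = 1\n\nb = 2\n" + FENCE + "\n\nLast."
blocks = split_markdown(source)
```

Because nothing is converted to HTML and back, a link such as `[About](#about)` stays in its block as written. The trade-off is that translators would see raw Markdown syntax, and the round trip is only exact when blocks are separated by a single blank line.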

zerolab commented 1 week ago

Thank you for the additional context on real-life usage. Absolutely fantastic to know.

What if we make the get_translatable_segments bit pluggable (i.e. you can change it to your own project's method that does what you want it to do)?

The images question is outside of wagtail-markdown's purview, I'm afraid. We definitely need to solve this more centrally.

jeriox commented 1 week ago

I'm not sure that it needs to be specifically pluggable, as you could just subclass the provided MarkdownField and override get_translatable_segments if you are not happy with it. This is the same approach that we are currently using to implement it in the first place.

So I guess we could just include my current solution as the default, especially if images are out of scope anyway and the issue with duplicate segments maybe gets fixed centrally as well. The other things are just small issues IMO and could just be mentioned in the docs.
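The subclassing route would look roughly like this. It is a sketch under assumptions: the `MarkdownField` below is a stand-in so the example runs on its own (the real field subclasses Django's TextField), `HeadingSplitMarkdownField` is a hypothetical project-side class, and segments are plain strings rather than wagtail-localize segment values.

```python
import re


class MarkdownField:
    """Stand-in for the library field, reduced to the one method of interest."""

    def get_translatable_segments(self, value):
        # Placeholder default: the whole value as a single segment.
        return [value]


class HeadingSplitMarkdownField(MarkdownField):
    """Project-specific field that overrides the default splitting."""

    def get_translatable_segments(self, value):
        # Start a new segment at every ATX heading line ("# ...", "## ...").
        parts = re.split(r"(?m)^(?=#{1,6} )", value)
        return [part.strip() for part in parts if part.strip()]


body = "# A\n\ntext\n\n## B\n\nmore"
segments = HeadingSplitMarkdownField().get_translatable_segments(body)
```

Splitting only at headings gives coarser segments than one per list entry. A real override would also have to skip lines starting with `#` inside fenced code blocks, and would need a matching restore_translated_segments.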

torchbox / wagtail-markdown

Make MarkdownFields translatable #102