mdn / yari

The platform code behind MDN Web Docs
Mozilla Public License 2.0

Archive localized content #1076

Closed peterbe closed 3 years ago

peterbe commented 4 years ago

(this issue isn't firm as an actual action item)

Here's a plan for archiving all localized content:

  1. Yari's importer (aka. "dumper") can split archived vs. non-archived and it does this based on the slug prefixes. We can extend this such that documents whose locale is != en-US will also be archived (*).
  2. We render out localized content with Yari, using the en-US chrome but put a special banner on archived content so it's clear that what you're reading is not actively maintained. But at least it's not a 404.
  3. (in the future) If we find that we want to actively maintain, say, Chinese, and we have found someone we can trust who can merge content PRs on Chinese content, we can copy it from the archive and re-instate it. But the chrome will remain in English.
  4. (in the future) If there is newfound time/money/passion for localized content being actively maintained, we can make the chrome localizable and use Mozilla Pontoon like we did with Kuma.
  5. We will not link to the translated versions of an en-US document (the other-languages drop-down). But each translated document (archived) will link back to the English version.

(*) Reminder; archived content goes into a different git repository. It's the rendered out HTML only (the document meat, not the page). But the original source Kuma-HTML is saved. Its URLs are not included in search or Sitemaps.
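
A minimal sketch of the split described in step 1, written in Python purely for illustration (it is not Yari's importer code). The prefix list and the function name `is_archived` are hypothetical; the only behavior taken from the plan is "archive by slug prefix, and additionally archive anything whose locale is not en-US".

```python
# Hypothetical slug prefixes, for illustration only; the real list
# lives in Yari's importer and is not reproduced here.
ARCHIVE_SLUG_PREFIXES = ("Archive/", "Mozilla/Firefox_OS")


def is_archived(locale: str, slug: str) -> bool:
    """Return True if a document should be dumped to the archive repository."""
    # Proposed extension from the plan: every non-en-US document is archived.
    if locale.lower() != "en-us":
        return True
    # Behavior described as already in place: archive based on slug prefix.
    return slug.startswith(ARCHIVE_SLUG_PREFIXES)
```

With this rule, `is_archived("ja", "Web/HTML")` is true regardless of slug, while en-US documents are only archived when their slug matches a prefix.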

peterbe commented 4 years ago

Important notes and stuff:

peterbe commented 4 years ago

Advantages of this plan:

peterbe commented 4 years ago

Disadvantages of this plan:

peterbe commented 4 years ago

Hybrid solution:

Since the intent with Yari is that traffic comes in to CloudFront and Lambda@Edge conditionally checks whether a page is statically built in S3, falling back to Django otherwise, we could do the "jamstack thing" for all English documents and keep serving the non-English documents with the existing Kuma. I.e. put English Yari as a mask/layer on top of good old Kuma (plus Wiki) so that English is served from Yari and Japanese is served from Django.
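
The routing decision could be sketched roughly like this. It is a plain Python function, not actual Lambda@Edge handler code, and the names `choose_origin` and `static_keys` as well as the `<path>/index.html` key layout are assumptions made for the example.

```python
from typing import Set


def choose_origin(path: str, static_keys: Set[str]) -> str:
    """Return "s3" if an English static build exists for the path, else "kuma"."""
    trimmed = path.strip("/")
    # The first path segment is taken to be the locale, e.g. "en-US" or "ja".
    locale = trimmed.split("/")[0]
    # Assumed layout: each statically built page is stored as <path>/index.html.
    key = trimmed + "/index.html"
    if locale.lower() == "en-us" and key in static_keys:
        return "s3"
    # Everything else (other locales, pages not built yet) falls back to Django.
    return "kuma"
```

In a real deployment the set lookup would be replaced by whatever existence check the edge function can afford; the point is only that the locale plus "is it built?" decides which universe serves the request.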

This approach is the least disruptive but it's not without complications and headaches. For example, the localized content thrives on the English document as the "parent" document, and the Doc_status tooling will eventually break. We also won't be able to connect translated content to the English one and vice versa since they will live in different universes.

peterbe commented 4 years ago

Think about the alternative (to not archiving the lot):

The original plan for Yari was that we'd do all languages just like the Wiki. Instead of 11k documents, we'd have to have about 30-40k documents in git. And we were going to make the chrome localized (@fluent/react + Mozilla Pontoon). And instead of the Doc_status dashboards in the Wiki we were going to write some brand new tools to help translators direct their attention to documents that need translations or updates. We also had plans for automation that can quantitatively figure out if a translation is "bad" (i.e. compare code examples, non-prose keywords, heading counts, etc). Perhaps we could also write a toolbar or web UI tool so you can put two documents literally side by side (one pane in English, the other pane a giant <textarea>).
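
To make the "quantitatively bad translation" idea concrete, here is a rough sketch that compares heading and code-block counts between the English document and a translation. It only uses Python's standard `html.parser`; the set of compared tags and the function names are assumptions, and a real tool would need more signals than this.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed set of language-independent tags worth comparing.
COMPARED_TAGS = {"h2", "h3", "h4", "pre", "code"}


class StructureCounter(HTMLParser):
    """Counts how often each compared tag is opened in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in COMPARED_TAGS:
            self.counts[tag] += 1


def structure(html: str) -> Counter:
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def structural_drift(english_html: str, translated_html: str) -> dict:
    """Return {tag: translated_count - english_count} for tags that differ."""
    en, tr = structure(english_html), structure(translated_html)
    return {
        tag: tr[tag] - en[tag]
        for tag in sorted(COMPARED_TAGS)
        if tr[tag] != en[tag]
    }
```

An empty result means the two documents have the same skeleton; a translation that is missing two code examples would show up as `{"pre": -2}` and could be flagged for a translator's attention.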

But one thing we never considered is: who's going to merge and sign off on a git PR when they don't understand what it says? Perhaps the simple workaround would be to just merge it as long as it doesn't appear to introduce external links to sites that look shady. Another "workaround" is to leverage the Mozilla Reps program or Mozilla's own L10n team to drum up and develop a chain of trust, so that we have non-Mozilla-staff contributors we trust to at least review PRs but leave the merging to someone who is staff.
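
The "no shady external links" check is the kind of thing that could be automated without understanding the language. A minimal sketch, assuming we have the document's HTML before and after the PR; `TRUSTED_HOSTS` is a hypothetical allowlist and the function names are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical allowlist; a real one would need to be curated.
TRUSTED_HOSTS = {"developer.mozilla.org"}


class LinkCollector(HTMLParser):
    """Collects the hostnames of all absolute links in a document."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        host = urlparse(href).hostname  # None for relative links
        if host:
            self.hosts.add(host)


def external_hosts(html: str) -> set:
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hosts


def new_untrusted_hosts(before_html: str, after_html: str) -> list:
    """Hosts linked after the change that were not linked before and are not trusted."""
    added = external_hosts(after_html) - external_hosts(before_html)
    return sorted(added - TRUSTED_HOSTS)
```

A reviewer who cannot read the prose would then only need to eyeball the returned list of newly introduced hosts.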

The other harsh truth is that for the past 6+ months we've been developing the Yari prototype by focusing on English. That means that a lot of language-related features haven't gotten their fair share of testing, experimentation, and problem-solving. Who knows, will you get out-of-memory errors in CI if you try to build all locales?

There's also the potential of using machine learning (aka. external APIs) to somehow have an external service translate English HTML documents to various languages and then we serve them. This is not without some immediate drawbacks:

  1. They cost thousands of $$$. (it's probably affordable for Mozilla but I dare not even think about how long it would take to clear the paperwork about who/how to pay for it)
  2. We haven't even started writing the tooling for any of it.
  3. How do you decide when and what to have translated?
  4. As part of the tooling, we'd also need to take on the security and reliability risk of any such tooling.

All of the above is totally feasible and would even be quite fun to work on. External APIs or UI tools, it's just engineering. Translators who enjoy contributing would be happy. We would maximize our SEO footprint by having all the translations. It would make us not look so "North America global" if we support "every" language. Glory all around!

But building all of this will take time. Many months. With a reduced development team, assume you have to double the time until we get to a working solution. In the meantime, the Wiki is open. With the doors wide open and nobody employed to police the un-reviewed production edits, the quality of MDN will deteriorate more and more and ruin all the hard work that went into establishing MDN as the best source of truth for learning and looking things up about web development.

peterbe commented 4 years ago

Note-to-self; We need a special banner for non-English content that indicates quite loudly that it's not actively maintained. For example, the "Edit in GitHub" link needs to NOT appear on these pages.

peterbe commented 4 years ago

Here's the issue about naming things: https://github.com/mdn/yari/issues/1242

peterbe commented 4 years ago

The code that dumps translated content to a repo separate from both the archived and the active-English ones is already in place.

The next actions are to decide:

  1. What should it say in the banner on now frozen translated content?
  2. Where should the banner be located and what should it look like?

peterbe commented 3 years ago

This is too old now. We have built most of the things we need up until now. We already have the stuff in place and working for freezing translated content.

https://github.com/mdn/yari/pull/1673 is an interesting idea of uplifting translated content.

At this point, this issue isn't really helping us make progress so I'm going to close it.