[Pre-RFW] The Translation View: Conceptual Overview

Key Concepts

Project
Segment
Token
Source (language)
Target (language)

Project is the highest-level unit of work and contains everything else. In practice, a Project is the translation of a text from one language into one or more other languages.

Segment is a sentence or other similar language structure.

Token is a word or other similar language structure.

Source is the language from which the text is translated.

Target is the language into which the Source is translated. A single Project can have one or more Targets associated with it.

Translation View is the part of the Lopenling ecosystem that lives under https://lopenling.org/translate and covers the key functionality related to translating and editing texts. This is the focus of this conceptual overview.

External APIs refers to various APIs that live outside of the application, but which are under our control and can be readily accessed to enrich the translation experience.

Practical Example

The Project is to translate Bodhisattva's Way from Tibetan into English.

The entire text is broken down into Segments either manually by the translator, or automatically by a solution such as Botok.

Each Segment is broken down into Tokens automatically by a solution such as Botok.

The Source language is Tibetan.

The Target language is English.

The Translation View is populated with the Segments, where each Segment is translated into English.

Various External APIs are utilized, for example, to match Tokens with corresponding dictionary matches through the dictionary lookup API.
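To make the Project, Segment, and Token breakdown concrete, here is a minimal sketch. It is not Botok and does not use Botok's API: it is a naive stand-in that splits on the shad mark for Segments and on the tsheg mark for syllable-level Tokens. Real word-level tokenization needs a dictionary-backed tool. The function names and the sample text are assumptions made for illustration only.

```typescript
// Hypothetical stand-in for a real segmenter/tokenizer such as Botok.
// It only splits on punctuation, so the "tokens" are syllables, not words.
const SHAD = "\u0F0D";  // Tibetan mark shad, closes a clause or sentence
const TSHEG = "\u0F0B"; // Tibetan mark intersyllabic tsheg

// Break a text into Segments at every shad.
function splitIntoSegments(text: string): string[] {
  return text
    .split(SHAD)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

// Break one Segment into syllable-level Tokens at every tsheg.
function splitIntoTokens(segment: string): string[] {
  return segment.split(TSHEG).filter((part) => part.length > 0);
}

// Sample Tibetan text, used only to show the shape of the result.
const sampleSource = "བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། སངས་རྒྱས་ལ་ཕྱག་འཚལ་ལོ།";
const sampleSegments = splitIntoSegments(sampleSource);
const sampleTokens = sampleSegments.map(splitIntoTokens);
```

Each entry of `sampleTokens` is the Token list for the Segment at the same index, which is the shape the dictionary lookup would consume.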

Payloads

A general point here is that initially we might start with nothing but the payload and manage it in a short-term cache. Then, when something is written to long-term storage, with more persistence than the browser cache, it is the actual payload that gets stored.

In short, there is only ever one shape for each payload.

The advantage in this app is that every payload is very small, and even in the worst case there are not many of them.

The different payloads are structured as consistently with one another as possible.

Project Payload

We will have a single payload shape used throughout the whole system. This is what is stored in psql, what is delivered through the Hasura API, what is used by the frontend, and what is used by all other consumption endpoints, such as an end-user REST API or business logic functions.

From this arises the requirement to have a dictionary format payload which contains all the data for a Project:

Project-Name (id associated with user-id)
- Segments (id associated with Project)
  - Source (segment of text)
  - Target/s (segment of text)
  - Notes (add notes for the segment)
  - Style
  - Custom (this is a dictionary)

This is a rough sketch of how the Project payload is structured.
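The outline above could be typed roughly as follows. This is a sketch under assumptions, not a settled schema: every field name, the keying of Targets by language code, and the shapes of Style and Custom are guesses made for illustration. The JSON round trip at the end shows the "one shape everywhere" idea: what is stored is exactly what is read back.

```typescript
// Sketch only: field names are assumptions derived from the outline.
interface SegmentPayload {
  id: string;
  source: string;                          // Source segment of text
  targets: { [language: string]: string }; // one entry per Target (keying by language is assumed)
  notes: string[];
  style: { [property: string]: string };   // Style is not specified further in this overview
  custom: { [key: string]: unknown };      // free-form dictionary
}

interface ProjectPayload {
  projectName: string;
  userId: string;
  segments: SegmentPayload[];
}

// Placeholder values; a real Project would carry the actual text.
const exampleProject: ProjectPayload = {
  projectName: "bodhisattvas-way",
  userId: "user-1",
  segments: [
    {
      id: "seg-1",
      source: "(Tibetan source text)",
      targets: { en: "(English translation)" },
      notes: [],
      style: {},
      custom: {},
    },
  ],
};

// The same shape goes into storage and comes back out unchanged.
const storedProjectJson = JSON.stringify(exampleProject);
const restoredProject: ProjectPayload = JSON.parse(storedProjectJson);
```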

The number of Segments under one Project can range from just a few to tens of thousands.

A Project payload might end up being 5 to 10 megabytes for the largest projects. One user may end up having up to hundreds of projects, most of which will be very small.

Correspondence Payload

A key part of the translation process is communication with scholars, other translators, and editors. Correspondence can be either specific to a Segment or a range of Segments, or general to the Project. It is handled through the same payload.

Correspondence (map with user-id)
- Message (map with correspondence-id)
  - Participant
  - Content
  - Status (open / resolved / closed)
  - Reply-To (map with message-id)
  - About-These-Segments (one or more segment-id or None)

These two payloads, Project and Correspondence, govern the native content of the Translation View.

A Correspondence payload might end up being hundreds of kilobytes at most. A large Project may end up having thousands of correspondences.
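The Correspondence outline could be typed along these lines. Again this is only a sketch: the field names, the use of `null` for "about the Project in general", and the helper `messagesAboutSegment` are assumptions, not an agreed design.

```typescript
// Sketch only: names are assumptions derived from the outline.
type CorrespondenceStatus = "open" | "resolved" | "closed";

interface MessagePayload {
  id: string;
  participant: string;            // assumed to be the author's user-id
  content: string;
  status: CorrespondenceStatus;
  replyTo: string | null;         // message-id being replied to, if any
  aboutSegments: string[] | null; // segment-ids, or null when general to the Project
}

interface CorrespondencePayload {
  id: string;
  userId: string;
  messages: MessagePayload[];
}

// Hypothetical helper: messages that refer to one particular Segment.
function messagesAboutSegment(
  correspondence: CorrespondencePayload,
  segmentId: string
): MessagePayload[] {
  return correspondence.messages.filter(
    (message) =>
      message.aboutSegments !== null &&
      message.aboutSegments.indexOf(segmentId) !== -1
  );
}

const exampleCorrespondence: CorrespondencePayload = {
  id: "corr-1",
  userId: "user-1",
  messages: [
    { id: "m1", participant: "user-1", content: "Is this term correct?",
      status: "open", replyTo: null, aboutSegments: ["seg-1", "seg-2"] },
    { id: "m2", participant: "user-2", content: "General note on tone.",
      status: "open", replyTo: null, aboutSegments: null },
  ],
};
```

A Comments view for the active Segment could then show `messagesAboutSegment(exampleCorrespondence, activeSegmentId)`, while Project-level messages are the ones where `aboutSegments` is `null`.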

Frontend

The Translation View will consist of the following:

Project selector
Segments pane
Auxiliary pane

Project selector governs which Project payload will be delivered.

Segments pane displays Segments side by side (Source on the left, Target on the right), one Segment per row. Clicking anywhere in a Segment's row makes that Segment active and highlights it. The Segments pane will have a search bar above it for performing text search on the Segments.
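Because a Project payload is small enough to hold in the browser, the search bar could plausibly filter Segments in memory. A minimal sketch, assuming a case-insensitive substring match over both Source and Target. The type `SearchableSegment` and the function `searchSegments` are hypothetical names.

```typescript
// Sketch only: a single Target per Segment is assumed to keep the example short.
interface SearchableSegment {
  id: string;
  source: string;
  target: string;
}

// Return the Segments whose Source or Target contains the query text.
function searchSegments(
  segments: SearchableSegment[],
  query: string
): SearchableSegment[] {
  const needle = query.trim().toLowerCase();
  if (needle.length === 0) {
    return segments; // an empty search shows every Segment
  }
  return segments.filter(
    (segment) =>
      segment.source.toLowerCase().indexOf(needle) !== -1 ||
      segment.target.toLowerCase().indexOf(needle) !== -1
  );
}

const paneSegments: SearchableSegment[] = [
  { id: "seg-1", source: "source one", target: "The first line" },
  { id: "seg-2", source: "source two", target: "The second line" },
];
const secondOnly = searchSegments(paneSegments, "SECOND");
```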

Auxiliary pane is shown immediately to the right of the Segments pane. It displays auxiliary information per Segment, for example a machine translation suggestion that can be used to populate the Target field. Auxiliary pane has several modes:

Glossary mode
Machine Translation mode
Suggestion mode
Comments mode
Notes mode
Styles mode

Glossary mode takes up the whole pane for a single active Segment, showing all the glossary matches for the Tokens in the Segment.

Machine Translation mode takes up one row per Segment, showing a machine translation suggestion for each Segment in view. It can sometimes take time to load the machine translation suggestions, so the loading experience has to be considered in the design.

Suggestion mode takes up the whole pane for a single active Segment, showing all the translation memory matches for the Source Segment.

Comments mode takes up the whole pane for a single active Segment, showing all the relevant comments.

Notes mode takes up the whole pane for a single active Segment, showing all the relevant notes.

Styles mode takes up one row per Segment, showing the current Styling settings for each Segment.
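The six modes fall into two layouts: whole pane for the active Segment, or one row per Segment in view. One way to keep that rule in a single place is a lookup table, sketched below. The type names, the mode identifiers, and the helper `segmentsShownInPane` are assumptions made for illustration.

```typescript
// Sketch only: identifiers are hypothetical.
type AuxiliaryMode =
  | "glossary"
  | "machineTranslation"
  | "suggestion"
  | "comments"
  | "notes"
  | "styles";

type AuxiliaryLayout = "wholePane" | "rowPerSegment";

// Layout of each mode, as described in the mode list.
const AUXILIARY_LAYOUT: { [M in AuxiliaryMode]: AuxiliaryLayout } = {
  glossary: "wholePane",
  machineTranslation: "rowPerSegment",
  suggestion: "wholePane",
  comments: "wholePane",
  notes: "wholePane",
  styles: "rowPerSegment",
};

// Which Segments the Auxiliary pane needs data for in a given mode.
function segmentsShownInPane(
  mode: AuxiliaryMode,
  activeSegmentId: string | null,
  visibleSegmentIds: string[]
): string[] {
  if (AUXILIARY_LAYOUT[mode] === "rowPerSegment") {
    return visibleSegmentIds;
  }
  return activeSegmentId === null ? [] : [activeSegmentId];
}
```

With this, Glossary mode asks for data for the active Segment only, while Machine Translation mode asks for every Segment currently in view.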

Then the question is how we put it all together in a manner that is dead simple to maintain and develop further, so that literally a kid can come and do it as their first thing.

Storage

How do we make storage "hands off" and "bullet-proof"? The main concern with the data is not losing any of it, not even one bit.

So how do we make it so that it does not take even the kid, because there is never anything to do? What is the simplest and most bullet-proof way to store data today so that it lasts a long time but stays accessible? What is a good backup system for that?

Backend

Because everything in this app is a record and the data is very small, a setup with just psql and Hasura handling everything is a good fit, avoiding the idea of a separate backend as much as possible.

Went over it in a long discussion. Got a lot of questions, but will post them later in their own relevant RFWs.

For easier discussion @sidrun made a sketch. It isn't an actual design in any way; it exists just to make discussing the different parts easier. I painted it over in Paint :)

RFW parts could be as follows:

Main layout + Project Selector. This shouldn't include anything else: no segments, no auxiliary pane, just the main layout of the page (headers, footers, navigation menu) and then the project selector. The main layout by itself seemed too small for an RFW, so something should be included with it, and the project selector sounded like the best fit. If either of them grows too big, they could be split into 2 separate RFWs.
Segments pane as a whole
- Would there be a need for filters and sorting in the segments pane, like "Show only untranslated", "Show only translated", and sorting like "Order by newest", "Order by last translated", etc.? If so, they could be part of the segments pane RFW. It seems quite straightforward, but if the RFW starts to grow too big because of sorting and filtering, they could be split out into a separate one.
Auxiliary pane has lots of "subfeatures"; they should all be separate RFWs:
- Pane itself, with some basic switching between "full" and "row" mode
- Glossary mode
- Machine Translation mode
- Suggestion mode
- Comments mode
- Notes mode
- Styles mode

Some questions that arose were:

How are projects managed? Maybe we should do project management RFW first?
Will there be some kind of review system?
What's the difference between comments and notes? Maybe "Notes" can be merged into "Comments"? 🤔

Storage and backend will be covered in more detail in each separate RFW. Most likely we will end up using browser localStorage for in-browser storage and syncing it to Hasura, to make sure that accidentally closing the browser tab, a loss of power, or a loss of network won't delete any data. The data stays in the local browser until the browser is re-opened and the data is sent to Hasura. This might change a little bit: instead of localStorage we might need to use IndexedDB, sessionStorage, or a mix of those (although sessionStorage is cleared when the tab closes, so it would not cover that case on its own). In a perfect world we would only use localStorage, but we might run into some limitations: localStorage is typically capped at roughly 5 to 10 MB per origin depending on the browser, while IndexedDB quotas are much larger and scale with available disk space. Will start with localStorage and re-evaluate when starting to work and test with real data.
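The "keep it locally until it reaches Hasura" idea could look roughly like the sketch below. It is written against a small `KeyValueStore` interface that mirrors the `getItem` / `setItem` / `removeItem` methods of the browser's Web Storage API, so `window.localStorage` could be passed in directly. The in-memory class is there only so the sketch runs outside a browser. The key name, the function names, and the synchronous `send` callback are all assumptions: a real Hasura call would be asynchronous.

```typescript
// Subset of the Web Storage API that localStorage also provides.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

// Stand-in for localStorage when running outside a browser.
class InMemoryStore implements KeyValueStore {
  private items: { [key: string]: string } = {};
  getItem(key: string): string | null {
    const value = this.items[key];
    return value === undefined ? null : value;
  }
  setItem(key: string, value: string): void {
    this.items[key] = value;
  }
  removeItem(key: string): void {
    delete this.items[key];
  }
}

const PENDING_KEY = "lopenling.pendingProjects"; // hypothetical key name

function readPending(store: KeyValueStore): { [projectId: string]: string } {
  const raw = store.getItem(PENDING_KEY);
  return raw === null ? {} : JSON.parse(raw);
}

// Save the latest payload locally first, so a closed tab loses nothing.
function saveLocally(store: KeyValueStore, projectId: string, payloadJson: string): void {
  const pending = readPending(store);
  pending[projectId] = payloadJson;
  store.setItem(PENDING_KEY, JSON.stringify(pending));
}

// Try to send every pending payload; keep the ones that fail for next time.
function syncPending(
  store: KeyValueStore,
  send: (projectId: string, payloadJson: string) => boolean
): number {
  const pending = readPending(store);
  let sentCount = 0;
  for (const projectId of Object.keys(pending)) {
    const payloadJson = pending[projectId];
    if (payloadJson !== undefined && send(projectId, payloadJson)) {
      delete pending[projectId];
      sentCount += 1;
    }
  }
  store.setItem(PENDING_KEY, JSON.stringify(pending));
  return sentCount;
}
```

A payload is removed from the local store only after `send` reports success, so a failed or interrupted sync leaves it in place for the next attempt.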

In the backend we will try to use Hasura as much as possible. If for whatever reason Hasura can't be used, those cases will need to be investigated case by case. Currently it's hard to predict any.
