suttacentral / bilara

Our Computer Aided Translation software

Add "suggest" mode #14

Open sujato opened 5 years ago

sujato commented 5 years ago

When proofreaders etc. are using a text, or sometimes even just for oneself, it is necessary to have a "suggest" mode. (Not to be confused with TM suggestions AKA matches!)

This is a normal part of the development workflow on Git, and is handled via the "pull request". We should map our usage onto Github's as closely as possible.

Suggestions may be made by separate proofreaders, or by a translator on their own work (for example, they may want to remind themselves to do more research).

See the mockup for this design on suggest_dummy branch.

Flow for the proofreader

A user may add a correction, a message, or a blank.

For the user, the process is as follows.

  1. Enter suggest mode. Here, all entries will be pull requests on new branches.
    • This applies equally regardless of whether you are on unpublished or published.
  2. When wanting to make a suggestion or correction, the user clicks on a translated text field.
  3. A proofreader may not edit the field, but clicking on it activates a form field.
  4. The form field has a main text input for making the suggestion. At its initial state, this contains the current translation.
  5. The suggester makes whatever edits they want to this.
  6. There is also a second text field that is for messages. A suggester may write a message explaining their change. Or they may leave a message without making a suggestion.
  7. Leave both text fields blank to say "needs work".
  8. Messages are added to the commit message.
    • Bear in mind, normally no message is needed, they just fix a typo or whatever. So it's fine to use the default commit message. But on the Bilara UI, only actual messages should be shown.
  9. When user is ready, they click "submit".
  10. This is equivalent to "Create a new branch for this commit and start a pull request" on Github.
    • I think it should automatically trigger the next stage, "create a pull request". These are two separate stages on Github, but I think we should combine them.
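
The three kinds of entry described above (a correction, a message, or a blank) can be sketched as follows. This is only an illustration under assumptions: `SuggestionForm`, `classify`, and the returned labels are hypothetical names, not part of Bilara.

```python
from dataclasses import dataclass


@dataclass
class SuggestionForm:
    """Hypothetical payload from the suggester's form."""
    current_translation: str
    suggestion: str  # main text input; starts out as the current translation
    message: str     # optional explanatory message


def classify(form: SuggestionForm) -> str:
    """Return which kind of entry the suggester submitted."""
    suggestion = form.suggestion.strip()
    message = form.message.strip()
    # Both fields blank means "needs work".
    if not suggestion and not message:
        return "needs_work"
    # A non-empty text that differs from the current translation is a correction.
    if suggestion and suggestion != form.current_translation.strip():
        return "correction"
    # Text left untouched (or cleared) but a message written: message only.
    if message:
        return "message_only"
    # Text untouched and no message: nothing to submit.
    return "no_change"
```

A correction may still carry a message; in that case the message would simply travel along with it.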

Flow for the main translator

  1. When they log in, if suggestions have been made, there is a notification.
  2. Click on that and see a list of suggested changes, with messages.
  3. Suggestions are defined in a <form> field below the translation segment.
  4. User name of suggester is defined in fieldset legend.
  5. Each suggestion has three options:
    • Accept: proposed change is set for commitment. This is checked by default (because usually proofreaders do a good job).
    • Reject: proposed change is set for deletion
    • Leave: proposed change is left as-is for further consideration
  6. If the suggester has left a message, this is visible.
  7. There is a text input field for responding to the suggester or just leaving a message. This becomes a comment on the PR.
  8. Once the action is selected and note made, click submit. This applies to the whole field.
    • The accepted text is added to the translation field
    • Rejected text is deleted, i.e. the pull request is closed.
    • Message is submitted as comment to PR.
    • Important!: Submit places focus on translation input field. This is so the translator can keep a smooth flow going. They don't have to move the cursor or the mouse. Click submit, double-check text is correct, press enter.
  9. After accepting or rejecting, the translator commits the text as usual by pressing enter while focussed on the text field. If they like, they can edit the accepted string before committing.
  10. Each suggestion remains visible on the screen as the translator works through them. This is because sometimes you change your mind!
    • Perhaps we could make a more sophisticated UI for this, but for now, just leave them.
  11. If there are two or more suggestions, you can accept one of them or just edit the field. Don't try to merge them. In most cases, dual suggestions would be incompatible anyway.
  12. If there is a merge conflict—for example, if the segment has been edited after the suggestion was made—then do not try to resolve it via Git. Just tell the user that there is a conflict, let them resolve it by just writing in the text field and deleting the suggestion.
  13. When you reach the end, a "no more suggestions" message is shown.

UI

The mode should be togglable from the toolbar, and it should create an obvious UI effect so that users are clear when they are suggesting and when committing.

UI for Suggester

<form class="suggest">
    <fieldset>
      <legend>Suggestion by NitPicker</legend> 
      <input type="text" id="note" name="note" title="Write suggestion here" value="The content of the original translation string." />  
      <input type="text" id="message" name="message" title="Leave a message if you wish" /> 
      <button type="submit" title="Submit suggestion and/or message">Submit</button>
   </fieldset>
</form>

UI for translator

Screenshot from 2020-03-30 19-50-00

<form class="suggest">
    <fieldset>
      <legend>Suggestion by NitPicker</legend> 
      <span class="suggestion" title="Suggested revision for translation">Thus have I heard.</span> 
      <span class="radio-buttons">
           <input type="radio" id="accept" name="edit" value="accept" checked="checked" /> 
           <label for="accept" title="Accept suggested change to translation">Accept</label> 
           <input type="radio" id="reject" name="edit" value="reject" /> 
           <label for="reject" title="Reject suggested change to translation">Reject</label> 
           <input type="radio" id="leave" name="edit" value="leave" /> 
           <label for="leave" title="Leave suggested change to translation for further consideration">Leave</label>
      </span> 
      <span class="message" title="Message on translation or clarification of suggestion">Perhaps this could be improved by making it good.</span> 
      <input type="text" id="message" name="message" title="Leave a message if you wish" /> 
      <button type="submit" title="Submit form content and shift focus to translation text field">Submit</button>
   </fieldset>
</form>

brahmali commented 4 years ago

Cool! Do you personally know this fellow Nitpicker? I could use them! Otherwise I say let's go for it.

blake-sc commented 4 years ago

I'm not sure that a Github PR would be the most appropriate format, due to the general unwieldiness of interacting with PRs. Basically PRs live on Github, so the server would have to talk back and forth with Github whenever it deals with suggestions, which seems to me unwieldy, slow and error-prone (relevant xkcd): the very best way to add bugs to software is to do more network requests. At the moment Bilara is fully functional without Github; it just synchronizes its git repository with the one on Github. If Github goes down for a day it wouldn't stop Bilara working, besides new logins, and I'd like to keep it that way.

I would propose having suggestions stored in JSON in the repository in a special suggestions file.

The file might look something like this:

{
  "translation-en-sujato":{
    "dn1:1.1.1": [
      {
        "root_string": "Evaṃ me sutaṃ—",
        "original_translation": "So I have heard. ",
        "suggestion": "Thus have I heard. ",
        "suggester": "NitPicker",
        "last_modified": "...",
        "status": "pending|rejected|accepted_with_modification",
        "proofreader_message": "I like this better",
        "translator_message": "I think it's dumb"
      },
      { ... }
    ]
  }
}

Essentially, each time a suggestion is made an entry is added to suggestions.json. Entries are grouped by the meta UIDs (or perhaps there might be a file for each set of meta UIDs). At the next level are the segment IDs, with each ID having a list of suggestions, which will normally contain only one suggestion but can contain more.

This suggestions file could be parsed and displayed on its own page, as a list of suggestions. And perhaps any suggestions could also be loaded into the general translation view for a translation.
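
As a rough sketch of that grouping, adding an entry could look like the following. The helper `add_suggestion` is a hypothetical name, and the ISO timestamp for `last_modified` is an assumption, since the example above leaves that value elided.

```python
import json
from datetime import datetime, timezone


def add_suggestion(data: dict, muids: str, segment_id: str, entry: dict) -> dict:
    """Append an entry under meta UIDs -> segment ID, creating levels as needed."""
    data.setdefault(muids, {}).setdefault(segment_id, []).append(entry)
    return data


suggestions = {}
add_suggestion(suggestions, "translation-en-sujato", "dn1:1.1.1", {
    "root_string": "Evaṃ me sutaṃ—",
    "original_translation": "So I have heard. ",
    "suggestion": "Thus have I heard. ",
    "suggester": "NitPicker",
    # Assumed format; the thread does not specify one.
    "last_modified": datetime.now(timezone.utc).isoformat(),
    "status": "pending",
    "proofreader_message": "I like this better",
    "translator_message": "",
})

# ensure_ascii=False keeps the Pali diacritics readable in the file.
print(json.dumps(suggestions, ensure_ascii=False, indent=2))
```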

Accepting

When it comes to accepting, I propose 3 basic possibilities:

  1. The translator accepts the suggestion. If the suggestion is simply accepted, the translation is updated and the suggestion removed from suggestions.json (we could also consider letting the proofreader see their suggestion was accepted).
  2. The translator keeps the current translation. This changes the suggestion status to "rejected". The suggestion remains in suggestions.json. The proofreader can see that their suggestion was rejected.
  3. The translator edits the string, and submits a string which is the same as neither the original translation nor the suggested string (for example, they take inspiration from the suggestion, but think it can be made even better). This updates the translation and sets the suggestion status to "accepted_with_modification". The proofreader can see their suggestion was accepted with modification.
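
The three possibilities come down to comparing the submitted string with the original and the suggested one. A minimal sketch, assuming trailing whitespace is not significant (`resolve` is a hypothetical helper, not existing Bilara code):

```python
def resolve(original: str, suggested: str, submitted: str) -> str:
    """Map the string the translator submits to one of the three outcomes."""
    original, suggested, submitted = (
        s.strip() for s in (original, suggested, submitted)
    )
    if submitted == suggested:
        # Case 1: the entry would then be removed from suggestions.json.
        return "accepted"
    if submitted == original:
        # Case 2: the current translation is kept.
        return "rejected"
    # Case 3: neither the original nor the suggestion.
    return "accepted_with_modification"
```

If the suggestion happens to equal the original, this sketch treats it as accepted; whether that case can even occur is left open here.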

Conflicts

A suggestion can be stale, that means that the current translation is not the same as the original translation the suggestion was made against. For example it might have taken a few days for the translator to get around to checking suggestions, during which time he'd made some search-replaces which modified the original translation.

There are two possible points in time when a stale suggestion can be detected: when the suggestions are displayed on the client, or when the translator tries to accept/modify a suggestion. It would probably be best to try and detect stale suggestions when they are displayed to the translator, as it's unlikely the translator will ninja-change his translations while reviewing suggestions.

When there is a stale suggestion, the UI should display something like:

Stale Suggestion!
Original: So I have heard. 
Current: Heard thus I did. 
Suggested: Thus I have heard. 

The suggestion state would be set to "accepted" if the suggestion is accepted, "rejected" if the current translation is accepted or "accepted_with_modification" if the translator composes a translation different to both the current and suggested translation.
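
Detecting staleness at display time is a plain string comparison against the stored original. The sketch below assumes the field names from the JSON example earlier in this comment; `is_stale` and `stale_notice` are hypothetical names.

```python
def is_stale(suggestion: dict, current_translation: str) -> bool:
    """A suggestion is stale if the translation changed after it was made."""
    return suggestion["original_translation"] != current_translation


def stale_notice(suggestion: dict, current_translation: str) -> str:
    """Build the text the UI might show for a stale suggestion."""
    return "\n".join([
        "Stale Suggestion!",
        f"Original: {suggestion['original_translation'].strip()}",
        f"Current: {current_translation.strip()}",
        f"Suggested: {suggestion['suggestion'].strip()}",
    ])


entry = {
    "original_translation": "So I have heard. ",
    "suggestion": "Thus I have heard. ",
}
if is_stale(entry, "Heard thus I did. "):
    print(stale_notice(entry, "Heard thus I did. "))
```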

We could also consider simply not reviewing stale suggestions, that is stale suggestions get bounced back to the proofreader.

Proofreader Actions

Just as the translator can review the pending suggestions, the proofreader should be able to review all of their own suggestions, including seeing stale suggestions. They can modify a suggestion regardless of its state; this changes it to pending state (and updates the original_translation if it was stale). Or they can dismiss a suggestion, which simply deletes it; again, this option is available regardless of its state.

Perhaps the server would automatically remove suggestions which are in an "FYI" state like rejected/accepted_with_modification after a period of time.
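
Such a cleanup could be sketched like this. The 30-day period, the ISO 8601 `last_modified` format, and the name `purge_old` are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# States that only exist to inform the proofreader.
FYI_STATES = {"rejected", "accepted_with_modification"}


def purge_old(suggestions: list, now: datetime,
              max_age: timedelta = timedelta(days=30)) -> list:
    """Drop FYI suggestions older than max_age; keep everything else."""
    kept = []
    for s in suggestions:
        modified = datetime.fromisoformat(s["last_modified"])
        if s["status"] in FYI_STATES and now - modified > max_age:
            continue  # old enough to be forgotten
        kept.append(s)
    return kept
```

Pending suggestions are never removed by this, however old they are.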

sujato commented 4 years ago

I see the issue with the network connection, but I'm not sure that I am fully convinced. Basically I am concerned that we are throwing away all the rich support of Github's handling of errors and merging.

Philosophically, I want to avoid treating Bilara as a full-featured translation app. It shouldn't become Pootle.

Rather, think of Github as the basic app. Bilara is a bit of GUI grease to make editing translations on Github easier. It should only be used for things that Github doesn't handle well. It's not easy to edit JSON files by hand on Github, so Bilara helps with that. Search is clumsy and used all the time by translators, so we need that.

But handling changes to files, recording the history of them, remembering who did what when, sorting and viewing them in extremely flexible ways: Github is awesome for that.

Okay, so I have three specific problems with your proposal, please correct me if I'm wrong here. Obviously you're the one building the thing, so you decide what mechanism we use. But I want to make sure we meet these requirements.

Visibility of history and other details

I'm not sure exactly what you mean by the "repo" where suggestions.json would live. Where exactly does this JSON file live, and how can I see it? How are changes made to it other than through bilara?

It's essential that we see the history of proposals and changes. First, because it is good to record people's contributions. Second, because sometimes you want to review things. Maybe you think you've made a change, and it doesn't appear, so you want to check whether it was made. Or you accept a proposal, but months later, it seems wrong to you and you can't remember why you accepted it. All this is stuff you get out of the box on Github using PRs, which, again, were designed for exactly this purpose. I'm really reluctant to just throw all that away.

But, if your proposal keeps this stuff viewable on Github, then we're good.

Complexity of maintaining JSON IDs

Next, I'm nervous about adding yet another json file synced by ID. This is mostly because of what happens if/when we change IDs. We have to assume this will happen sometimes, and it will break things. Currently for ID changes to content we use bilara I/O, but this would then add another complexity to that. Of course, in the PR method, it will break as well, the difference is, there Github already handles merge conflicts and errors, so we don't have to implement all this ourselves.

The IDs only deal with substantive content, and it seems to me that we should not depart from this without expecting additional costs.

Async messaging

The other problem is that, to my mind, it doesn't fully distinguish between the jobs of translating and proofing, specifically in regard to time and place.

When you translate, you want to see the changes immediately. You want to be able to go to another sutta and see your previous changes right away.

But when you are doing proofing, there will inevitably be time lag, so this is not an issue in the same way. Moreover, the translator and the proofer are typically in different places, so the basic problem then becomes: how do you handle async messages?

This can't be done locally. The purpose is to communicate with someone else, somewhere else, sometime else. And this obviously requires a network.

Still, it doesn't require a network connection live every second. So how about this: we store data locally in JSON as a person works, and use the service worker to communicate with Github in the background, making the PRs and changes along the way. That way, someone can edit or translate offline, and when they get back online, the changes will be synced up. I'm guessing changes to Github already do something like this?

I do get the point about negotiating handshakes with Github for everything, and the potential for errors that creates. But anyway, I think we should be clear that the problem is not latency as far as the user is concerned.

Accepting

Now, leaving all that aside, to respond to your proposed mechanism for accepting, may I suggest some changes.

First thing is, both the proofer and the translator (and others in the project) must always be able to see the entire history, what is accepted, rejected, and so on. However, it is not necessary to view this on Bilara.

Proofers can review the history of suggestions on Github. If changes are done via PRs, they can look at the history of PRs. If all the suggestions are just changes to a json file, they could look at the history of changes to that file.

Regardless, whatever mechanism we use, the basic requirement is that the history be readily viewable on Github without building special views on Bilara. Translators and proofers are serious users, not flybys, we don't have to hold their hands for everything.

As for states, we should have four states, not three:

pending | accepted | rejected | modified

Only pending suggestions are kept in the "live" data and visible on Bilara. Once it is accepted/rejected/modified, the file is updated to remove the entry (or the PR is closed or whatever mechanism we end up using).

That way we can simply have one suggestions view for a given translation. Keep the views simple! Anyone working on that translation can click "Pending suggestions" and see the same list. Pending suggestions need to live on Bilara, because they are edited. All others don't need to be seen on Bilara at all, just use Github.

Better to use simply modified rather than accepted_with_modification. It is not uncommon that a proofer will identify a genuine problem with a translation, yet their proposed solution is not ideal. In such cases, the translator makes a new translation that is not simply a modification of the suggestion. Anyway, the point is, modified is simpler and more general.
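
To make the four states concrete, here is a small sketch of keeping only pending suggestions in the live data. The `Status` enum and `split_live` are hypothetical names; what happens to the resolved entries (removed from the file, PR closed, etc.) is left to whichever mechanism is chosen.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    MODIFIED = "modified"


def split_live(suggestions: list) -> tuple:
    """Separate pending suggestions (shown on Bilara) from resolved ones."""
    live, resolved = [], []
    for s in suggestions:
        # Status(...) raises ValueError on an unknown state, which surfaces typos early.
        if Status(s["status"]) is Status.PENDING:
            live.append(s)
        else:
            resolved.append(s)
    return live, resolved
```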

Stale translations

One minor issue, you say, "it's unlikely the translator will ninja-change his translations while reviewing suggestions". Actually, not unlikely at all. It's expected that sometimes a proposed change to a term or phrase will occur in more than one place, and you'll want to make sure that it is fixed everywhere before moving on with other suggestions.

More generally, though, I think this is over-engineering the problem of stale suggestions. Typically a suggestion is just a minor change to a short piece of text, one that the translator is very familiar with anyway. There's very little cognitive load when reviewing text and suggestion: just a glance in most cases. A translator is not really interested in whether the suggestion was based on an earlier version. They just want to know if it will make the translation better. Extra details add cognitive load without tangible benefit. The translator wants to just see, click, move on.

In a small percentage of cases, a translator might want to look at things in more detail. For example, they can't remember why a change was made. Again, in such infrequent cases they could simply review it on Github. We don't have to anticipate every context.

So I'd recommend ignoring the issue of stale translations as far as the GUI is concerned. Just deal with the suggestion as normal. They only become an issue if there is a merge conflict.

If I am wrong and translators demand visibility of stale translations, let it be an enhancement later on. For now, let's get a simple MVP.

blake-sc commented 2 years ago

Okay then, so reviewing prior discussion and requirements

First thing is, both the proofer and the translator (and others in the project) must always be able to see the entire history, what is accepted, rejected, and so on. However, it is not necessary to view this on Bilara.

Proofers can review the history of suggestions on Github. If changes are done via PRs, they can look at the history of PRs. If all the suggestions are just changes to a json file, they could look at the history of changes to that file.

Regardless, whatever mechanism we use, the basic requirement is that the history be readily viewable on Github without building special views on Bilara. Translators and proofers are serious users, not flybys, we don't have to hold their hands for everything.

Okay then.

Using branches and pull requests

Let's suppose the most "obvious" approach is used. A suggestion is implemented by creating a new branch with the applicable file being modified, a commit is created which includes some extra details in the commit message, and then a pull request is created from that new branch to unpublished.

By reviewing the history of that file, it is possible to see accepted suggestions. However, rejected suggestions never become a part of the unpublished branch. They do exist in the alternative branch, and locally, if the branch has been pulled, a "joint history" can be used to see the rejected suggestions. A git user with the apprentice wizard skill level could certainly do this. On Github? I don't know how to see multiple branch history.

The history of both accepted and rejected suggestions would also be saved as Pull Requests, which in some sense keep the branches in existence even after they have been deleted from the repository (as a branch can be restored from a PR). However it must be noted that Pull Requests are kind of just a "good faith" history of what has happened on Github. They are not designed to be immutable, and Pull Requests can be edited freely. For example, the account could get hijacked and all the messages rewritten with spam links. I'm not saying it's a likely scenario, but Github wouldn't have a problem with this. The git repository system, by contrast, would literally survive a nuclear war bombing out the Github servers: if there's a clone of the repo somewhere, that clone contains all the data and all the history of the repo, and no-one can modify the history of that clone without permission of the user (git history can be edited quite easily, but only merged into a clone through the use of git pull --force ).

Pull requests are not a part of the git repository itself, so that data doesn't exist in a git clone. If Microsoft decides to change how Pull Requests work on Github they can totally do that; the behavior is not strongly specified. This is unlike git intrinsic behavior, where if you install a particular version of git, it will continue to work in exactly the same way until the heat death of the universe. Since Github is a web app, Microsoft can change it whenever they want. Also, any behavior which depends on Github specific functionality means that Bilara is locked to Github, whereas behavior which only depends on git means it could be migrated more freely to alternatives like Bitbucket, Gitbucket or GitLab. Why might this be desirable? Besides Microsoft deciding they want to be evil after all, perhaps we want to make Bilara more widely available and some organization just happens to use self-hosted GitLab and wants to continue doing that.

Generally my approach has been that dependency on git is good and dependency on Github is something we suffer, and ideally it should be as easy as possible to substitute another host of git repositories for Github. I am also less averse to what could be called "deployment", that is, stuff that is basically leaving the sphere of the bilara app. With publishing, for example, Bilara makes the pull request and keeps track of it until it is closed, then forgets about it for the rest of time and never wants to know about the contents of that pull request again.

Comment only suggestions

Anyway, let's go back to another form of suggestion: a comment but with no change to the translation string. Now there is no proposed change to any file, oops! A git commit can only be associated with a file by actually changing that file. It is technically possible to create an empty commit which is just a commit message with no changes to files (this is basically so users don't have to make an "almost nothing" change like messing with the whitespace just to make a commit for some ulterior motive). It's also technically possible to create a pull request against an empty commit. Such a pull request proposes no changes to files; it is simply a question of whether the empty commit should be integrated into the destination branch. However this is clearly horrible, because these suggestion comments are now dissociated from the pertinent file and won't be discovered when looking at the history for that file. Instead it would be necessary to do a string search against commit messages or the pull requests. Perhaps even more abominable, the acceptance or rejection of the pull request simply includes or doesn't include the empty commit and says nothing about changes actually made to the file or not. In short, there's a lot of meaningless and confusing stuff going on. Anyway, on Github if there's an issue which doesn't involve a suggested change to a file, it is, well, an "Issue", rather than a Pull Request.

Hence I would consider that using branches and PRs is only technically doable, but is a horrible implementation with grievous hacks, needlessly brittle due to tight coupling to a proprietary web application.

The programmer's dream implementation

As a programmer, the obvious approach is to use a database such as ArangoDB (strictly speaking a multi-model document and graph database rather than a relational one). This is because a suggestion lacks a clean sense of belonging to anything in particular. What I mean is that a suggestion is made by a particular user, associated with a particular file, but directed towards another user, and the primary "view" we want to construct is of suggestions directed towards a particular user, though we probably also want to integrate suggestions into the translation view for a file so all proofreaders can see them. Databases excel at grabbing stuff from here and there and bringing it together as required. They are also great when it comes to dynamic data, stuff that is expected to change over time, like in this case the status of a suggestion and the addition of new suggestions. They are very fast and responsive, and they take care of race conditions (a programming term for when two things happen "simultaneously", causing obscure and very hard to track down bugs).

Essentially it's great in every way except one teeny tiny flaw: the suggestions aren't included in the git repository. Hence while the app doesn't need or even want it for core functionality, the desire to preserve history means suggestions should be written into the git repository... somewhere, somewhen.

I do propose using an ArangoDB collection as the primary source of truth for suggestions, and synchronizing that data into the git repository as a historical record and backup. The focus would be on one-way synchronization, but it would be able to restore the collection from the git repository. Manually editing the suggestions in the git repo would not be a well supported workflow (e.g. if there is a "suggestion" which isn't well supported by the Bilara app, just make an Issue or a PR with reference to the translation file). Nevertheless, under exceptional circumstances suggestions could be edited then imported into Bilara, but this would be for something like a format change.
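
A minimal sketch of the one-way synchronization, with a plain list of dicts standing in for the database collection (no real ArangoDB calls are shown, and the git commit step is omitted). The function names, the sort key, and the record fields `file`, `segment_id`, and `user` are assumptions for illustration.

```python
import json
from pathlib import Path


def export_suggestions(records: list, path: str) -> bool:
    """Write records to a JSON file; return True only if the file changed."""
    # A stable order and sorted keys keep git diffs small and meaningful.
    ordered = sorted(
        records, key=lambda r: (r["file"], r["segment_id"], r["user"])
    )
    text = json.dumps(ordered, ensure_ascii=False, indent=2, sort_keys=True) + "\n"
    target = Path(path)
    if target.exists() and target.read_text(encoding="utf-8") == text:
        return False  # nothing new, so no commit would be needed
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return True


def import_suggestions(path: str) -> list:
    """Restore the records from the backup file (the exceptional direction)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Returning whether the file changed lets the caller skip empty commits.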

Should history accumulate in the present or get buried?

When we want something to remain as part of a git repository history, there are essentially two approaches which can be taken. The first is to make all the history exist at present. The other way is to overwrite stuff, so that it can only be uncovered with a tool like "git log". As an example, when a translation is modified, the old translation still exists in the branch history but is completely gone from the present (working tree), but in principle it wouldn't be impossible to have a record of all past versions, like time-stamped or something.

In this case we could make it that a suggestion which isn't active (has been resolved by being accepted or rejected) is simply deleted from the present, and it only exists as commits in git history. The present would only contain active suggestions. To see a history of resolved suggestions would require using something like git log or gitk to review the history of the relevant suggestion file. It is an approach which keeps the working tree uncluttered and up to date and would mirror the way old versions of translations work: if you really want to see the history you can use git log.

Alternatively all suggestions could remain in the present, in their final concluded state. This is more cluttered, but it is easier to review history, as for example seeing the history of every suggestion ever made for a segment id would be as simple as loading the relevant file in a text viewer. This is especially easier for software, as trawling back through git history isn't that fun for code. For example if we want to be able to see, in Bilara, the history of suggestions for a segment, then this is definitely the way to go, if we only want to see active suggestions then either approach is fine.

One reason we might want to see the complete history is that, especially when there are multiple proofreaders, it might be nice to quickly see past suggestions, even those which were accepted or rejected. This also helps indicate the "proofreading coverage" for the text.

But where?

So previously I mentioned that suggestions can be associated with multiple things: a file, the translator for that file, the person who made the suggestion.

Perhaps the most obvious basis for "what file to stick the suggestion into", would be a standoff file that incorporates the name of the file it is associated with.

Possibilities such as:

dn1_translation-en-sujato.suggestions.json
suggestions/.../dn1_translation-en-sujato.json

However something important to note here is that it is not easy to shoehorn a suggestion into our standard standoff format, which is simply key:string pairs, this is clumsy for suggestions which have quite a lot of attached information, and are not in a 1:1 ratio (e.g. there is one translator comment for each segment id in a translation, but there could be more than 1 suggestion). The data for a suggestion would ideally look something like this:

[
  {
    "user": "yoda",
    "segment_id": "dn1:1.1.1",
    "file": "dn1_translation-en-sujato",
    "original_text": "So I have heard. ",
    "suggestion": "Heard thus did I",
    "comment": "",
    "status": "rejected"
  }
]

(^ that's pretty much what the entry would look like in the database collection)

We could also structure it something like this:

In a file named something semantically equivalent to "suggestions/dn1_translation-en-sujato.json"

{
  "dn1:1.1.1": [
    {
      "user": "yoda",
      "original_text": "So I have heard. ",
      "suggestion": "Heard thus did I",
      "comment": "",
      "status": "rejected"
    }
  ]
}
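
Going from the flat database-style records to this per-file, per-segment layout is a simple grouping step. A sketch, assuming the field names used in the examples here (`group_by_file_and_segment` is a hypothetical helper):

```python
from collections import defaultdict


def group_by_file_and_segment(records: list) -> dict:
    """Turn flat suggestion records into {file: {segment_id: [entries]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        # The file and segment ID become keys, so drop them from the entry itself.
        entry = {
            k: v for k, v in record.items() if k not in ("file", "segment_id")
        }
        grouped[record["file"]][record["segment_id"]].append(entry)
    # Convert to plain dicts so the result can be passed straight to json.dumps.
    return {name: dict(segments) for name, segments in grouped.items()}
```

Each top-level key would then correspond to one suggestions file.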

But in any case, returning to the point that it is not standard standoff data, it might be desirable to make the file name reflect that, like how we put an underscore before filenames that contain configuration.

Earlier I proposed a filename such as "dn1_translation-en-sujato.suggestions.json", which deliberately uses a weird filename for weird contents. This could also be described as second level standoff data: the first level is root string, translation string, markup string, but the suggestion is specifically with reference to the translation string, so unlike the root or translation it can't stand alone. A name like dn1_translation-en-sujato.suggestions.json or dn1_translation-en-sujato_suggestions.json would reflect this second level nature rather than masquerading as first level data.
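
Either naming scheme can be derived mechanically from the translation filename. A sketch, where `suggestions_path` and its `style` parameter are hypothetical:

```python
from pathlib import PurePosixPath


def suggestions_path(translation_path: str, style: str = "suffix") -> str:
    """Derive the name of the suggestions file for a translation file."""
    p = PurePosixPath(translation_path)
    if p.suffix != ".json":
        raise ValueError(f"expected a .json file, got {translation_path!r}")
    if style == "suffix":
        # dn1_translation-en-sujato.suggestions.json
        return str(p.with_name(p.stem + ".suggestions.json"))
    if style == "underscore":
        # dn1_translation-en-sujato_suggestions.json
        return str(p.with_name(p.stem + "_suggestions.json"))
    raise ValueError(f"unknown style: {style!r}")
```

Any directory part of the path is preserved, so the suggestions file would sit next to the translation it refers to.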

The Summary and critical questions:

  1. The use of Pull Requests is absolutely not an option, accept no compromise.
  2. Storing the suggestions primarily in ArangoDB with it being backed to the git repo is by far the most straightforward and painless implementation.
  3. How interested are we in seeing resolved suggestions? Should they be updated as "Accepted"/"resolved" and written to the suggestions file, or should they be removed from the suggestion file with accepted/rejected noted in the commit message? If we want to quickly be able to see all past suggestions for a segment, or the like, then that informs how the data should be stored so Bilara can quickly fetch it for rendering views.
  4. What should the form and naming of suggestion files be? The simplest is literally just a single JSON file dumped from the database, everything in one file, serving mainly as a backup but technically allowing history to be reviewed by a sufficiently motivated human. A more human friendly scheme could be devised which uses logical stand-off style structuring.

sujato commented 2 years ago

The use of Pull Requests is absolutely not an option, accept no compromise.

Okay! I bow to your gitliness!

Storing the suggestions primarily in ArangoDB with it being backed to the git repo is by far the most straightforward and painless implementation.

Great, let's do that then.

How interested are we in seeing resolved suggestions?

Not very? Looking back at my previous statements, these things don't seem so important to me now. It might be occasionally helpful for a translator, but generally speaking, you just want to see the suggestion, accept or reject it, then move on.

What should the form and naming of suggestion files be? The simplest is literally just a single JSON file dumped from the database

Then do that!

For viewing the history, a non-technical user can use githistory:

https://githistory.xyz/suttacentral/bilara-data/blob/published/translation/en/sujato/sutta/mn/mn100_translation-en-sujato.json

In fact, I wonder if we could provide this as an option for viewing the history of a file? A link on the Bilara page to the githistory of the different things—translation, suggestions, comments, etc. Anyway, just a thought!
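
Generating such a link would only need the repository, branch, and file path. The sketch below just mirrors the shape of the example URL above; whether githistory keeps serving that URL pattern is an assumption, and `githistory_url` is a hypothetical helper.

```python
def githistory_url(owner: str, repo: str, branch: str, path: str) -> str:
    """Build a githistory link shaped like the example URL in this comment."""
    return f"https://githistory.xyz/{owner}/{repo}/blob/{branch}/{path.lstrip('/')}"


print(githistory_url(
    "suttacentral",
    "bilara-data",
    "published",
    "translation/en/sujato/sutta/mn/mn100_translation-en-sujato.json",
))
```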