presciencelabs / tabitha-editor

Use Tokens to generate back translation #96

Closed craigp-atw closed 1 month ago

craigp-atw commented 3 months ago

As noted in #58, the backtranslation sometimes needs to know about structure that a simple regex can't capture. The token process is designed to handle structure, so it would be beneficial to integrate it into the backtranslation process.

Doing this would help work towards #23 and #61 as well as #58.

craigp-atw commented 3 months ago

I foresee three stages of the backtranslation:

  1. Structural rules
    • take the token output of the sense-selection rules (before the checker rules)
    • these will be rule-based (likely hard-coded, not JSON)
    • examples:
      • implicit markers
      • commas around descriptive relative clauses
      • imperatives
      • dynamic/literal expansion
      • 'named' constructions
      • etc.
  2. Textify
    • turns tokens into plain text (not rule-based)
    • examples:
      • [, ], and underscore notes become the empty string
      • tokens with pronouns become just the pronoun
      • pairings become just the complex word
      • senses are removed from lookup tokens
      • etc.
  3. Find/replace
    • regex-based for simple replacements
    • examples:
      • numbers -> words
      • neighboring implicits (remove '>> <<')
      • remove hyphen from hyphenated words
      • etc.
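
A rough sketch of how the three stages above might chain together, written in Python purely for illustration (the editor itself may use a different language). Everything named here is hypothetical: the `Token` fields, the empty `STRUCTURAL_RULES` list, the `-A` style sense suffix, the tiny number table, and the choice to replace '>> <<' with a single space are all assumptions, not the project's actual code.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Token:
    """Hypothetical token shape; the real token type may differ."""
    text: str
    pronoun: Optional[str] = None       # set when the token carries a pronoun
    complex_word: Optional[str] = None  # set when the token is a pairing


Rule = Callable[[List[Token]], List[Token]]

# Stage 1: hard-coded structural rules (implicit markers, imperatives, ...).
# Left empty here; each rule would take the token list and return a new one.
STRUCTURAL_RULES: List[Rule] = []


def apply_structural_rules(tokens: List[Token]) -> List[Token]:
    for rule in STRUCTURAL_RULES:
        tokens = rule(tokens)
    return tokens


# Assumed sense notation: a trailing "-A", "-B", ... on lookup tokens.
SENSE_SUFFIX = re.compile(r"-[A-Z]$")


def textify_token(token: Token) -> str:
    """Stage 2: turn one token into plain text (not rule-based)."""
    if token.text in ("[", "]") or token.text.startswith("_"):
        return ""                        # brackets and underscore notes vanish
    if token.pronoun is not None:
        return token.pronoun             # keep just the pronoun
    if token.complex_word is not None:
        return token.complex_word        # keep just the complex word
    return SENSE_SUFFIX.sub("", token.text)  # drop the sense


def textify(tokens: List[Token]) -> str:
    return " ".join(t for t in map(textify_token, tokens) if t)


# Stage 3: simple regex find/replace on the plain text.
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three"}  # illustrative only

REPLACEMENTS = [
    # numbers -> words (unknown numbers are left as digits)
    (re.compile(r"\b\d+\b"), lambda m: NUMBER_WORDS.get(m.group(), m.group())),
    # neighboring implicits: merge by dropping '>> <<' (assumed: keep a space)
    (re.compile(r">>\s*<<"), " "),
    # remove the hyphen from hyphenated words
    (re.compile(r"(?<=\w)-(?=\w)"), ""),
]


def find_replace(text: str) -> str:
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text


def back_translate(tokens: List[Token]) -> str:
    return find_replace(textify(apply_structural_rules(tokens)))


if __name__ == "__main__":
    example = [
        Token("["),
        Token("Paul", pronoun="he"),
        Token("see-A"),
        Token("2"),
        Token("follower", complex_word="disciple"),
        Token("_note"),
        Token("]"),
    ]
    print(back_translate(example))
```

Keeping stage 2 as a plain per-token function and stage 3 as an ordered list of regex pairs means only stage 1 needs to understand structure, which matches the split proposed above.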