Add AQUA as word alignment engine

johnml1135 commented 1 month ago

This is to add the AQUA missing words assessment to Serval.

Implementation option 1 (fully modular):

Use a new folder: Assessment
Move some stuff from Machine to ServiceToolkit, especially the ClearML stuff
Use the same engine/job combo project

Implementation option 2 (start combining):

Rename Machine folder to Backend/Calculations/NLP_Engines, etc.
Combine Machine Engine and Job into one deployment
Add AQUA assessment to that deployment
Create a few more projects (as needed) to hold the unique aspects of the various engines (translation and assessment).

johnml1135 commented 1 month ago

Do the following to the existing files:

Keep Serval and the other primary folders as is.
Rename Machine as Backend
- Rename Serval.Machine.Shared as Serval.Backend.Shared
- Combine Serval.Machine.EngineServer and Serval.Machine.JobServer into Serval.Backend.BackendServer
- Add Aqua assessments (and future things as well) to that backend server
- Pull out parts of Serval.Backend.Shared into Serval.Backend.Machine - the parts that are specific to translation such as:
  - TranslationEngine
  - TrainSegmentPair
  - Corpus
  - Anything with Smt, Nmt or Thot
- Create a new project called Serval.Backend.Aqua and include
  - The same job, build, ClearML, etc. aspects from Machine as is
  - A new job runner for AQUA word alignment job
  - Two assessments from the data:
    - A formal equivalence assessment, that is, one number per verse/segment
    - A source word assessment, giving a number associated with each source word per verse
- One engine can create both assessments
- Other "reference AQUA engines" can be created and then compared against the primary AQUA engine
- The API layer hands down "use these corpora, here are your reference engines" and the AQUA backend passes back "here are the fully calculated scores to pass to Lynx".
- Lynx then determines the relevance and meaning of the two AQUA metrics

@ddaspit, what do you think?

johnml1135 commented 1 month ago

Ok - we will abandon the assessment API for right now and make a word alignment API.

Call it AQUA enhanced word alignment
Add Thot word alignment
Add the word alignment to the existing EngineServer and JobServer - but keep those docker containers separate.
Add a new Serval.AQUA.WordAlignment project under Machine (if needed - if we do the Z-score in Serval).
Start with planning out the new API layer and adding the machine.py normal word alignment.

johnml1135 commented 1 month ago

@ddaspit, what do you think - the basic refactoring would be:

There are base "Engine", "Corpora", "Job" classes
Inheritance tree:
- Engine - Name, Id, Revision, etc.
  - TrainingEngine (Source and Target Corpus)
    - TranslationEngine - no changes
    - WordAlignmentEngine - no changes
  - AssessmentEngine
- Corpus - a single language and set of files
  - TrainingCorpus - Source and Target sets of files
  - FilteredCorpus - a corpus reference with textIds and ScriptureRef filtering
- Job - state,
  - TrainingBuildJob - IsPersisted, FilteredCorpus etc.
    - TranslationBuildJob - add pretranslations
    - WordAlignmentBuildJob - add word alignments
  - AssessmentJob - FilteredCorpus

Keep all API and database things the same. This is refactoring with 0 other changes. Just get ready for WordAlignment, don't add it yet.
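
For illustration only, the Engine and Job branches of the proposed tree could be sketched roughly as below. This is Python rather than Serval's own code, and every field name, type, and default (for example `state = "Pending"` or the `corpus_ids` list) is an assumption made for the sketch; the corpus types are reduced to a single placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FilteredCorpus:
    """Placeholder: a corpus reference narrowed by text ids (hypothetical fields)."""
    corpus_id: str
    text_ids: List[str] = field(default_factory=list)


@dataclass
class Engine:
    """Base engine: identity only."""
    id: str
    name: str
    revision: int = 1


@dataclass
class TrainingEngine(Engine):
    """Engines trained on source/target corpora (ids are hypothetical)."""
    corpus_ids: List[str] = field(default_factory=list)


@dataclass
class TranslationEngine(TrainingEngine):
    pass


@dataclass
class WordAlignmentEngine(TrainingEngine):
    pass


@dataclass
class AssessmentEngine(Engine):
    pass


@dataclass
class Job:
    """Base job: only carries state."""
    state: str = "Pending"


@dataclass
class TrainingBuildJob(Job):
    is_persisted: bool = False
    filtered_corpora: List[FilteredCorpus] = field(default_factory=list)


@dataclass
class TranslationBuildJob(TrainingBuildJob):
    pretranslations: List[str] = field(default_factory=list)


@dataclass
class WordAlignmentBuildJob(TrainingBuildJob):
    word_alignments: List[str] = field(default_factory=list)


@dataclass
class AssessmentJob(Job):
    filtered_corpora: List[FilteredCorpus] = field(default_factory=list)
```

The point of the split is that code written against `TrainingEngine` or `TrainingBuildJob` would work for both translation and word alignment, while the assessment types share only the common base.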

johnml1135 commented 1 month ago

@ddaspit - How should we represent word alignments at the Serval API layer? Here is the interface that I am assuming:

When training an engine, you can also have Serval align a portion of the training data, or other data
After training, the user can pass a source and target segment (with an optional scripture reference) to be aligned
Data back needs to include:
- The aligned words
- A single metric signifying the quality of the alignment
- Indication of the tokenization of the source and target sentences

Options:

Just take word pairs and a score -> John:Juan:0.89
Just take number pairs and a score -> 7:8:0.89
Add both by using a "|" -> 7|John:8|Juan:0.89

Use JSON to add the tokenization:

{
  "source_tokenization": ["His", "name", "is", "John"],
  "target_tokenization": ... ,
  "alignment": "1:1:0.89, 2:2:0.7546"
}
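
To make the compact `source:target:score` notation concrete, here is a minimal Python sketch of how a client might parse it and map the indices back to tokens. It assumes 0-based token indices and comma-separated entries, neither of which has been decided in this thread. The names `AlignedPair`, `parse_alignment`, and `to_word_pairs` are hypothetical, and the Spanish target tokens are made up for the example.

```python
from typing import List, NamedTuple, Tuple


class AlignedPair(NamedTuple):
    source_index: int
    target_index: int
    score: float


def parse_alignment(text: str) -> List[AlignedPair]:
    """Parse comma-separated 'source:target:score' entries."""
    pairs = []
    for entry in text.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and stray whitespace
        src, tgt, score = entry.split(":")
        pairs.append(AlignedPair(int(src), int(tgt), float(score)))
    return pairs


def to_word_pairs(
    source_tokens: List[str],
    target_tokens: List[str],
    pairs: List[AlignedPair],
) -> List[Tuple[str, str, float]]:
    """Resolve index pairs to the words they point at (0-based, assumed)."""
    return [
        (source_tokens[p.source_index], target_tokens[p.target_index], p.score)
        for p in pairs
    ]


source = ["His", "name", "is", "John"]
target = ["Su", "nombre", "es", "Juan"]  # hypothetical target sentence
pairs = parse_alignment("3:3:0.89, 1:1:0.7546")
words = to_word_pairs(source, target, pairs)
```

Carrying the tokenization next to the index pairs is what lets the word form (`John:Juan:0.89`) be derived on the client, so the API would not need to send both.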

ddaspit commented 1 month ago

You should take a look at the TranslationResult model for inspiration. We will probably want a subset of the properties in that model, specifically SourceTokens, TargetTokens, and Alignment.
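
As a rough sketch of what such a subset might look like, the Python below models a result with only source tokens, target tokens, and an alignment list, then serializes it to JSON. It is not the actual TranslationResult model: the class names `WordAlignmentResult` and `AlignedWordPair`, the snake_case field names, and the shape of each alignment entry are assumptions for illustration, as is the example target sentence.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AlignedWordPair:
    """One aligned token pair with its score (entry shape is assumed)."""
    source_index: int
    target_index: int
    score: float


@dataclass
class WordAlignmentResult:
    """Hypothetical subset: tokens on both sides plus the alignment."""
    source_tokens: List[str]
    target_tokens: List[str]
    alignment: List[AlignedWordPair] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested AlignedWordPair dataclasses
        return json.dumps(asdict(self), ensure_ascii=False)


result = WordAlignmentResult(
    source_tokens=["His", "name", "is", "John"],
    target_tokens=["Su", "nombre", "es", "Juan"],  # hypothetical target sentence
    alignment=[AlignedWordPair(source_index=3, target_index=3, score=0.89)],
)
payload = json.loads(result.to_json())
```

Returning the tokens alongside the alignment covers the "indication of the tokenization" requirement from the question above without needing a separate field for it.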

sillsdev / serval

Add AQUA as word alignment engine #495