sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Bible Machine Alignments #78

Closed johnml1135 closed 2 years ago

johnml1135 commented 3 years ago

This issue tracks progress toward making hundreds, if not thousands, of Bible translations generally available with machine alignments. This has a few audiences:

Brainstorming implementation:

johnml1135 commented 3 years ago

First steps:

Audience 1:

johnml1135 commented 3 years ago

Another update from talking with @jonathanrobie:

johnml1135 commented 3 years ago

Use this function as a basis: https://github.com/sillsdev/silnlp/blob/dfdb45fe44a0ff625cc153077291671a8c8c8445/silnlp/alignment/utils.py#L73

scores.alignment.txt (new file - scores)
sym-align.txt (alignment)
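For anyone consuming these two files, here is a minimal sketch of reading them side by side. It assumes (this is not confirmed in the thread) that sym-align.txt holds one line per verse in Pharaoh format (space-separated `src-trg` index pairs) and that scores.alignment.txt holds one float per line in the same order. The helper names `parse_pharaoh_line` and `pair_alignments_with_scores` are hypothetical and are not part of silnlp.

```python
from typing import Iterable, List, Tuple

AlignedPairs = List[Tuple[int, int]]


def parse_pharaoh_line(line: str) -> AlignedPairs:
    """Parse one line such as "0-0 1-2 2-1" into (source_index, target_index) pairs.

    ASSUMPTION: the alignment file uses Pharaoh format; an empty line means
    the verse has no aligned words.
    """
    pairs = []
    for token in line.split():
        src, trg = token.split("-")
        pairs.append((int(src), int(trg)))
    return pairs


def pair_alignments_with_scores(
    alignment_lines: Iterable[str], score_lines: Iterable[str]
) -> List[Tuple[AlignedPairs, float]]:
    """Zip each verse's alignment with its score.

    ASSUMPTION: both files have one entry per verse, in the same order.
    """
    return [
        (parse_pharaoh_line(a), float(s))
        for a, s in zip(alignment_lines, score_lines)
    ]


# Illustrative in-memory data standing in for the two files.
example = pair_alignments_with_scores(["0-0 1-2 2-1", ""], ["0.83", "0.0"])
print(example)
```

If the real files use a different layout, only `parse_pharaoh_line` should need to change.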

johnml1135 commented 3 years ago

The data is also here: S:\MT\experiments\de-to-en-WMT2020+Bibles_AE\abp-en

ddaspit commented 3 years ago

As a point of reference for how long it should take to align a single translation, I was able to align a translation with ~13000 verses in ~35s on my machine. My machine has an Intel i7-9700KF with 8 cores.

johnml1135 commented 3 years ago

I have a fairly old machine (Intel(R) Core(TM) i7-4800MQ CPU), and it takes 15 minutes from start to completed alignments if I don't multithread. I wonder whether there is a switch for multithreading. Together, those two factors should explain the whole difference (about 4x for the newer processor, 8x for multicore).
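A quick back-of-envelope check of that estimate, using only the numbers quoted in this thread (the 4x and 8x factors are guesses, not measurements):

```python
# Timings quoted in the thread.
fast_seconds = 35        # ~13000 verses on the i7-9700KF with 8 cores
slow_seconds = 15 * 60   # single-threaded run on the i7-4800MQ

observed_ratio = slow_seconds / fast_seconds

# Guessed factors: ~4x from the newer processor, ~8x from using all cores.
estimated_ratio = 4 * 8

print(f"observed slowdown:  {observed_ratio:.1f}x")
print(f"estimated slowdown: {estimated_ratio}x")
```

The observed ratio is a bit under the 32x estimate, so the two factors are more than enough to account for the gap.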

johnml1135 commented 3 years ago

Need to do:

johnml1135 commented 3 years ago

Fast align documentation: http://mt-class.org/jhu/slides/lecture-ibm-model1.pdf
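For reference, a minimal sketch of IBM Model 1 training with EM, the model the linked slides cover. This is a toy illustration only: it is not the fast_align implementation (fast_align is a reparameterization of IBM Model 2) and not silnlp's code. The function name `train_ibm_model1` and the toy corpus are made up for the example.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate lexical translation probabilities t[(trg, src)] = p(trg | src).

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    A "<NULL>" source token is added so target words can align to nothing.
    """
    trg_vocab = {w for _, trg in pairs for w in trg}
    uniform = 1.0 / len(trg_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform distribution

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (trg, src) links
        total = defaultdict(float)  # expected count of src being linked
        # E-step: distribute each target word over the source words.
        for src, trg in pairs:
            src_null = ["<NULL>"] + list(src)
            for f in trg:
                norm = sum(t[(f, e)] for e in src_null)
                for e in src_null:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


# Toy corpus (illustrative data, not from any Bible translation).
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_ibm_model1(corpus)
candidates = ["the", "house", "book", "a"]
best_for_das = max(candidates, key=lambda f: t[(f, "das")])
print("most likely translation of 'das':", best_for_das)
```

Model 1 ignores word order entirely, which is why HMM-style or fast_align-style models add a position component on top of these lexical probabilities.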

johnml1135 commented 3 years ago

Bibles:

Use HMM because it should be better with different typologies.
Do Hebrew and Greek.
Add to Google Drive Partnership .../Data/Alignments

johnml1135 commented 3 years ago

Follow up:

johnml1135 commented 3 years ago

Extract key terms from Paratext projects and compare them to the translation alignment model.
Greek and Hebrew lemma surface forms.
If we want maximum quality, how about using a different pivot? Septuagint? NASB?
Versification sniffing:

johnml1135 commented 2 years ago

This work is closing for the time being. Priorities are shifting and there is no present use for it. Flagging potential versification errors ended up being much easier than using this model.