Bible Machine Alignments

johnml1135 commented 3 years ago

This issue tracks progress toward making hundreds, if not thousands, of Bible translations with machine alignments generally available. This has a few audiences:

multi-corpora offline analysis
Seeds for ML work
Starting point for alignment to be used for enhanced resources

All Bibles will be aligned to Greek and Hebrew, as in the Paratext projects.

Brainstorming implementation:

Process all Bibles as per a simple json file in the format:
- {alignment:fast-align { en-ulb: "all", es-ulb: "hi-ulb, fr-ulb"} }
Word by word alignment.
Store this JSON file and alignment_records.csv in a GitHub repo.
- the README.md has a description of the content as well as a massive table of links to all the alignments and overall alignment scores
- The alignments are stored in a public S3 bucket (just like Door43)
Create alignment_records.csv that contains:
- Source and target languages
- Alignment method
- overall alignment score
- timestamp
- S3 bucket link for alignment per verse
- S3 bucket link for alignment scores per verse
This data will normally be created on a local PC
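
The pair specification above is not valid JSON as written. Below is a minimal sketch of one way it could be expressed and expanded into concrete Bible pairs. The `"pairs"` key, the `expand_pairs` helper, and the list of available Bibles are all hypothetical names chosen for illustration, not anything that exists in silnlp:

```python
import json

# Hypothetical, valid-JSON version of the brainstormed format.
CONFIG_TEXT = """
{
  "alignment": "fast-align",
  "pairs": {
    "en-ulb": "all",
    "es-ulb": "hi-ulb, fr-ulb"
  }
}
"""


def expand_pairs(config, available):
    """Return the sorted (source, target) pairs requested by the config."""
    pairs = set()
    for source, targets in config["pairs"].items():
        if targets.strip() == "all":
            # "all" means every other available Bible.
            wanted = [bible for bible in available if bible != source]
        else:
            wanted = [t.strip() for t in targets.split(",") if t.strip()]
        for target in wanted:
            pairs.add((source, target))
    return sorted(pairs)


config = json.loads(CONFIG_TEXT)
available = ["en-ulb", "es-ulb", "fr-ulb", "hi-ulb"]  # assumed inventory
pairs = expand_pairs(config, available)
for source, target in pairs:
    print(config["alignment"], source, target)
```

Keeping the expansion separate from the alignment step means the same list of pairs can later be checked against alignment_records.csv to decide what still needs to be run.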

johnml1135 commented 3 years ago

[x] source bibles: use the already extracted ones
[x] Identify repo to store extracted Bibles: https://github.com/BibleNLP/bible-parallel-corpus-internal
[x] Identify repo to store alignments: new repo called "bible-alignment-internal"
[ ] setup S3 bucket (read-only public)
- [ ] Only put alignments on it, not internal Bibles.
[ ] Generate a script to make 1 alignment as we want it to be made - review it to ensure its suitability
[ ] Generate script in this repo to create all alignments as specified by the json file (that is, if they are not already made)
[ ] Generate 10 or so alignments and put them on GitHub - verify everything is working as expected including:
- [ ] should all these be zip files?
- [ ] licenses
- [ ] can upload alignments and alignment scores per verse to the S3 bucket
- [ ] can download from the S3 bucket
- [ ] only new files are created
- [ ] csv links work
[ ] Develop initial list of Bible pairs for the repo
[ ] Update the json file, run the script, upload to GitHub.
[ ] Release to desired audience
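
A minimal sketch of how alignment_records.csv could be maintained so that only new alignments are created. The column names, the `load_done` and `append_record` helpers, and the example bucket URLs are assumptions made for illustration; the actual S3 upload is left out:

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column names covering the fields listed for alignment_records.csv.
FIELDS = ["source", "target", "method", "overall_score", "timestamp",
          "alignment_url", "scores_url"]


def load_done(csv_path):
    """Return the set of (source, target, method) keys already recorded."""
    path = Path(csv_path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as f:
        return {(row["source"], row["target"], row["method"])
                for row in csv.DictReader(f)}


def append_record(csv_path, record):
    """Append one record unless its key is already present; return True if written."""
    key = (record["source"], record["target"], record["method"])
    if key in load_done(csv_path):
        return False
    path = Path(csv_path)
    write_header = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(record)
    return True


def make_record(source, target, method, score, alignment_url, scores_url):
    """Build a record dict with a UTC timestamp."""
    return {
        "source": source,
        "target": target,
        "method": method,
        "overall_score": score,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "alignment_url": alignment_url,
        "scores_url": scores_url,
    }
```

A script that loops over the requested pairs could call `load_done` once up front and skip any pair whose key is already in the set, which covers the "only new files are created" check.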

johnml1135 commented 3 years ago

First steps:

Align a set of (10?) Bibles to Greek and Hebrew with IBM-4
- Which Greek/Hebrew corpora can we use? Use whatever Clear is using (so we can use Clear Engine) (Freely Licensed):
- Nestle1904
- SBLGNT (Greek New Testament)
- Westminster Leningrad Codex (WLC)
- Different versification between Greek/Hebrew and English. Currently uses English versification
Then selections as needed
Start with word level and then assess - most tools use word tokenized alignments

Audience 1:

General NLP/ML tasks looking for word-by-word translation choices among different languages

Audience 2:

Alignments as drafts for curated alignments used by translators directly

johnml1135 commented 3 years ago

Another update from talking with @jonathanrobie:

He sees the work as very helpful in making enhanced resources
Getting the Greek and Hebrew text
[x] extracted GRK and HEB from Paratext - versification already applied
Initial alignment
[ ] auto-magical alignment using FastAlign
[ ] Analyze alignments (scores and alignments) for some stats
[ ] Find a way to queue up IBM-4 or similar for all alignments

johnml1135 commented 3 years ago

Use this function as a basis: https://github.com/sillsdev/silnlp/blob/dfdb45fe44a0ff625cc153077291671a8c8c8445/silnlp/alignment/utils.py#L73

scores.alignment.txt (new file - scores)
sym-align.txt (alignment)
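
A minimal sketch of reading one line of an alignment file, assuming sym-align.txt stores one verse per line in the Pharaoh `i-j` format that fast_align emits (source index, hyphen, target index). That assumption about the file layout is not confirmed by this thread, and the helper names are illustrative:

```python
def parse_pharaoh(line):
    """Parse a Pharaoh-format line such as '0-0 1-2' into [(0, 0), (1, 2)]."""
    pairs = []
    for token in line.split():
        src, trg = token.split("-")
        pairs.append((int(src), int(trg)))
    return pairs


def aligned_words(src_tokens, trg_tokens, line):
    """Map index pairs back to the (source word, target word) they link."""
    return [(src_tokens[i], trg_tokens[j]) for i, j in parse_pharaoh(line)]


links = aligned_words(
    ["in", "the", "beginning"],
    ["en", "el", "principio"],
    "0-0 1-1 2-2",
)
print(links)
```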

johnml1135 commented 3 years ago

The data is also here: S:\MT\experiments\de-to-en-WMT2020+Bibles_AE\abp-en

ddaspit commented 3 years ago

As a point of reference for how long it should take to align a single translation, I was able to align a translation with ~13000 verses in ~35s on my machine. My machine has an Intel i7-9700KF with 8 cores.

johnml1135 commented 3 years ago

I have a fairly old machine (Intel(R) Core(TM) i7-4800MQ CPU), and it takes 15 minutes if I don't multithread (from start to alignments complete). I wonder if there is a switch for multithreading or not. Those two things together should explain the whole difference (about 4x for the newer processor, 8x for multicore).
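
As a quick sanity check of that estimate, using only the two timings reported in this thread (the 4x and 8x factors are the rough guesses above, not measurements):

```python
fast_seconds = 35        # ~35 s reported on the i7-9700KF
slow_seconds = 15 * 60   # ~15 min reported on the i7-4800MQ

observed = slow_seconds / fast_seconds   # observed slowdown
estimated = 4 * 8                        # newer processor x multicore guess

print(f"observed ~{observed:.1f}x, estimated {estimated}x")
```

The observed ratio is roughly 26x against an estimated 32x, so the two factors are in the right range to account for the gap.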

johnml1135 commented 3 years ago

Need to do:

Save translation probability and alignment probability
- Damien to show example
Use FastAlign
Do 5-10 Bibles, Hebrew and Greek
- Align separately, combine into one file
Put on GitHub

johnml1135 commented 3 years ago

FastAlign documentation: http://mt-class.org/jhu/slides/lecture-ibm-model1.pdf

johnml1135 commented 3 years ago

Bibles:

English literal: en-NASB
English non-literal: en-NIV11
Spanish: es-NTV
Hindi: hi-HINDI-BSI
Korean: ko-RNKSV
Hausa: ha-HAU

Use HMM because it should be better with different typologies.
Do Hebrew and Greek
Add to Google Drive Partnership .../Data/Alignments

johnml1135 commented 3 years ago

Follow up:

Use as drafts for alignments for enhanced resources
Use as input for multilingual named entity recognition
Compare to human-created alignments - are they good enough to use as is?
- Dynamically create alignments for other LWCs and Bible work
Augment versification sniffing
Bidirectional relationship with keyterms or dictionaries
Use as inputs to MT
- First, run alignment. Then create preselected keyterm list. Then use both to do NMT alignment.
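
For the comparison against human-created alignments listed above, one common choice is precision, recall, and F-score over the set of word links. A minimal sketch, assuming both alignments for a verse are available as sets of (source index, target index) pairs; the function name is illustrative:

```python
def alignment_prf(predicted, gold):
    """Return (precision, recall, F1) of predicted links against gold links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


machine = {(0, 0), (1, 1), (2, 3)}
human = {(0, 0), (1, 1), (2, 2), (3, 3)}
precision, recall, f1 = alignment_prf(machine, human)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

This treats every gold link equally; it does not distinguish sure from possible links as alignment error rate does.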

johnml1135 commented 3 years ago

Extract keyterms from Paratext projects - compare them to the translation alignment model
Greek and Hebrew Lemma surface forms
If wanting max quality, how about using a different pivot - Septuagint? NASB?
Versification sniffing:

remove verses that have outlier sentence length distances
build alignment model with remaining verses
... continue versification sniffing...
Align with known languages in our group:
English - French
English - German
Run metrics on alignments - F-Score?
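
The first versification-sniffing step above (dropping verses with outlier length differences before building the alignment model) could be sketched as follows. Whitespace tokenization, the log length ratio, and the z-score threshold are all assumptions picked for illustration, not the method that was actually used:

```python
import math
import statistics


def length_outliers(src_verses, trg_verses, z_threshold=3.0):
    """Return indices of verse pairs whose log length ratio is an outlier."""
    ratios = []
    for src, trg in zip(src_verses, trg_verses):
        s, t = len(src.split()), len(trg.split())
        # Add-one smoothing avoids log(0) and division by zero on empty verses.
        ratios.append(math.log((t + 1) / (s + 1)))
    mean = statistics.fmean(ratios)
    stdev = statistics.pstdev(ratios)
    if stdev == 0:
        return []
    return [i for i, r in enumerate(ratios)
            if abs(r - mean) / stdev > z_threshold]


# Twenty well-matched verse pairs plus one whose target is far too long.
src = ["a b c d e"] * 21
trg = ["v w x y z"] * 20 + [" ".join(["w"] * 50)]
flagged = length_outliers(src, trg)
print(flagged)
```

Verses not flagged here would be the ones passed on to build the alignment model in the next step.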

johnml1135 commented 2 years ago

This work is closing for the time being. Priorities are shifting and there is no present use for it. Flagging potential versification errors ended up being much easier than using this model.

sillsdev / silnlp

Bible Machine Alignments #78