sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Automated aligned translation candidates #402

Open jcuenod opened 1 month ago

jcuenod commented 1 month ago

I've just been working with a translation that got pretty low alignment scores on the source text. I researched the language a bit and managed to find related languages that performed better.

This made me wonder whether it would be worth building an alignment "index" so that we can identify clusters, run alignments against samples from each cluster, and find decent candidates automatically. Have you guys solved this problem some other way, or done something like this?
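
As a rough sketch of the idea (not existing silnlp code): given clusters of already-known translations, score the new translation against a few samples from each cluster, then score it against every member of the best-looking cluster only. The function name `find_candidates`, the `score` callable, and the cluster layout are all hypothetical; `score` stands in for whatever alignment scorer is actually used.

```python
from typing import Callable, Dict, List, Tuple


def find_candidates(
    target: str,
    clusters: Dict[str, List[str]],
    score: Callable[[str, str], float],
    samples_per_cluster: int = 2,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank candidate sources for `target` without scoring every project.

    `score(target, project)` is a hypothetical alignment scorer where
    higher means better aligned.
    """
    # Step 1: score only a few samples from each cluster.
    cluster_means: Dict[str, float] = {}
    for name, members in clusters.items():
        sample = members[:samples_per_cluster]
        if not sample:
            continue  # skip empty clusters
        cluster_means[name] = sum(score(target, m) for m in sample) / len(sample)

    if not cluster_means:
        return []

    # Step 2: pick the cluster whose samples aligned best on average.
    best = max(cluster_means, key=cluster_means.get)

    # Step 3: score every member of that cluster and keep the top few.
    ranked = sorted(((score(target, m), m) for m in clusters[best]), reverse=True)
    return [(m, s) for s, m in ranked[:top_k]]
```

The trade-off is that a good candidate sitting in a cluster with poorly aligned samples would be missed, so the sample size per cluster matters.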

ddaspit commented 1 month ago

You should check out the silnlp.alignment.visualize_similarity script. It computes the alignment scores for all project pairs in a country or language family, then generates a hierarchical graph (dendrogram) or a network graph based on the scores. It can also combine all of the scores by language, so that you can visualize the relationships between languages. It is intended to work on the biblical-humanities-corpus, a private repo that contains thousands of Bible translations. We could certainly extend it to support other clustering algorithms. Here is an example of the output:

[Image: india-language-tree]
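
To illustrate how pairwise scores can turn into a hierarchy, here is a minimal sketch of average-linkage agglomerative clustering in plain Python. It is not the code in the visualize_similarity script; the function name `average_linkage` and the input layout (a dict keyed by unordered pairs, with a score for every pair) are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Cluster = Tuple[str, ...]


def average_linkage(
    scores: Dict[FrozenSet[str], float],
) -> List[Tuple[Cluster, Cluster, float]]:
    """Repeatedly merge the two most similar clusters.

    `scores` must contain a similarity for every unordered pair of names.
    Returns the merges in the order they happen, which is the information
    a dendrogram is drawn from.
    """
    names = sorted({name for pair in scores for name in pair})
    clusters: List[Cluster] = [(name,) for name in names]
    merges: List[Tuple[Cluster, Cluster, float]] = []

    def similarity(a: Cluster, b: Cluster) -> float:
        # Average linkage: mean score over all cross-cluster pairs.
        values = [scores[frozenset((x, y))] for x in a for y in b]
        return sum(values) / len(values)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: similarity(*pair))
        merges.append((a, b, similarity(a, b)))
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges
```

Called with a dict such as `{frozenset(("p1", "p2")): 0.9, ...}` (placeholder project names and scores), the first merge returned is the most similar pair, and later merges join progressively less similar groups.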