viking-sudo-rm / voynich2vec

Applying word2vec embeddings to the problem of deciphering the Voynich manuscript.

Parallel bible corpora #4

Closed chirila closed 6 years ago

chirila commented 6 years ago

http://legacydirs.umiacs.umd.edu/~resnik/parallel/bible.html - includes links to languages

Paper on another project; includes links: http://www.lrec-conf.org/proceedings/lrec2014/pdf/220_Paper.pdf

And another one: https://link.springer.com/article/10.1007%2Fs10579-014-9287-y

And actual data: https://github.com/christos-c/bible-corpus/tree/master/bibles

chirila commented 6 years ago

Note to Will - this is for testing two things:

  1. how well word2vec-like approaches work on small parallel corpora, and
  2. how small an amount of text it works on.

viking-sudo-rm commented 6 years ago

Running these now on grace.hpc.yale.edu.

I've also written a script extract_xml.py that should extract plain text from Perseus and Bible Corpus-formatted XML files.
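
For readers without access to extract_xml.py, here is a minimal sketch of the kind of extraction it might perform. This is not the actual script. It assumes the Bible Corpus layout in which each verse sits in a `<seg>` element with no XML namespace; the function name `extract_verses` and the sample document are made up for illustration, and Perseus files would likely need different tag handling.

```python
import xml.etree.ElementTree as ET


def extract_verses(xml_string):
    """Return the text of every <seg> element, one string per verse.

    Assumption: verses are stored in <seg> elements (Bible Corpus style).
    """
    root = ET.fromstring(xml_string)
    verses = []
    for seg in root.iter("seg"):
        # itertext() also collects text nested inside child elements.
        text = "".join(seg.itertext())
        # Collapse runs of whitespace and drop empty verses.
        text = " ".join(text.split())
        if text:
            verses.append(text)
    return verses


# Hypothetical two-verse sample in the assumed format.
SAMPLE = """<cesDoc><text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse"> In principio creavit Deus caelum et terram. </seg>
<seg id="b.GEN.1.2" type="verse">Terra autem erat inanis et vacua.</seg>
</div></div></body></text></cesDoc>"""

if __name__ == "__main__":
    print("\n".join(extract_verses(SAMPLE)))
```

Writing one verse per line keeps the output usable as plain-text training input for word2vec-style tools.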

viking-sudo-rm commented 6 years ago

An observation:

bible.LA.txt has ~9000 words, whereas bible.GK.txt has ~3000. We'll see how this affects the alignment.
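
A quick way to reproduce counts like these is sketched below. The comment above does not say whether the figures are running words (tokens) or distinct words (types), so the sketch reports both. It assumes whitespace tokenization and lowercasing, which may differ from how the original numbers were obtained; `corpus_stats` is a hypothetical helper name.

```python
from collections import Counter


def corpus_stats(text):
    """Return (token_count, type_count, counts) for whitespace-split text.

    Assumption: tokens are lowercased and split on whitespace only;
    punctuation is not stripped.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    return len(tokens), len(counts), counts


if __name__ == "__main__":
    sample = "in principio erat verbum et verbum erat apud deum"
    n_tokens, n_types, counts = corpus_stats(sample)
    print(n_tokens, n_types, counts.most_common(2))
```

To apply it to the real files, read bible.LA.txt and bible.GK.txt with `open(path, encoding="utf-8").read()` and pass the contents to `corpus_stats`.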

viking-sudo-rm commented 6 years ago

Vanilla run options for MUSE alignment:

python unsupervised.py --src_lang la --tgt_lang gk --src_emb ../models/bibleLatin.vec --tgt_emb ../models/bibleGreek.vec --n_refinement 5 --emb_dim 100 --dis_most_frequent 2804 --dico_max_rank 0 --epoch_size 100000

Things to try:

viking-sudo-rm commented 6 years ago

Running again, but this time aligning Greek into Latin and limiting to the 100 most common vocabulary items in each language:

python unsupervised.py --src_lang gk --tgt_lang la --src_emb ../models/bibleGreek.vec --tgt_emb ../models/bibleLatin.vec --emb_dim 100 --dis_most_frequent 100 --dico_max_rank 0 --epoch_size 100000

Training is underway. So far, the loss numbers look more monotonic.

How to call kalign for this run:

python kalign.py --bible --text gr-la-100 --model bibleLatin --src_lang gk

Update: did not work with dis_most_frequent=100. Trying again with 1000. Will also see what happens when I don't set epoch_size.
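
To see which words actually fall inside a given dis_most_frequent cutoff, one option is to list the first N entries of each .vec file. The sketch below assumes the files are in word2vec text format (a header line with vocabulary size and dimension, then one word per row) and that rows are sorted by descending frequency, as word2vec and fastText normally write them. It does not reproduce how MUSE applies the cutoff internally; `top_n_words` is a hypothetical helper.

```python
def top_n_words(vec_lines, n):
    """Return the first n words from word2vec text-format lines.

    Assumptions: the first line is a "<vocab_size> <dim>" header and
    rows are ordered from most to least frequent.
    """
    lines = iter(vec_lines)
    header = next(lines).split()
    dim = int(header[1])
    words = []
    for line in lines:
        if len(words) >= n:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed rows
        words.append(parts[0])
    return words


if __name__ == "__main__":
    # Tiny made-up embedding file with 3 words of dimension 2.
    sample = ["3 2", "et 0.1 0.2", "in 0.3 0.4", "deus 0.5 0.6"]
    print(top_n_words(sample, 2))
```

For the real models, pass an open file handle, for example `top_n_words(open("../models/bibleGreek.vec", encoding="utf-8"), 100)`, and compare the two lists by eye.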

viking-sudo-rm commented 6 years ago

Result: After trying many different settings of hyperparameters, I am unable to get any meaningful alignment between the Latin and Greek bibles. All results in https://github.com/viking-sudo-rm/voynich2vec/tree/master/alignments/bibles

chirila commented 6 years ago

Limiting to the most frequent words in each language may be causing problems, since Greek has articles and Latin doesn't, and Latin has more case morphology than Greek (so Greek has more common prepositions).

On Fri, Jun 1, 2018 at 12:37 AM, Will Merrill notifications@github.com wrote:

Result: After trying many different settings of hyperparameters, I am unable to get any meaningful alignment between the Latin and Greek bibles. All results in https://github.com/viking-sudo-rm/voynich2vec/tree/master/alignments/bibles


--

Claire Bowern Professor, Director of Graduate Studies Chair: Yale Women Faculty Forum (wff.yale.edu) Department of Linguistics New Haven, CT 06511

chirila commented 6 years ago

Can you do a bunch of "self" alignments for these languages? If we end up with consistent patterns in what types of morphology align, that would be a way to make some guesses about Voynich morphology.


viking-sudo-rm commented 6 years ago

For organizational reasons, closing this issue and opening another one with your suggestion.