j0ma / ancestral-decipherment


Notes on first decipherment experiments #2

Open j0ma opened 1 year ago

j0ma commented 1 year ago

Using PanLex alone is not feasible, as the strings are too short. This causes the frequency-encoded "ciphertexts" to be awfully short, and as a result many words get encoded to something like 0 1 2 3 4 5. While this approach might have worked for plain decipherment with tons of data, it's hard to pull off with just a bilingual dictionary.
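
For illustration, a minimal sketch of the degenerate case, assuming "frequency encoding" means ranking each character by its frequency within the string (ties broken by order of first appearance); the actual encoding pipeline in this repo may differ. With dictionary-length strings, almost every character occurs exactly once, so nearly every word collapses to the same 0 1 2 3 ... pattern:

```python
from collections import Counter

def frequency_encode(word: str) -> list[int]:
    # Rank characters by in-string frequency (assumption: per-string counts),
    # most frequent first, ties broken by order of first appearance.
    counts = Counter(word)
    order = sorted(set(word), key=lambda c: (-counts[c], word.index(c)))
    rank = {c: i for i, c in enumerate(order)}
    return [rank[c] for c in word]

# Toy examples: short dictionary entries rarely repeat characters,
# so the encoding carries almost no information.
print(frequency_encode("shalom"))  # [0, 1, 2, 3, 4, 5] -- all characters unique
print(frequency_encode("bagel"))   # [0, 1, 2, 3, 4]
print(frequency_encode("mamele"))  # [0, 2, 0, 1, 3, 1] -- only repeats break the pattern
```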

Understandably, the results were abysmal, with SER over 100%. To fix this, I moved from PanLex to Tatoeba. Those experiments are still running, but they seem to exhibit better loss dynamics (no divergence).

j0ma commented 1 year ago

PanLex overfit (similar romanized or not): [screenshot]

Tatoeba:

Hebrew script: [screenshot]

Romanized (using ibleaman/yiddish): [screenshot]

j0ma commented 1 year ago

Given that using dictionaries alone seems infeasible, would we need parallel data for this?

Or maybe build "fake sentences" by simply concatenating individual words? That way there's no need to hunt for parallel data.
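
A rough sketch of the fake-sentence idea, assuming the dictionary is available as a list of (source word, target word) pairs; the entries, function name, and field layout here are purely illustrative:

```python
import random

def make_fake_sentences(entries, n_sentences=1000, words_per_sentence=8, seed=0):
    """entries: list of (source_word, target_word) pairs from a bilingual dictionary."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_sentences):
        sample = rng.choices(entries, k=words_per_sentence)
        # Concatenate both sides in the same order to form a pseudo-parallel pair.
        src = " ".join(s for s, _ in sample)
        tgt = " ".join(t for _, t in sample)
        pairs.append((src, tgt))
    return pairs

# Toy entries (hypothetical):
entries = [("dog", "hunt"), ("water", "vaser"), ("bread", "broyt")]
for src, tgt in make_fake_sentences(entries, n_sentences=2, words_per_sentence=4):
    print(src, "|||", tgt)
```

Concatenating both sides in the same order keeps the fake pairs trivially word-aligned; it obviously doesn't model real word order, but it does yield much longer "ciphertexts" than single dictionary entries.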

j0ma commented 1 year ago

Both models were trained for 20 epochs on a single V100 GPU with update_freq=4, i.e., training with delayed updates to simulate 4 GPUs.
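
For reference, update_freq=4 amounts to gradient accumulation: gradients from 4 mini-batches are summed before each optimizer step, so one GPU sees the same effective batch size as 4 GPUs. A schematic PyTorch-style loop showing the idea (not fairseq's actual trainer):

```python
def train_epoch(model, optimizer, loader, update_freq=4):
    # Accumulate gradients over `update_freq` mini-batches before stepping,
    # simulating a batch size `update_freq` times larger on a single GPU.
    model.train()
    optimizer.zero_grad()
    for i, (src, tgt) in enumerate(loader):
        loss = model(src, tgt)            # assume the model returns a scalar loss
        (loss / update_freq).backward()   # scale so the summed gradient matches one big batch
        if (i + 1) % update_freq == 0:
            optimizer.step()
            optimizer.zero_grad()
```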

Transformer model from the paper

| SER | Accuracy | Language |
| --- | --- | --- |
| 0.163 | 27.983 | multi |
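
For context on the metric, one common way to compute a symbol error rate is Levenshtein distance normalized by reference length, which can exceed 100% when there are many insertions (as in the abysmal PanLex runs above); the exact SER definition used in these runs may differ:

```python
def edit_distance(hyp, ref):
    # Standard Levenshtein distance over symbol sequences.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(hyp)][len(ref)]

def symbol_error_rate(hyps, refs):
    # Total symbol edits divided by total reference length.
    errors = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    total = sum(len(r) for r in refs)
    return errors / total
```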

Halved learning rate

Halving the learning rate seems to bring gains:

| SER | Accuracy | Language |
| --- | --- | --- |
| 0.139 | 31.326 | multi |