j0ma opened this issue 2 years ago
PanLex overfit (similar whether romanized or not):
Tatoeba:
Hebrew script:
Romanized (using ibleaman/yiddish):
Given that using dictionaries alone seems infeasible, would we need parallel data for this?
Or maybe build "fake sentences" by simply concatenating individual dictionary words? That way there's no need to hunt for parallel data.
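For concreteness, a rough sketch of that idea (assuming the dictionary is just a list of `(source_word, target_word)` pairs; all names below are placeholders, not actual project code):

```python
import random

def make_fake_sentence_pair(dictionary, min_len=3, max_len=10, seed=None):
    """Concatenate random dictionary entries into a pseudo-parallel pair.

    `dictionary` is assumed to be a list of (source_word, target_word)
    tuples, e.g. extracted from PanLex. The same entries are concatenated
    in the same order on both sides, so word alignment is trivially 1:1.
    """
    rng = random.Random(seed)
    n = rng.randint(min_len, max_len)
    entries = [rng.choice(dictionary) for _ in range(n)]
    src = " ".join(s for s, _ in entries)
    tgt = " ".join(t for _, t in entries)
    return src, tgt

if __name__ == "__main__":
    toy_dictionary = [("hunt", "dog"), ("kats", "cat"), ("hoyz", "house")]
    print(make_fake_sentence_pair(toy_dictionary, seed=0))
```

Of course these "sentences" have no real syntax, so this mostly buys longer sequences rather than genuine sentence-level signal.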
Both models were trained for 20 epochs on a single V100 GPU with `update_freq=4`, i.e. training with delayed updates to simulate 4 GPUs.
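(Side note: delayed updates are just gradient accumulation. A minimal, PyTorch-style sketch of the idea follows; `model`, `optimizer`, `batches`, and `loss_fn` are placeholders, not the actual fairseq training loop.)

```python
UPDATE_FREQ = 4  # accumulate gradients over 4 mini-batches before stepping

def train_epoch(model, optimizer, batches, loss_fn):
    """Sketch of delayed updates: step once every UPDATE_FREQ batches,
    which approximates an UPDATE_FREQ-times larger effective batch
    (or that many GPUs)."""
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches, start=1):
        loss = loss_fn(model(x), y) / UPDATE_FREQ  # scale so accumulated grads average
        loss.backward()                            # gradients add up across batches
        if i % UPDATE_FREQ == 0:
            optimizer.step()                       # one "real" parameter update
            optimizer.zero_grad()
```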
| SER | Accuracy | Language |
|-------|----------|----------|
| 0.163 | 27.983 | multi |
Halving the learning rate seems to bring gains:

| SER | Accuracy | Language |
|-------|----------|----------|
| 0.139 | 31.326 | multi |
Using PanLex alone is not feasible, as the strings are too short. This causes the frequency-encoded "ciphertexts" to be awfully short, and as a result many words get encoded to something like `0 1 2 3 4 5`. While this might have been a good idea for plain decipherment with tons of data, it's hard to pull off with just a bilingual dictionary. Understandably, the results were abysmal, with SER over 100%. To fix this, I moved from PanLex to Tatoeba. Those experiments are still running but seem to exhibit better loss dynamics (no divergence).
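To illustrate why short entries are a problem, here is a hypothetical sketch of a first-occurrence frequency encoding (my assumption about what the "ciphertext" roughly looks like, not necessarily the exact scheme used):

```python
def freq_encode(text: str) -> list[int]:
    """Replace each character with the order in which it first appears.

    Assumed/simplified encoding, for illustration only: short inputs whose
    characters are all distinct collapse to 0 1 2 3 ... no matter which
    characters they contain, so a lone dictionary entry carries almost no
    distributional signal.
    """
    order: dict[str, int] = {}
    return [order.setdefault(ch, len(order)) for ch in text]

if __name__ == "__main__":
    print(freq_encode("shalom"))                        # [0, 1, 2, 3, 4, 5]
    print(freq_encode("a whole sentence has repeats"))  # repeated indices survive
```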