teslacool / SCA

Soft Contextual Data Augmentation

multilingual engine with SCA #9

Closed nicolabertoldi closed 5 years ago

nicolabertoldi commented 5 years ago

@teslacool

I would like to use your software in a multilingual environment.

In practice, I would like to train one system for translating from English into both Spanish and Italian. I already have these systems working with a standard transformer architecture. To do this, I followed a fairly standard procedure: I add a language flag to the source text to trigger the right target translation (into Spanish or Italian).
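For reference, a minimal sketch of that preprocessing step, assuming tokenized input and a made-up flag format `<2xx>` (the exact token is arbitrary, as long as it is used consistently in training and inference):

```python
def add_language_flag(src_line: str, tgt_lang: str) -> str:
    """Prepend a target-language flag token to a tokenized source sentence."""
    return f"<2{tgt_lang}> {src_line.strip()}"

print(add_language_flag("the cat sleeps", "es"))  # <2es> the cat sleeps
print(add_language_flag("the cat sleeps", "it"))  # <2it> the cat sleeps
```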

In the same way, I can also train one system for translating from Spanish or Italian into English. In this case, no language flags are used; I simply concatenate the Spanish-English and Italian-English training data and let the network do all the work.

I would like to know your thoughts about applying a similar strategy with the lm_translation task (i.e. a transformer plus LM). In the first case (en->{es,it}), the source LM would see the language flag and only English tokens, while the target LM would contain both Spanish and Italian words. In the second case ({es,it}->en), the source LM would contain both Spanish and Italian words, while the target LM would be "standard". Would the LMs be strong enough to "distinguish" between Spanish and Italian tokens? Could the presence of the language flag disturb the quality of the LMs?

Do you see other approaches for creating a SCA multilingual engine (en->{es,it} or {es,it}->en)?

Any suggestions or comments are very welcome.

teslacool commented 5 years ago

Sorry, I am not sure whether the LMs are strong enough to handle two languages.

I think you can run preliminary experiments to test whether an LM can model two languages like Italian and Spanish. You can easily draw a conclusion from the perplexity numbers on the two languages' test sets.

nicolabertoldi commented 5 years ago

@teslacool

I also thought about the preliminary test you mentioned.

The experimental setup should be the following: (A) train two separate LMs, one on Spanish data and one on Italian data; (B) train a single LM on the concatenation of the Spanish and Italian data; then measure the perplexity of each LM on held-out Spanish and Italian test sets.

The winning result would be that in both cases (A and B) the perplexity on both Spanish and Italian is the same.
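As a sanity check for the comparison, a minimal sketch of the perplexity computation (the log-probabilities below are toy values; in practice they would come from scoring the test sets with each trained LM):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy per-token natural-log probabilities standing in for real LM scores.
scores_es_monolingual_lm = [-2.1, -1.8, -2.4, -1.9]  # case A: Spanish LM on Spanish test set
scores_es_joint_lm = [-2.2, -1.9, -2.5, -2.0]        # case B: joint es+it LM on Spanish test set

print(f"case A: {perplexity(scores_es_monolingual_lm):.2f}")
print(f"case B: {perplexity(scores_es_joint_lm):.2f}")
```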

One question, though: since perplexity is related to the vocabulary size, should I use a single vocabulary for both cases?

What do you think?

teslacool commented 5 years ago

I think that is a good point: you should use a shared and consistent dictionary across your experiments.
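A minimal sketch of building such a shared vocabulary from both training corpora (the file names are hypothetical; in practice a joint subword model such as BPE trained on the concatenated data achieves the same effect):

```python
from collections import Counter

def build_shared_vocab(corpus_paths, size=32000):
    """One word-level vocabulary over all corpora, so perplexities
    computed against it are directly comparable across experiments."""
    counts = Counter()
    for path in corpus_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    return [token for token, _ in counts.most_common(size)]

# Hypothetical file names; use the same vocab in both case A and case B.
shared_vocab = build_shared_vocab(["train.es", "train.it"])
```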