I am sorry this issue is not directly related to the project.
In MT, some words/phrases, such as person names and company names, are not translated but copied verbatim from the source sentence. It occurs to me that there could be two approaches:
1. Use a shared vocabulary for both source and target languages. However, on the one hand, the vocabulary could become very large; on the other hand, the MT model may not know which words/phrases need not be translated unless it has seen them in the training set.
2. Use pre-processing, for example, to detect such words/phrases (named entities, rare words, etc.) and replace them with special tokens. I have tried spaCy NER, but it is not accurate enough in practice.
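To make the second approach concrete, here is a minimal sketch of the placeholder idea. It assumes the entity strings are already supplied by some upstream detector (spaCy NER or anything else), and that the MT model has been trained to pass placeholder tokens through unchanged. The function names `mask_entities` and `unmask_entities` and the `<entN>` token format are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple


def mask_entities(sentence: str, entities: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each detected entity with an indexed placeholder token."""
    mapping: Dict[str, str] = {}
    masked = sentence
    # Process longer entities first so that a long name is masked
    # before any shorter name contained in it; ties break alphabetically.
    for entity in sorted(set(entities), key=lambda e: (-len(e), e)):
        if entity not in masked:
            continue
        placeholder = f"<ent{len(mapping)}>"  # hypothetical token format
        mapping[placeholder] = entity
        masked = masked.replace(entity, placeholder)
    return masked, mapping


def unmask_entities(translation: str, mapping: Dict[str, str]) -> str:
    """Put the original entity strings back into the translated sentence."""
    for placeholder, entity in mapping.items():
        translation = translation.replace(placeholder, entity)
    return translation


source = "John Smith joined Acme Corp in 2019."
masked, mapping = mask_entities(source, ["Acme Corp", "John Smith"])
print(masked)  # <ent0> joined <ent1> in 2019.

# Stand-in for the MT system's output; a real model would produce this.
translated = "<ent0> kam 2019 zu <ent1>."
print(unmask_entities(translated, mapping))  # John Smith kam 2019 zu Acme Corp.
```

The quality of the result still depends entirely on the detector: any entity it misses goes through normal translation, which is exactly the accuracy problem I ran into.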
I tried Google Translate and other translation apps and found that, to some extent, their systems can identify the words/phrases that should be copied, though not perfectly. Could someone advise what, in general, is the best solution to this problem? Thanks.