tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0
15.33k stars 3.47k forks

Problem translating Chinese into English: some Chinese words appear in the target English translation #111

Closed lqfarmer closed 7 years ago

lqfarmer commented 7 years ago

First of all, thank you very much for releasing such a good toolkit for seq2seq problems. I have been using tensor2tensor to translate Chinese into English, and I ran into two problems; I hope you can give some suggestions or advice. First, when I enlarge the vocabulary size to 300,000 (from the original 32,768), some Chinese words appear in the target English translation at test time. I used the given example, wmt_ende_tokens_32k, and just replaced the source language with my Chinese corpus and the target language with my English corpus. How does this happen?

Second, I got confused by the implementation when reading the following function:


def transformer_prepare_encoder(inputs, target_space, hparams):
  """Prepare one shard of the model for the encoder."""
  # Flatten inputs.
  ishape_static = inputs.shape.as_list()
  encoder_input = inputs
  encoder_padding = common_attention.embedding_to_padding(encoder_input)
  encoder_self_attention_bias = common_attention.attention_bias_ignore_padding(
      encoder_padding)
  # Append target_space_id embedding to inputs.
  emb_target_space = common_layers.embedding(
      target_space, 32, ishape_static[-1], name="target_space_embedding")
  emb_target_space = tf.reshape(emb_target_space, [1, 1, -1])
  encoder_input += emb_target_space
  if hparams.pos == "timing":
    encoder_input = common_attention.add_timing_signal_1d(encoder_input)
  return (encoder_input, encoder_self_attention_bias, encoder_padding)


In particular, in these lines:


emb_target_space = common_layers.embedding(
    target_space, 32, ishape_static[-1], name="target_space_embedding")
emb_target_space = tf.reshape(emb_target_space, [1, 1, -1])
encoder_input += emb_target_space


I can't understand why you need to add the target embedding to the input when preparing data for the encoder; I couldn't find any clue in the original paper. I feel that this is related to my first question. Also, what does the common_layers.embedding function do, and especially, why did you choose the number 32? Thank you very much; I am looking forward to your reply.

lukaszkaiser commented 7 years ago

Hi @lqfarmer , great to hear from you! To your questions:

(1) About emb_target_space: this is a single token, one of fewer than 32 currently (that's where the 32 comes from), which tells the model which language to translate to. It's only useful in multi-language training, where the model needs to know which language to produce. You can find the list of languages/problems we tried here: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/problem_hparams.py#L161

If you're doing only 1 language, you don't need to worry about that.
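
To make this concrete, here is a minimal, hypothetical sketch (standalone, written against current TensorFlow rather than the actual t2t code; shapes and ids are assumed) of how a single target-space id gets embedded and broadcast-added to the encoder input:

import tensorflow as tf

hidden_size = 512
target_space_id = tf.constant(4)                         # e.g. the id registered for "English tokens"
encoder_input = tf.random.normal([8, 50, hidden_size])   # [batch, length, hidden_size]

# One embedding row per possible target space; fewer than 32 ids are used in practice.
space_embeddings = tf.Variable(tf.random.normal([32, hidden_size]))
emb_target_space = tf.gather(space_embeddings, target_space_id)  # [hidden_size]
emb_target_space = tf.reshape(emb_target_space, [1, 1, -1])      # broadcastable over batch and length
encoder_input += emb_target_space                                # same shift added at every position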

(2) About vocabulary: we share the vocabulary between source and target language, and the weights between softmax and embedding: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py#L289

This allows the model to copy things from input to output, e.g., copy Chinese names to English when needed. But until the model is good, it might produce Chinese symbols on the English side, yes.

If you're working on Chinese translation, it'd be great if you could make a PR adding the data-set, then I'll be able to help more!

lqfarmer commented 7 years ago

Thank you very much for your suggestions. I still have a question: why did you choose to put the source and target tokens into one vocabulary? Doing that seems to lead to a computation problem: it consumes a lot of memory during training and the batch size can't be set very large, so it's a bit less efficient. I tried separating the source language and the target language into different vocabularies, and it works. But after training, I found a strange situation: a lot of English tokens are not translated correctly. I guess the reason is that the English vocabulary does not work well. Could you give some suggestions? Thank you.

lukaszkaiser commented 7 years ago

Both ways (shared vocabulary and not shared) have their advantages and disadvantages. When it's shared, it's easier for the model to learn to copy proper names, like names of towns or people -- copying is very easy. The disadvantage, as you said, is in the size: in the worst case you can double your softmax size. But it's only the softmax (embeddings are cheap, as they are just a tf.gather). In the distributed setting, the softmax is also sharded, and with word-pieces a vocabulary of 32k is often enough for most things, as it contains all the key pieces anyway. So it's a tradeoff; both choices are reasonable. A single vocabulary also helps in multi-lingual models, which was the decisive point for us.
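
As an illustration of the sharing being described here, a minimal, hypothetical sketch of tying the embedding and the softmax weights (not the actual SymbolModality code; names and sizes are assumed):

import tensorflow as tf

vocab_size, hidden_size = 32768, 512
shared_weights = tf.Variable(
    tf.random.normal([vocab_size, hidden_size], stddev=hidden_size ** -0.5))

def embed(token_ids):
  # Bottom of the model: ids -> vectors, just a cheap gather from the shared matrix.
  return tf.gather(shared_weights, token_ids)

def logits(decoder_output):
  # Top of the model: vectors -> vocabulary logits, reusing the very same matrix.
  flat = tf.reshape(decoder_output, [-1, hidden_size])
  return tf.matmul(flat, shared_weights, transpose_b=True)  # [batch * length, vocab_size]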

lqfarmer commented 7 years ago

I got it, thank you very much.

cshanbo commented 7 years ago

Hi @lukaszkaiser , I ran my experiments on the Chinese-English (zh-en) translation task, too. If I set up my experiments with a shared vocabulary, it works just fine. But just as you and @lqfarmer said, Chinese characters may appear on the English side, which is not reasonable for zh-en tasks. So I instead defined the problem like this:

def translate_zhen(model_hparams):
  """Chinese to English translation benchmark."""
  p = default_problem_hparams()
  # This vocab file must be present within the data directory.
  source_vocab_filename = os.path.join(model_hparams.data_dir, "vocab.zh")
  target_vocab_filename = os.path.join(model_hparams.data_dir, "vocab.en")
  source_token = text_encoder.TokenTextEncoder(vocab_filename=source_vocab_filename)
  target_token = text_encoder.TokenTextEncoder(vocab_filename=target_vocab_filename)
  p.input_modality = {"inputs": (registry.Modalities.SYMBOL, source_token.vocab_size)}
  p.target_modality = (registry.Modalities.SYMBOL,
                       target_token.vocab_size)
  p.vocabulary = {
      "inputs": source_token,
      "targets": target_token,
  }
  p.loss_multiplier = 1.4
  p.input_space_id = 16
  p.target_space_id = 4
  return p

Here, input_modality is set to source_token.vocab_size, while target_modality is set to target_token.vocab_size.

With this setting, if shared_embedding_and_softmax_weights is set to 1 (the default), the code raises an exception like:

INFO:tensorflow:Performing local training.
INFO:tensorflow:datashard_devices: ['gpu:0', 'gpu:1', 'gpu:2', 'gpu:3', 'gpu:4', 'gpu:5', 'gpu:6']
INFO:tensorflow:caching_devices: None
Traceback (most recent call last):
  File "../../tensor2tensor/bin//t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "../../tensor2tensor/bin//t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/search/odin/chengshanbo/git/sogou/tensor2tensor-speech/tensor2tensor/bin/../../tensor2tensor/utils/trainer_utils.py", line 240, in run
    run_locally(exp_fn(output_dir))
  File "/search/odin/chengshanbo/git/sogou/tensor2tensor-speech/tensor2tensor/bin/../../tensor2tensor/utils/trainer_utils.py", line 531, in run_locally
    exp.train()
  File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-

xxxxxxxxxxxxxxxxxxxxxx skipped xxxxxxxxxxxxxxxxxxxxxxxxxxxx

  File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 360, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/search/odin/chengshanbo/git/sogou/tensor2tensor-speech/tensor2tensor/bin/../../tensor2tensor/utils/expert_utils.py", line 260, in DaisyChainGetter
    var = getter(name, *args, **kwargs)
  File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
    use_resource=use_resource)
  File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 682, in _get_single_variable
    "VarScope?" % name)
ValueError: Variable symbol_modality_50001_1024/shared/weights_0 does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?

When shared_embedding_and_softmax_weights is set to 0, the exception disappears. I'm wondering if there's anything I forgot to change in modalities.py.

What I understand is that, in SymbolModality, bottom_simple builds the bottom (embedding) part for the inputs and the targets. The dimensions might be mismatched when using separate vocabularies with shared_embedding_and_softmax_weights == 1.

Is this something like the tied-embedding technique?

Please help.

lukaszkaiser commented 7 years ago

I think what you see is expected: shared_embedding_and_softmax_weights=1 means we want to share all embedding weights: source embeddings = target embeddings = softmax weights. That's not possible when the vocabularies are different, as you say, so we get an error. We could implement sharing just for the targets; we haven't done that yet, though (and I'm not sure it'd help much).
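
For reference, a hedged sketch of the workaround being discussed (assuming the transformer_base hparams set of this t2t version): with separate source and target vocabularies, the sharing flag has to be turned off.

from tensor2tensor.models import transformer

hparams = transformer.transformer_base()
# Full weight sharing cannot work with two different vocabularies, so disable it;
# SymbolModality then builds separate embedding and softmax matrices.
hparams.shared_embedding_and_softmax_weights = 0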

Happy to see you're getting some results for zh-en, how about making a PR with your params?

cshanbo commented 7 years ago

Hi @lukaszkaiser

WMT 2017 introduced a zh-en translation task, which could be added to the tensor2tensor WMT tasks. The differences are:

  1. Chinese sentences need to be word-segmented first; we could use an open-source project like jieba for this.
  2. It might be better to use separate vocabularies for Chinese and English.

I'll try to follow the methodology of wmt_ende_bpe_32k problem.

BTW, I'm thinking about using just part of the data as an example for zh-en translation. Do you think it's a better idea to upload all of the training data, which is about 25 million sentence pairs, to Google Drive?

jekbradbury commented 7 years ago

Chinese almost certainly doesn't need to be word-segmented for a model like this; using characters or a BPE/sentencepiece approach should work just fine. Either way a separate vocabulary is probably not a bad idea.

cshanbo commented 7 years ago

Hi @jekbradbury I'll try to use BPE instead. Thank you for the pointer.

lukaszkaiser commented 7 years ago

I think you should do it the same way as wmt_ende_tokens_32k, i.e., use the internal tokenizer. It's been improved in recent releases, so it should handle Unicode and Chinese well enough. Looking forward to the PR :).

cshanbo commented 7 years ago

Hi @lukaszkaiser Thank you for the pointer.

But I think the Chinese sentences still need to be word-segmented first, because there are no spaces within a Chinese sentence.

A typical Chinese sentence in the training data looks like:

1929年还是1989年? ("1929 or 1989?")

If we don't segment the sentences first, we will get a Chinese vocab like the following (part of the vocab):

'页岩气的出现让能源争论更加扑朔迷离了_'
'页'
'音_'
'音'
'韩国选民会把_'
'韩国总统金大中对前日本首相小渊惠三的讲话予以积极回应_'
'韩国将_'
'韩国和日本之间的冲突_'
'韩国各地各年龄段的选民都欢迎朴槿惠参选总统_'
'韩国保守运动的任何人都不怀疑朴是他们当中的一份子_'
'韩国以及台湾以出口主导型增长实现了经济的快速赶超_'

because each whole sentence is treated as a single word by SubwordTextEncoder.

To summarize, the preprocessing procedure for Chinese should at least include:

  1. word segmentation
  2. tokenization
  3. word piece segmentation
  4. vocabulary generation, etc

Correct me if anything is wrong.
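
As an illustration of step 1, a minimal sketch (the file names are hypothetical) of pre-segmenting raw Chinese with jieba so that the tokenizer sees space-separated tokens instead of whole sentences:

import io
import jieba

# Segment each raw Chinese line into space-separated words before building the vocab.
with io.open("train.zh", encoding="utf-8") as src, \
    io.open("train.seg.zh", "w", encoding="utf-8") as out:
  for line in src:
    out.write(u" ".join(jieba.cut(line.strip())) + u"\n")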

jekbradbury commented 7 years ago

I see. That appears to be a drawback of the SubwordTextEncoder approach. But both traditional BPE and https://github.com/google/sentencepiece should still work.
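
For completeness, a hedged sketch of the sentencepiece route (file and model names are hypothetical): it trains directly on raw, unsegmented Chinese and produces subword pieces without a separate word-segmentation step.

import sentencepiece as spm

# Train a 32k-piece model on raw Chinese text; no prior word segmentation is needed.
spm.SentencePieceTrainer.Train(
    "--input=train.zh --model_prefix=zh_sp32k --vocab_size=32000")

sp = spm.SentencePieceProcessor()
sp.Load("zh_sp32k.model")
print(sp.EncodeAsPieces("1929年还是1989年?"))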

cshanbo commented 7 years ago

Hi @jekbradbury Yes. I think it's a great idea to use sentencepiece.

I noticed that in the current t2t scheme, the wmt_ende_bpe32k experiment uses preprocessed data.

Basically, there are two ways to add a Chinese-English task:

  1. Preprocess all the data first, then upload it to Google Drive, just like t2t did.
  2. Incorporate the preprocessing procedure into the current t2t.

The first way is consistent with the current t2t scheme, while the second one shows how to run a translation task from scratch. I've already written and tested the code, except for the data processing (either the first way or the other one).

If we go with the first way (which I personally prefer), I need some time to download, process, and upload the data;

otherwise, we would need to add extra dependencies like sentencepiece, which might not be appropriate for the t2t code scheme.

lukaszkaiser commented 7 years ago

Sentencepieces are great, but I still believe our simple built-in wordpiece tokenizer will do just as well.

@cshanbo : I believe you don't need to worry about word segmentation. You're right that it'll start with all sentences, barely split. But it'll choose the most commonly occurring ones, which are often single words, then build pieces on top of those, and - in the latest version - include all Unicode characters from the corpus too. So all characters are covered -- and for CJK, models that split on characters rarely do any worse than word-segmented ones, so even that would be enough.

Still, these are all theories. The best way to test them is to run it. Did you try the tensor2tensor pipeline on the enzh corpus data? How does it tokenize afterwards? How does it train? Experiments are the best answer :).

cshanbo commented 7 years ago

The BLEU score for zh-en on WMT, after training 100,000 steps on 1 card with 200,000 training sentence pairs, is only 0.89, which makes me believe it's better to segment the sentences first.

lukaszkaiser commented 7 years ago

@cshanbo : could you make a PR with this dataset? I'd be very curious to try it too; it seems strange that tokenization would matter. Maybe a lower learning rate? If you make a PR, I'll be able to try running it too :).

lqfarmer commented 7 years ago

Hi @lukaszkaiser. Could you tell me what the difference is between the space id of English tokens and English BPE tokens? In my mind, BPE tokens can also be seen as tokens, so there seems to be no need to construct a new set of BPE tokens when you run the ende BPE problems. Is your "vocab.bpe.32000" just a set of BPE tokens? For example, the following is part of my BPE vocabulary:


the 的 . of and 和 to in a 在 不 ) ( on 这 for 上 is ......................


cshanbo commented 7 years ago

Hi @lukaszkaiser I will make a PR later today. Just a couple of notes:

  1. I used the training data training-parallel-nc-v12.tgz and the development data dev.tgz given here.
  2. In dev.tgz, for the zh-en task, there is no ready-to-use plain-text file, only the .sgm files used for mt-eval. We need to convert newsdev2017-zhen-src.zh.sgm and newsdev2017-zhen-ref.en.sgm to plain text first.

lukaszkaiser commented 7 years ago

@lqfarmer : you're right that BPE is almost the same as our subword tokenizer (also sometimes called wordpiece, WPM); but some BPE implementations drop some spaces, don't include letters and don't care about Unicode, so some results might differ. These are usually small things though; sometimes they make no difference at all.

@cshanbo : I'm super happy to hear you're getting good results :). The files you mention are all right, moving from .sgm to text should be very easy, it's just removing tags, right? You could do it directly in python while reading the files, or separately, as you prefer of course. Thanks!

colmantse commented 7 years ago

Hi all, just wondering whether the current version of the zh-en WMT17 task is full size or truncated; its data looks really small compared to ende and enfr -- an order of magnitude smaller than the former and two orders of magnitude smaller than the latter. It is also interesting that enzh_rev's approx BLEU is close to 15 points lower than enzh in a 120k-step setting.

martinpopel commented 7 years ago

@colmantse: The T2T enzh training data consists only of News Commentary v12. But WMT17 also provided links to the UN Parallel Corpus V1.0 (which can be downloaded after free registration) and the CWMT Corpus (where a password for ftp download is provided).

colmantse commented 7 years ago

Thanks @martinpopel. So if I want to train a model on the whole data set, I would need to download it, upload it to Google Drive, and rewrite the link in the enzh problem?

martinpopel commented 7 years ago

@colmantse: you can download the whole data set to your t2t_tmp directory (and possibly pre-process it into one of the supported formats). If T2T finds the files there, it won't try to download them (so you can keep the download URL empty or set it to anything). This way the experiment won't be easily replicable by others, though. Uploading the data to Google Drive and making the download fully automatic would be ideal for T2T users, but I am afraid this is not possible here for legal reasons. Neither UN v1.0 nor CWMT comes with a licence (well, I could not find any licence) allowing redistribution, and I guess there is a reason behind the registration (the UN corpus authors put a lot of work into providing it for free and cleaned up, and I guess the number of registered users may be important for their grant agencies). That said, you can ask the authors and/or implement automatic download of CWMT via ftp from its original location.

colmantse commented 7 years ago

@martinpopel Thank you for your heads up. I will see if I can get it working with the readme.

colmantse commented 7 years ago

@martinpopel A quick question: do I simply leave the link blank and put in the names of the datasets I downloaded myself in order to use them? As in datasets = _ENDE_TRAIN_DATASETS if train else _ENDE_TEST_DATASETS, where:

_ENDE_TRAIN_DATASETS = [
    [
        "http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz",  # pylint: disable=line-too-long
        ("training/news-commentary-v12.de-en.en",
         "training/news-commentary-v12.de-en.de")
    ],
    [
        "http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz",
        ("commoncrawl.de-en.en", "commoncrawl.de-en.de")
    ],
    [
        "http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz",
        ("training/europarl-v7.de-en.en", "training/europarl-v7.de-en.de")
    ],
]

martinpopel commented 7 years ago

You can keep the http... download link empty (or arbitrary), but you need to provide the extracted files ("training/news-commentary-v12.de-en.en" and "training/news-commentary-v12.de-en.de" in the case of ende) in the t2t_tmp directory. You can either edit wmt.py and add the two datasets to _ZHEN_TRAIN_DATASETS, or you can follow the readme, create a new problem (with a unique name) and specify it in my_registrations.py (or whatever file in the directory provided via --t2t_usr_dir).
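
A hypothetical sketch of such a wmt.py edit (the second entry's URL and file names are placeholders, not the real corpus layout): extra entries can point at files you extracted into the tmp directory yourself, since t2t skips the download when the listed files are already there.

_ZHEN_TRAIN_DATASETS = [
    [
        "http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz",
        ("training/news-commentary-v12.zh-en.en",
         "training/news-commentary-v12.zh-en.zh"),
    ],
    [
        # Downloaded manually after registration and extracted into the tmp directory;
        # the URL is never fetched if the two files below are already present.
        "http://example.invalid/un-parallel-v1.tgz",
        ("un_parallel_v1/UNv1.0.en-zh.en", "un_parallel_v1/UNv1.0.en-zh.zh"),
    ],
]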

colmantse commented 7 years ago

thank you very much!

SkyAndCloud commented 6 years ago

@cshanbo @lukaszkaiser Hi, I've read all of your discussion, and I'm using t2t for en-zh translation, too. I want to know whether I need to do Chinese word segmentation before t2t's preprocessing. Have you compared the results of segmented and non-segmented Chinese?

connectdotz commented 5 years ago

hi,

is there information on the expected BLEU? I am wondering how we know the implementation/data/checkpoints are correct if we don't know what the target metrics should be... I would really love to know the approximate BLEU score of, for example, translate_enzh_wmt32k with transformer_base_single_gpu. Has anybody achieved reasonable performance with the current implementation? Can you share some metrics and example translations?