Closed lqfarmer closed 7 years ago
Hi @lqfarmer , great to hear from you! To your questions:
(1) About emb_target_space: this is a single token, one of <32 currently (this is where the 32 comes from), which tells the model which language to translate to. It's only useful in multi-language training, where the model needs to know the target language. You can find the list of languages/problems we tried here:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/problem_hparams.py#L161
If you're doing only 1 language, you don't need to worry about that.
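To make (1) concrete, here is a minimal plain-Python sketch, not the T2T implementation. It assumes a table with 32 rows (one per target-space id) and a toy hidden size of 4; the random table values and the helper name add_target_space are made up for illustration. The same vector is added at every position of the encoder input, so the model can tell which target language is requested.

```python
import random

NUM_TARGET_SPACES = 32  # table size; valid ids are 0..31
HIDDEN_SIZE = 4         # toy value, far smaller than a real model

random.seed(0)
# One vector per target-space id (random stand-ins for learned weights).
target_space_table = [
    [random.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]
    for _ in range(NUM_TARGET_SPACES)
]


def add_target_space(encoder_input, target_space_id):
    """Adds the embedding of target_space_id to every position.

    encoder_input is a list of positions, each a list of HIDDEN_SIZE floats.
    """
    emb = target_space_table[target_space_id]
    return [[x + e for x, e in zip(position, emb)] for position in encoder_input]


# Two positions of an (already embedded) source sentence.
sentence = [[0.0] * HIDDEN_SIZE, [1.0] * HIDDEN_SIZE]
shifted = add_target_space(sentence, target_space_id=4)
```

With a single language pair the id never changes, so the added vector is a constant offset, which is why it can be ignored in that case.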
(2) About vocabulary: we share the vocabulary between source and target language, and the weights between softmax and embedding: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py#L289
This allows the model to copy things from input to output, e.g., copy Chinese names to English when needed. But until the model is good, it might result in Chinese symbols on the English side, yes.
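A rough illustration of the weight sharing in (2), as a plain-Python sketch rather than the code linked above: one matrix serves both as the embedding table (row lookup) and as the softmax projection (dot product with every row). The toy vocabulary and weight values are invented for the example.

```python
# Shared vocabulary: source and target tokens live in one id space.
vocab = ["<pad>", "hello", "world", "北京"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# One weight matrix of shape [vocab_size, hidden_size], used twice.
shared_weights = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def embed(token):
    """Embedding lookup: select one row of the shared matrix."""
    return shared_weights[token_to_id[token]]


def logits(hidden):
    """Softmax input: dot product of the hidden state with every row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in shared_weights]


# If the decoder state equals the embedding of a source token, that token
# gets the highest score in this toy setup, which is why copying is easy.
scores = logits(embed("北京"))
best = vocab[scores.index(max(scores))]
```

The flip side is visible in the same sketch: every source-language token is also a valid output, so an undertrained model can emit Chinese tokens on the English side.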
If you're working on Chinese translation, it'd be great if you could make a PR adding the data-set, then I'll be able to help more!
Thank you very much for your suggestions. I still have a question: why did you choose to put the source and target tokens into one vocabulary? Doesn't that lead to a computation problem? It consumes a lot of memory during training, so the batch size cannot be set large, and it is a bit less efficient. I tried separating the source and target languages into different vocabularies, and it works. But after training, I find a strange situation: a lot of English tokens are not translated correctly. I guess the reason is that the English vocabulary does not work well. Could you give some suggestions? Thank you.
Both ways (shared vocabulary and not-shared) have their advantages and disadvantages. When it's shared, it's easier for the model to learn to copy proper names, like names of towns or people -- copying is very easy. The disadvantage, as you said, is in the size: in the worst case you can double your softmax size. But it's only softmax (embeddings are cheap as it's a tf.gather). In the distributed setting, the softmax is also sharded, and with word-pieces the vocabulary of 32k is often enough for most things, as it contains all the key pieces anyway. So it's a tradeoff, both choices are reasonable. A single vocabulary also helps in multi-lingual models, which was the decisive point for us.
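For a sense of scale, the arithmetic behind this tradeoff can be written down directly. The hidden size of 1024 and the 32000 entries per language are illustrative assumptions, not measured T2T settings.

```python
def softmax_params(vocab_size, hidden_size):
    """Number of weights in a [vocab_size, hidden_size] softmax matrix."""
    return vocab_size * hidden_size


HIDDEN = 1024          # assumed hidden size
PER_LANGUAGE = 32000   # assumed vocabulary size per language

# Separate vocabularies: the softmax only covers the target language.
separate = softmax_params(PER_LANGUAGE, HIDDEN)

# Shared vocabulary, worst case: no overlap, so the softmax doubles.
shared_worst = softmax_params(2 * PER_LANGUAGE, HIDDEN)

# Shared word-piece vocabulary capped at 32k: same size as the separate case.
shared_capped = softmax_params(32000, HIDDEN)

print(separate, shared_worst, shared_capped)
```

The doubling only happens when the shared vocabulary is allowed to grow; capping it at a fixed number of word-pieces keeps the softmax the same size and instead spends the budget across both languages.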
I got it, thank you very much.
Hi @lukaszkaiser , I ran my experiments on the Chinese-English (zh-en) translation task, too. If I set up my experiments with a shared vocabulary, it works just fine. But just as you and @lqfarmer said, Chinese characters might appear on the English side, which is not reasonable for zh-en tasks. So I instead defined the problem like:
def translate_zhen(model_hparams):
  """Chinese to English translation benchmark."""
  p = default_problem_hparams()
  # This vocab file must be present within the data directory.
  source_vocab_filename = os.path.join(model_hparams.data_dir, "vocab.zh")
  target_vocab_filename = os.path.join(model_hparams.data_dir, "vocab.en")
  source_token = text_encoder.TokenTextEncoder(vocab_filename=source_vocab_filename)
  target_token = text_encoder.TokenTextEncoder(vocab_filename=target_vocab_filename)
  p.input_modality = {"inputs": (registry.Modalities.SYMBOL, source_token.vocab_size)}
  p.target_modality = (registry.Modalities.SYMBOL,
                       target_token.vocab_size)
  p.vocabulary = {
      "inputs": source_token,
      "targets": target_token,
  }
  p.loss_multiplier = 1.4
  p.input_space_id = 16
  p.target_space_id = 4
  return p
Here, the input_modality is set to source_token.vocab_size, while the target_modality is set to target_token.vocab_size.
In this setting, if shared_embedding_and_softmax_weights is set to 1 (the default), the code raises an exception like:
INFO:tensorflow:Performing local training.
INFO:tensorflow:datashard_devices: ['gpu:0', 'gpu:1', 'gpu:2', 'gpu:3', 'gpu:4', 'gpu:5', 'gpu:6']
INFO:tensorflow:caching_devices: None
Traceback (most recent call last):
File "../../tensor2tensor/bin//t2t-trainer", line 83, in <module>
tf.app.run()
File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "../../tensor2tensor/bin//t2t-trainer", line 79, in main
schedule=FLAGS.schedule)
File "/search/odin/chengshanbo/git/sogou/tensor2tensor-speech/tensor2tensor/bin/../../tensor2tensor/utils/trainer_utils.py", line 240, in run
run_locally(exp_fn(output_dir))
File "/search/odin/chengshanbo/git/sogou/tensor2tensor-speech/tensor2tensor/bin/../../tensor2tensor/utils/trainer_utils.py", line 531, in run_locally
exp.train()
File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
hooks=self._train_monitors + extra_hooks)
File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
monitors=hooks)
File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-
xxxxxxxxxxxxxxxxxxxxxx skipped xxxxxxxxxxxxxxxxxxxxxxxxxxxx
File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 360, in get_variable
validate_shape=validate_shape, use_resource=use_resource)
File "/search/odin/chengshanbo/git/sogou/tensor2tensor-speech/tensor2tensor/bin/../../tensor2tensor/utils/expert_utils.py", line 260, in DaisyChainGetter
var = getter(name, *args, **kwargs)
File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
use_resource=use_resource)
File "/search/odin/chengshanbo/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 682, in _get_single_variable
"VarScope?" % name)
ValueError: Variable symbol_modality_50001_1024/shared/weights_0 does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?
When I set shared_embedding_and_softmax_weights=0, the exception disappears. I'm wondering if there's anything I forgot to change in modalities.py.
What I understand is that, in SymbolModality, bottom_simple builds the bottom part for inputs and targets. The dimensions might be mismatched when using separate vocabularies with shared_embedding_and_softmax_weights==1.
Is it something like the tied embedding technique?
Please help.
I think what you see is expected: shared_embedding_and_softmax_weights=1 means we want to share all embedding weights: source embeddings = target embeddings = softmax weights. That's not possible when the vocabularies are different, as you say, so we get an error. We could implement sharing just for targets, but we haven't done that yet (and I'm not sure that'd help much).
Happy to see you're getting some results for zh-en, how about making a PR with your params?
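The constraint can be spelled out as a small check. This is a hypothetical helper, not part of T2T: it only encodes that tying source embeddings, target embeddings and softmax weights requires a single matrix, so the two vocabularies must at least have the same size. The size 50001 is taken from the variable name in the traceback above; 30001 is an invented target size.

```python
def check_weight_sharing(source_vocab_size, target_vocab_size, share_weights):
    """Raises ValueError if full weight sharing is requested but impossible."""
    if share_weights and source_vocab_size != target_vocab_size:
        raise ValueError(
            "Cannot share embedding and softmax weights: source vocabulary "
            "has %d entries but target vocabulary has %d."
            % (source_vocab_size, target_vocab_size))
    return share_weights


# Separate zh/en vocabularies of different sizes: sharing must be off.
check_weight_sharing(50001, 30001, share_weights=False)
```

Equal sizes are necessary but not sufficient: for the sharing to be meaningful, both sides have to use the same vocabulary, not just vocabularies of the same length.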
Hi @lukaszkaiser
WMT 2017 opened a zh-en translation task, which can be added to the tensor2tensor wmt tasks.
The differences are:
I'll try to follow the methodology of the wmt_ende_bpe_32k problem.
BTW, I'm thinking about using just part of the data as an example for zh-en translation. Or do you think it's a better idea to upload all the training data, about 25 million sentence pairs, to Google Drive?
Chinese almost certainly doesn't need to be word-segmented for a model like this; using characters or a BPE/sentencepiece approach should work just fine. Either way a separate vocabulary is probably not a bad idea.
Hi @jekbradbury I'll try to use BPE instead. Thank you for the reminder.
I think you should do it the same way as wmt_ende_tokens_32k, i.e., use the internal tokenizer. It's been improved in recent releases, so it should handle Unicode and Chinese well enough. Looking forward to the PR :).
Hi @lukaszkaiser Thank you for the reminder.
But I think the Chinese sentences still need to be word-segmented first because there are no spaces within a Chinese sentence.
A typical Chinese sentence in training data is like
1929年还是1989年?
If we don't segment the sentence first, we will get the Chinese vocab like (part of vocab):
'页岩气的出现让能源争论更加扑朔迷离了_'
'页'
'音_'
'音'
'韩国选民会把_'
'韩国总统金大中对前日本首相小渊惠三的讲话予以积极回应_'
'韩国将_'
'韩国和日本之间的冲突_'
'韩国各地各年龄段的选民都欢迎朴槿惠参选总统_'
'韩国保守运动的任何人都不怀疑朴是他们当中的一份子_'
'韩国以及台湾以出口主导型增长实现了经济的快速赶超_'
because the whole sentence is treated as a single word in SubwordTextEncoder.
In conclusion, the pre-processing procedure for Chinese should at least include:
Correct me if anything is wrong.
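The effect is easy to reproduce in isolation. The sketch below is a simplification: it assumes a pre-tokenizer that only breaks at spaces (the real tokenizer has more rules than that), uses one sentence from the vocabulary listing above, and pairs it with an English rendering that was added here purely for comparison.

```python
sentence_zh = "韩国和日本之间的冲突"
sentence_en = "the conflict between Korea and Japan"

# Splitting on whitespace gives one "word" for Chinese text without spaces...
words_zh = sentence_zh.split()
# ...but several words for the English sentence.
words_en = sentence_en.split()

# Character-level splitting needs no segmenter and no spaces.
chars_zh = list(sentence_zh)

print(len(words_zh), len(words_en), len(chars_zh))
```

So without segmentation the initial "words" on the Chinese side are whole sentences, and anything finer has to come from the subword step or from splitting on characters.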
I see. That appears to be a drawback of the SubwordTextEncoder approach. But both traditional BPE and https://github.com/google/sentencepiece should still work.
Hi @jekbradbury Yes. I think it's a great idea to use sentencepiece.
I noticed that in the current t2t scheme, the wmt_ende_bpe32k experiment used preprocessed data.
Basically there are 2 ways to add the Chinese-English task:
(1) Use preprocessed data, as t2t did.
(2) Do the preprocessing within t2t.
The 1st way is consistent with the current t2t scheme, while the 2nd one shows how to run a translation task from scratch. I've finished writing and testing the code, except for the data processing (the 1st way or the other one).
If we go with the 1st (which I personally prefer), I need some time to download, process, and upload the data;
otherwise, we might need to add other dependencies like sentencepiece, which might not be appropriate for the t2t code scheme.
Sentencepieces are great, but I still believe our simple built-in wordpiece tokenizer will do just as well.
@cshanbo : I believe you don't need to worry about word-segmenting. You're right that it'll start with whole sentences, barely split. But it'll choose the most commonly occurring pieces, which are often single words. And then it'll build pieces on top of that, and - in the latest version - include all Unicode characters from the corpus too. Meaning all characters -- for CJK, models splitting on characters rarely do any worse than word-segmented ones, so even that'd be enough.
Still, these are all theories. The best way to test would be to run it. Did you try the tensor2tensor pipeline on the enzh corpus data? How does it tokenize afterwards? How does it train? Experiments are the best answer :).
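The frequency argument can be illustrated with a toy count. This is not the SubwordTextEncoder algorithm, only a sketch of the intuition, using three made-up unsegmented sentences: whole sentences occur once, while short substrings such as 韩国 recur and therefore survive a frequency threshold.

```python
from collections import Counter

# Unsegmented toy corpus: each sentence is initially one long "token".
corpus = ["韩国将参加", "韩国和日本", "日本将参加"]


def substring_counts(sentences, max_len):
    """Counts every substring of length 1..max_len across all sentences."""
    counts = Counter()
    for s in sentences:
        for start in range(len(s)):
            for end in range(start + 1, min(len(s), start + max_len) + 1):
                counts[s[start:end]] += 1
    return counts


counts = substring_counts(corpus, max_len=5)

# Keep only pieces that occur at least twice, as a frequency-based
# vocabulary builder would tend to prefer.
frequent = {piece for piece, n in counts.items() if n >= 2}
```

Whether this carries over to a real corpus is exactly what the experiment would have to show; the sketch only explains why full sentences are unlikely to dominate the final vocabulary.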
The BLEU score of zhen on wmt, trained for 100000 steps on 1 card with 200,000 training examples, is only 0.89, which makes me believe it's better to segment the sentences first.
@cshanbo : could you make a PR with this dataset? I'd be very curious to try it too, it seems strange that tokenization would matter. Maybe a lower learning rate? If you make a PR, I'll be able to try running it too :).
Hi, @lukaszkaiser. Could you tell me what the difference is between the space id of English tokens and that of English BPE tokens? In my mind, BPE tokens can also be seen as tokens, so there is no need to construct a separate one for BPE tokens. When you do the ende BPE problems, is your "vocab.bpe.32000" just a set of BPE tokens? For example, the following is my BPE vocabulary:
the 的 . of and 和 to in a 在 不 ) ( on 这 for 上 is ......................
Hi @lukaszkaiser I will make a PR later today. Just as a reminder:
(1) The training data training-parallel-nc-v12.tgz and the development data dev.tgz are given here.
(2) In dev.tgz, for zhen tasks, there are no ready-to-use plain text files available, only sgm files used for mt-eval. We need to pre-process newsdev2017-zhen-src.zh.sgm and newsdev2017-zhen-ref.en.sgm to plain text first.
@lqfarmer : you're right that BPE is almost the same as our subword tokenizer (also sometimes called wordpiece, WPM); but some BPE implementations drop some spaces, don't include letters and don't care about Unicode, so some results might be different. These are usually small things though, sometimes they might make no difference at all.
@cshanbo : I'm super happy to hear you're getting good results :). The files you mention are all right, moving from .sgm to text should be very easy, it's just removing tags, right? You could do it directly in python while reading the files, or separately, as you prefer of course. Thanks!
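Removing the tags could look roughly like this. It is a sketch that assumes each sentence sits in a `<seg ...>...</seg>` element; the sample document below is invented and much smaller than the real newsdev2017 files, whose exact attributes may differ.

```python
import html
import re

# Matches <seg id="N">text</seg> and captures the text.
_SEG_RE = re.compile(r"<seg[^>]*>(.*?)</seg>", re.DOTALL)


def sgm_to_lines(sgm_text):
    """Returns the segment texts of an SGM document, one string per segment."""
    return [html.unescape(m.strip()) for m in _SEG_RE.findall(sgm_text)]


# Invented sample; real files contain many <doc> elements.
sample = """<srcset setid="example" srclang="any">
<doc docid="example-doc" genre="news" origlang="zh">
<p>
<seg id="1">1929年还是1989年?</seg>
<seg id="2">A &amp; B</seg>
</p>
</doc>
</srcset>"""

lines = sgm_to_lines(sample)
```

Source and reference files should be converted with the same function so that line N of one stays aligned with line N of the other.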
Hi all, just wondering if the current version of the zhen WMT 17 task is full size or truncated. Its size looks really small compared to ende and enfr: an order of magnitude smaller than the former and two orders of magnitude smaller than the latter. Also, it is interesting to see that enzh_rev's approx BLEU is close to 15 points lower than enzh's in a 120k-step setting.
@colmantse: T2T enzh training data consist just of News Commentary v12. But WMT17 provided also links to the UN Parallel Corpus V1.0 (which can be downloaded after free registration) and CWMT Corpus (where a password for ftp download is provided).
Thanks @martinpopel , so if I need to train a model with the whole data, then I would need to download it and upload to the google drive and rewrite the link at the enzh problem?
@colmantse: you can download the whole data to your t2t_tmp directory (and possibly pre-process it to one of the supported formats). If T2T finds the file there, it won't try to download it (so you can keep the download url empty or whatever). This way the experiment won't be easily replicable by others. Uploading the data to a Google Drive and making the download fully automatic would be ideal for T2T users, but I am afraid this is not possible for legal reasons here. Neither UN v1.0 nor CWMT comes with a licence (well, I could not find any licence) allowing redistribution, and I guess there is a reason behind the registration (the UN corpus authors had a lot of work with providing it for free and cleaned-up, and I guess the number of registered users may be important for their grant agencies). That said, you can ask the authors and/or implement automatic download of CWMT via ftp from its original location.
@martinpopel Thank you for your heads up. I will see if I can get it working with the readme.
@martinpopel A quick question: do I simply leave the link blank and put down the names of the self-downloaded datasets in order to use them?
datasets = _ENDE_TRAIN_DATASETS if train else _ENDE_TEST_DATASETS
_ENDE_TRAIN_DATASETS = [
    [
        "http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz",  # pylint: disable=line-too-long
        ("training/news-commentary-v12.de-en.en",
         "training/news-commentary-v12.de-en.de")
    ],
    [
        "http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz",
        ("commoncrawl.de-en.en", "commoncrawl.de-en.de")
    ],
    [
        "http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz",
        ("training/europarl-v7.de-en.en", "training/europarl-v7.de-en.de")
    ],
]
You can keep the http... download link empty (or arbitrary), but you need to provide the extracted files ("training/news-commentary-v12.de-en.en", "training/news-commentary-v12.de-en.de" in the case of ende) in the t2t_tmp directory. You can either edit wmt.py and add the two datasets to _ZHEN_TRAIN_DATASETS, or you can follow the readme and create a new problem (with a unique name) and specify it in my_registrations.py (or whatever file in the directory provided in --t2t_usr_dir).
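The "use local files if they are already there" behaviour can be pictured with a small standalone sketch. The helper find_local_dataset and the file names are hypothetical and only mirror the idea described above; this is not the T2T download code.

```python
import os
import tempfile


def find_local_dataset(tmp_dir, filenames):
    """Returns full paths if every file is already in tmp_dir, else None."""
    paths = [os.path.join(tmp_dir, name) for name in filenames]
    if all(os.path.exists(p) for p in paths):
        return paths
    return None


# Hypothetical dataset entry: empty URL, files expected in tmp_dir.
entry = ["", ("my_corpus.en-zh.en", "my_corpus.en-zh.zh")]

with tempfile.TemporaryDirectory() as tmp_dir:
    before = find_local_dataset(tmp_dir, entry[1])  # files missing
    for name in entry[1]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write("placeholder\n")
    after = find_local_dataset(tmp_dir, entry[1])  # files present
```

The important detail is that the names in the dataset entry must match the extracted files exactly, including any subdirectory such as training/.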
thank you very much!
@cshanbo @lukaszkaiser Hi, I've read all your discussion and I'm using t2t on en-zh translation, too. I want to know if I need to do Chinese word segmentation before t2t's preprocessing. Have you compared the results of segmented and non-segmented Chinese?
Hi,
is there information on expected BLEU? I am wondering how we can know the implementation/data/checkpoints are correct if we don't know what the target metrics should be...? I would really love to know the approximate BLEU score of, for example, translate_enzh_wmt32k/transformer_base_single_gpu... Has anybody achieved reasonable performance with the current impl? Can you share some metrics and example translations?
First of all, thank you very much for releasing such a good tool for Seq2Seq problems. These days, I have been using tensor2tensor to translate Chinese to English, and I ran into two problems. I hope you can give some suggestions or advice. First, when I enlarge the vocabulary size from the original 32768 to 300000, some Chinese words appear in the target English translation at test time. I used the given example, wmt_ende_tokens_32k, and just replaced the source language with Chinese and the target language with my English corpus. How does this happen?
Second, I got confused about the implementation when I saw the following function:
def transformer_prepare_encoder(inputs, target_space, hparams):
  """Prepare one shard of the model for the encoder."""
  # Flatten inputs.
  ishape_static = inputs.shape.as_list()
  encoder_input = inputs
  encoder_padding = common_attention.embedding_to_padding(encoder_input)
  encoder_self_attention_bias = common_attention.attention_bias_ignore_padding(
      encoder_padding)
  # Append target_space_id embedding to inputs.
  emb_target_space = common_layers.embedding(
      target_space, 32, ishape_static[-1], name="target_space_embedding")
  emb_target_space = tf.reshape(emb_target_space, [1, 1, -1])
  encoder_input += emb_target_space
  if hparams.pos == "timing":
    encoder_input = common_attention.add_timing_signal_1d(encoder_input)
  return (encoder_input, encoder_self_attention_bias, encoder_padding)
In the following code:
  emb_target_space = common_layers.embedding(
      target_space, 32, ishape_static[-1], name="target_space_embedding")
  emb_target_space = tf.reshape(emb_target_space, [1, 1, -1])
  encoder_input += emb_target_space
I can't understand: why do you need to add the target embedding to the input when preparing data for the encoder? I can't find any clue in the original paper. I feel that this is related to my first question. Also, what does the common_layers.embedding function do, and in particular, why did you choose the number 32? Thank you very much, I am looking forward to your reply.