ucam-smt / sgnmt

Decoding platform for machine translation research
http://ucam-smt.github.io/sgnmt/html/
Apache License 2.0

Can you give me some decode examples: ensemble tensor2tensor models with different vocabulary? #3

Closed robotzheng closed 6 years ago

robotzheng commented 6 years ago

thanks.

fstahlberg commented 6 years ago

That depends on how the different vocabularies map to each other. The simplest case would be that both models are on the word level, the first min(Vocab1, Vocab2) words are the same, and larger IDs are matched with UNK in the smaller model. In that case, it is simply

pred_src_vocab: ...
pred_trg_vocab: ...

pred_src_vocab2: ...
pred_trg_vocab2: ...

More complicated mappings are possible with the idxmap (1:1 mapping) and fsttok (mapping defined by an FST) predictor wrappers.
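For concreteness, a hypothetical sketch of such a config for the simple case, spelled out with the full option names used in the decode.py command later in this thread (pred_src_vocab_size etc.) and the 2-suffix for the second predictor as above; the exact option names should be checked against decode.py --help:

predictors: t2t,t2t
pred_src_vocab_size: 32761
pred_trg_vocab_size: 32627
pred_src_vocab_size2: ...
pred_trg_vocab_size2: ...

Each predictor additionally needs its own t2t checkpoint, usr dir, model, and problem settings. IDs above the smaller vocabulary are then scored as UNK by that model, as described above.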

robotzheng commented 6 years ago

python decode.py --ignore_sanity_checks True --predictors t2t \
    --src_test ./datasets/ai_challenger_MTEnglishtoChinese_validationset_20180823_21.en \
    --t2t_checkpoint_dir ./models/check \
    --t2t_usr_dir ./models/usr_dir/translate_enzh_AI_context \
    --t2t_model transformer \
    --t2t_problem translate_enzh_context_wmt32k \
    --t2t_hparams_set transformer_big \
    --pred_src_vocab_size 32761 --pred_trg_vocab_size 32627

2018-10-09 17:03:48,952 INFO: Next sentence (ID: 7862): ( E calls us most every night,sobbing. ) Did you know that? - Of course I know that.
2018-10-09 17:03:48,952 ERROR: Number format error at sentence id 7862: invalid literal for int() with base 10: '(', Stack trace: Traceback (most recent call last):
  File "/home/zzt/sgnmt/cam/sgnmt/decode_utils.py", line 849, in do_decode
    src = [int(x) for x in src]
ValueError: invalid literal for int() with base 10: '('

2018-10-09 17:03:48,952 INFO: Next sentence (ID: 7863): ( Where's the car, Buffy? ) That was Brian. He'll be here any sec.
2018-10-09 17:03:48,952 ERROR: Number format error at sentence id 7863: invalid literal for int() with base 10: '(', Stack trace: Traceback (most recent call last):
  File "/home/zzt/sgnmt/cam/sgnmt/decode_utils.py", line 849, in do_decode
    src = [int(x) for x in src]
ValueError: invalid literal for int() with base 10: '('

why?

robotzheng commented 6 years ago

My model uses word pieces.

fstahlberg commented 6 years ago

Word piece models (and tokenization in general) are not supported inside SGNMT. We tried to keep everything which is not directly related to scoring or decoding out of SGNMT, as explained here:

http://ucam-smt.github.io/sgnmt/html/tutorial.html

Tokenization etc. needs to be handled by external tools such as Moses, subword-nmt, etc.

We normally use indexed input files in which each token is replaced by its ID. If you have a T2T data set for your test set, you can use this script to create such a file. There are also --src_wmap and --trg_wmap if you still wish to use readable text, but the word pieces would need to be separated by whitespace.
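A minimal sketch of such a conversion (a hypothetical helper, not a script from the SGNMT or T2T repos; it assumes a word map with one "token id" pair per line and takes the UNK id as a command-line argument - check your own wmap's column order and reserved ids):

# text2ids.py (hypothetical name): convert whitespace-tokenized text to ID sequences.
import sys

def load_wmap(path):
    # Assumed format: one "token id" pair per line, whitespace separated.
    wmap = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                wmap[parts[0]] = parts[1]
    return wmap

if __name__ == "__main__":
    wmap_path, unk_id = sys.argv[1], sys.argv[2]
    wmap = load_wmap(wmap_path)
    for line in sys.stdin:
        # Replace each (already word-piece-segmented) token by its integer ID.
        print(" ".join(wmap.get(tok, unk_id) for tok in line.split()))

Usage, with hypothetical file names:

python text2ids.py src.wmap 3 < test.pieces.en > test.ids.en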

robotzheng commented 6 years ago

@fstahlberg, I have decoded some sentences, but there are some issues:
1. Decoding is too slow, about 20 seconds per sentence.
2. I cannot set parameters like: --hparams="self_attention_type="dot_product_relative",max_relative_position=20"

fstahlberg commented 6 years ago

To 1.) Yes, decoding is considerably slower than e.g. the t2t decoder - you buy flexibility regarding the decoding algorithm and the scoring (predictor constellation) with slower decoding. The difference is smaller for CPU decoding. We usually distribute decoding over a couple of CPU hosts, which brings the decoding time down to an acceptable level. The idea behind SGNMT is to prototype new algorithms and scoring schemes, and to reimplement the ones which turn out to be useful for production use, as done e.g. by Iglesias et al. That being said, 20 seconds for a single sentence and a single t2t model does seem rather slow; try playing with --single_cpu_thread and --beam.
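For example, building on the command earlier in the thread (the values here are just a starting point, not tuned recommendations):

python decode.py ... --single_cpu_thread True --beam 4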

To 2.) Yes, this is not exposed in SGNMT at the moment; you need to set such hparams via t2t hparams sets.

robotzheng commented 6 years ago

@fstahlberg, I have resolved the second question by using "transformer_relative_big" as the hparams set above, but I can't change other hparams, such as "shared_embedding".

fstahlberg commented 6 years ago

Yeah, at the moment you need to define a t2t hparams set in your t2t usr dir which copies transformer_relative_big and sets shared_embedding there (as done here).
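For illustration, a minimal sketch of such an hparams set (the function name is made up and the details may differ from the linked example; it assumes transformer_relative_big is importable from tensor2tensor.models.transformer):

# In a module inside your --t2t_usr_dir:
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry

@registry.register_hparams
def transformer_relative_big_shared_emb():
    # Copy transformer_relative_big and switch on embedding sharing.
    hparams = transformer.transformer_relative_big()
    hparams.shared_embedding = True
    return hparams

You would then pass --t2t_hparams_set transformer_relative_big_shared_emb together with the same --t2t_usr_dir to decode.py.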

robotzheng commented 6 years ago

@fstahlberg, thanks. But I still have a question: how do I ensemble context models where the "src" sides may be different but the "target" is the same? For example:

src_with_context: ( Hey,babe? ) oh, hey, uh, ben, beth. What A...
original src: oh, hey, uh, ben, beth. What A...
target: 哦，嘿，呃，本·贝丝，真是…

robotzheng commented 6 years ago

And how do I create an idxmap? More details, please.

fstahlberg commented 6 years ago

If both of your models have different sources you can mask one of them with the altsrc predictor wrapper like this:

predictors: t2t,altsrc_t2t
src_test: ...
altsrc_test: ...

An idxmap example is given in the tutorial using the tutorial data. It is simply a text file with two whitespace-separated numbers per line, which defines the mapping.
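Purely for illustration (the IDs below are made up; see the tutorial for which column refers to which vocabulary), such a file could look like:

0 0
1 1
2 5
3 4

meaning, for example, that ID 2 in one vocabulary corresponds to ID 5 in the other.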

robotzheng commented 6 years ago

@fstahlberg, could you give me a code example for t2t and altsrc_t2t with different vocabularies? I'm very confused. Thanks a lot.

fstahlberg commented 6 years ago

src_test and altsrc_test can point to different input files with the same number of lines. One is used as input to the first t2t model, the other one as input to the second t2t model.
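Concretely, with hypothetical file names for your context setup, the relevant configuration lines would be:

predictors: t2t,altsrc_t2t
src_test: test.src.ids
altsrc_test: test.src_with_context.ids

where the plain t2t predictor reads src_test and the altsrc-wrapped t2t reads altsrc_test.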

I would recommend reading through the tutorial - I hope this makes it clearer how predictors, predictor wrappers, and decoders interact in general, and how ensembling works. There are more examples with altsrc and idxmap on the examples page.

fstahlberg commented 6 years ago

Closing because of inactivity. Please feel free to reopen if you have further questions.