Closed: robotzheng closed this issue 6 years ago.
That depends on how the different vocabularies map to each other. The simplest case would be that both models are on the word level, the first min(Vocab1, Vocab2) words are the same, and larger IDs are matched with UNK in the smaller model. In that case, it is simply
pred_src_vocab: ...
pred_trg_vocab: ...
pred_src_vocab2: ...
pred_trg_vocab2: ...
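The truncation mapping described above (shared IDs below min(Vocab1, Vocab2), everything larger mapped to UNK) can be sketched in a few lines of Python. Note the UNK id of 0 is an assumption here; check which id your vocabulary reserves for UNK:

```python
def map_token_id(token_id, vocab_size, unk_id=0):
    """Map a token ID into a smaller vocabulary.

    IDs that fit into the smaller vocabulary are kept as-is;
    larger IDs are matched with UNK, as described above.
    """
    return token_id if token_id < vocab_size else unk_id


# Example: the second model has a 32627-word vocabulary, so an ID
# that only exists in the larger 32761-word vocabulary becomes UNK.
print(map_token_id(5, 32627))      # kept as-is
print(map_token_id(32700, 32627))  # mapped to UNK
```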
More complicated mappings are possible with the idxmap (1:1 mapping) and fsttok (mapping defined with an FST) predictor wrappers.
python decode.py --ignore_sanity_checks True --predictors t2t \
  --src_test ./datasets/ai_challenger_MTEnglishtoChinese_validationset_20180823_21.en \
  --t2t_checkpoint_dir ./models/check \
  --t2t_usr_dir ./models/usr_dir/translate_enzh_AI_context \
  --t2t_model transformer \
  --t2t_problem translate_enzh_context_wmt32k \
  --t2t_hparams_set transformer_big \
  --pred_src_vocab_size 32761 --pred_trg_vocab_size 32627
2018-10-09 17:03:48,952 INFO: Next sentence (ID: 7862): ( E calls us most every night,sobbing. ) Did you know that? - Of course I know that.
2018-10-09 17:03:48,952 ERROR: Number format error at sentence id 7862: invalid literal for int() with base 10: '(', Stack trace: Traceback (most recent call last):
  File "/home/zzt/sgnmt/cam/sgnmt/decode_utils.py", line 849, in do_decode
    src = [int(x) for x in src]
ValueError: invalid literal for int() with base 10: '('
2018-10-09 17:03:48,952 INFO: Next sentence (ID: 7863): ( Where's the car, Buffy? ) That was Brian. He'll be here any sec.
2018-10-09 17:03:48,952 ERROR: Number format error at sentence id 7863: invalid literal for int() with base 10: '(', Stack trace: Traceback (most recent call last):
  File "/home/zzt/sgnmt/cam/sgnmt/decode_utils.py", line 849, in do_decode
    src = [int(x) for x in src]
ValueError: invalid literal for int() with base 10: '('
Why? My model uses word pieces.
The word piece model (and tokenization in general) is not supported. We tried to keep everything that is not directly related to scoring or decoding out of SGNMT, as explained here:
http://ucam-smt.github.io/sgnmt/html/tutorial.html
Tokenization etc. needs to be handled by external tools such as Moses or subword-nmt.
We normally use indexed input files in which each token is replaced by its ID. If you have a T2T data set for your test set, you can use this script to create such a file. There are also --src_wmap and --trg_wmap if you still wish to use readable text, but the word pieces would need to be separated by whitespace.
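A minimal sketch of producing such an indexed input file, assuming a word map with one `word id` pair per line and a reserved UNK id of 0 (both the file format and the UNK id are assumptions here; verify them against your setup):

```python
def load_wmap(path):
    """Load a word map; assumed format: one 'word id' pair per line."""
    wmap = {}
    with open(path) as f:
        for line in f:
            word, idx = line.split()
            wmap[word] = int(idx)
    return wmap


def index_sentence(sentence, wmap, unk_id=0):
    """Replace each whitespace-separated token by its integer ID.

    Unseen tokens fall back to the (assumed) UNK id.
    """
    return [wmap.get(tok, unk_id) for tok in sentence.split()]
```

You would run each line of the plain-text test set through `index_sentence` and write the resulting IDs, blank-separated, to the file passed via --src_test.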
@fstahlberg, I have decoded some sentences, but there are some problems. 1) Decoding is too slow: 20 seconds per sentence. 2) I cannot set parameters like: --hparams="self_attention_type="dot_product_relative",max_relative_position=20"
To 1.) Yes, decoding is considerably slower than e.g. the t2t decoder - you buy flexibility regarding the decoding algorithm and the scoring (predictor constellation) with slower decoding. The difference is smaller for CPU decoding. We usually distribute decoding over a couple of CPU hosts, which brings the decoding time down to an acceptable level. The concept of SGNMT is to prototype new algorithms/scoring schemes and then reimplement the ones that turn out to be useful for production use, as done e.g. by Iglesias et al. That being said, 20 seconds for a single sentence and a single t2t model does seem rather slow; try playing with --single_cpu_thread and --beam.
To 2.) Yes, this is not exposed in SGNMT at the moment; you need to set hparams via t2t hparams sets.
@fstahlberg, I have resolved the second question with "transformer_relative_big" as above, but I can't change other hparams, such as "shared_embedding".
Yeah, at the moment you need to define a t2t hparams set in your t2t usr dir that copies from transformer_relative_big and sets shared_embedding there (as done here).
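As a sketch, such a usr-dir module might look like the following. The set name `transformer_relative_big_shared` is made up, and which hparams exist (including `shared_embedding`) depends on your tensor2tensor version, so treat this as illustrative rather than tested:

```python
# Illustrative file in your --t2t_usr_dir; verify names against
# your installed tensor2tensor version.
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_relative_big_shared():
    """Copy of transformer_relative_big with shared embeddings."""
    hparams = transformer.transformer_relative_big()
    hparams.shared_embedding = True
    return hparams
```

You would then pass --t2t_hparams_set transformer_relative_big_shared to decode.py, alongside your existing --t2t_usr_dir.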
@fstahlberg, thanks. But I still have a question: how do I ensemble context models, where the "src" inputs may be different but the "target" is the same? For example:
src_with_context: ( Hey,babe? ) oh, hey, uh, ben, beth. What A...
original src: oh, hey, uh, ben, beth. What A...
target: 哦,嘿,呃,本·贝丝,真是…
And how do I create an idxmap? More details, please.
If both of your models have different sources you can mask one of them with the altsrc predictor wrapper like this:
predictors: t2t,altsrc_t2t
src_test: ...
altsrc_test: ...
An idxmap example is given in the tutorial using the tutorial data. It is simply a text file with two blank-separated numbers in each line, which defines the mapping.
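For illustration, here is a sketch of reading and applying such a file (the example file contents are made up; see the tutorial for a real idxmap):

```python
# An idxmap file has two blank-separated numbers per line, defining
# a 1:1 ID mapping, e.g. (made-up contents):
#   0 0
#   1 1
#   2 5
#   3 2

def load_idxmap(path):
    """Read a 1:1 ID mapping from an idxmap-style file."""
    idxmap = {}
    with open(path) as f:
        for line in f:
            src_id, trg_id = line.split()
            idxmap[int(src_id)] = int(trg_id)
    return idxmap


def remap(token_ids, idxmap):
    """Translate a list of token IDs through the 1:1 mapping."""
    return [idxmap[i] for i in token_ids]
```

The idxmap predictor wrapper applies exactly such a mapping to reconcile the ID spaces of two models; this sketch only shows the file semantics.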
@fstahlberg, could you give me a code example for t2t and altsrc_t2t with different vocabularies? I'm very confused. Thanks a lot.
src_test and altsrc_test can point to different input files with the same number of lines. One is used as input to the first t2t model, the other one as input to the second t2t model.
I would recommend reading through the tutorial - I hope it makes clearer how predictors, predictor wrappers, and decoders interact in general and how ensembling works. There are more examples with altsrc and idxmap on the examples page.
Closing because of inactivity. Please feel free to reopen if you have further questions.
thanks.