DarkT2014 opened this issue 5 years ago (status: Open)
This suggestion keeps appearing in the logs:
Processing token [阿伦特中心邀请琼根发言的决策] took 3.45979595184 seconds, consider setting Text2TextProblem.max_subtoken_length to a smaller value.
I noticed that the max subtoken length defaults to 200. I am trying to figure out whether that is the number of subtokens (which I would expect to be the vocab size if we are using the subword model) or the maximum length of a word that can be represented by a single subtoken. If it is the latter, that is very large; 200 characters is longer than most words.
Let me know if you figure out how exactly max_subtoken_length is defined, and if it should be changed.
The parameter default is set here:
This is where max_subtoken_length is used.
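If it helps, the current value can also be checked directly on the registered problem object rather than in the source. A minimal sketch, assuming a standard tensor2tensor install where translate_enzh_wmt32k is registered:

```python
from tensor2tensor import problems

# Look up the registered problem and read its vocab-related properties.
enzh = problems.problem("translate_enzh_wmt32k")
print(enzh.max_subtoken_length)  # 200 by default
print(enzh.approx_vocab_size)    # the ~32k target vocab size is a separate knob
```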
Did you figure out how to define max_subtoken_length manually?
Update: I tried changing max_subtoken_length in the source code and rebuilding, but neither a larger nor a smaller value made any difference; the problem still exists. I think the tokenizer may have a problem with Chinese. Is this a bug?
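For what it's worth, rather than editing the library source, max_subtoken_length can usually be overridden by registering a custom problem in a user directory and pointing t2t-datagen at it. A rough sketch, assuming the registered class for translate_enzh_wmt32k is TranslateEnzhWmt32k in tensor2tensor.data_generators.translate_enzh; the subclass name, the my_usr_dir package, and the value 20 are only illustrative:

```python
# my_usr_dir/my_problem.py  (my_usr_dir is a hypothetical --t2t_usr_dir package)
from tensor2tensor.data_generators import translate_enzh
from tensor2tensor.utils import registry


@registry.register_problem
class TranslateEnzhWmt32kShortSubtokens(translate_enzh.TranslateEnzhWmt32k):
  """Same EN-ZH problem, but with a smaller cap on subtoken length."""

  @property
  def max_subtoken_length(self):
    # Arbitrary example value, not a tuned recommendation.
    return 20
```

The user directory also needs an __init__.py that imports this module; the new problem name (derived from the class name, i.e. translate_enzh_wmt32k_short_subtokens) is then passed to t2t-datagen along with --t2t_usr_dir=my_usr_dir.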
Description
I followed the tutorial at https://cloud.google.com/tpu/docs/tutorials/transformer to build a translation model. A MemoryError occurs when I run t2t-datagen with the problem parameter changed to translate_enzh_wmt32k. It also occurs when I run it on my own machine. Is this a bug, or is there something I should do differently?
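For reference, the failure can be reproduced without the t2t-datagen wrapper by generating the data for just this problem. A minimal sketch, assuming a standard tensor2tensor install; the paths are placeholders and the directories must already exist:

```python
from tensor2tensor import problems

# Roughly what t2t-datagen does for a single problem: prepare the raw data in
# tmp_dir, build the subword vocab, and write TFRecords to data_dir.
enzh = problems.problem("translate_enzh_wmt32k")
enzh.generate_data("/path/to/data_dir", "/path/to/tmp_dir")
```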
Environment information
GCP
For bugs: reproduction and error logs