tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

*help* Skipping long sentences in the t2t-decoder possible? #732

Open e-lectrix opened 6 years ago

e-lectrix commented 6 years ago

Hi,

First of all, many thanks for making this awesome tool available! I managed to train a translation model using the transformer_base hparams set and my own data. My aim is to translate a set of documents. When I apply t2t-decoder to this set of documents, a few of them fail with an OOM error. I traced the problem back to some very long sentences (long lists of comma-separated terms). When I remove these instances, the translation runs smoothly.

My question: is there a way to tell t2t-decoder not to include such sentences in the decoding process and to just print them as-is in the resulting document? I had difficulty identifying parameters in the source code that would allow this.

Obviously, I could remove these sentences beforehand and add them back in a later step, but it would be quite cumbersome to make sure they end up in the right positions in the document.

Many thanks, Matthias

martinpopel commented 6 years ago

Try --hparams="max_length=128,eval_drop_long_sequences=True", or just eval_drop_long_sequences on its own, because the default max_length is the training-time batch size, which may already be enough to prevent the OOM errors. However, I am not sure whether you will be able to identify which sentences were skipped. The log (stderr) also shows the source sentences, so maybe you can use that.

e-lectrix commented 6 years ago

Martin, thank you very much for your quick answer. Unfortunately, I can't make it work with these two parameters (even when decreasing max_length).

My call:

t2t-decoder \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --hparams="max_length=60,eval_drop_long_sequences=True" \
  --decode_hparams="beam_size=4,alpha=$ALPHA" \
  --decode_from_file=$DECODE_FILE \
  --decode_to_file=$DECODE_FILE.DEEN.mixed.out \
  --batch_size=4 \
  --t2t_usr_dir=$USER_DIR

The parameters seem to have been taken into account:

INFO:tensorflow:Importing user module usr_dir from path /myproject
[2018-04-20 14:30:22,169] Importing user module usr_dir from path /myproject
INFO:tensorflow:Overriding hparams in transformer_base with max_length=60,eval_drop_long_sequences=True
[2018-04-20 14:30:22,433] Overriding hparams in transformer_base with max_length=60,eval_drop_long_sequences=True
INFO:tensorflow:schedule=continuous_train_and_eval
[2018-04-20 14:30:22,433] schedule=continuous_train_and_eval

This is the final error output:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[347280,7235]
[[Node: transformer/body/parallel_0/body/encoder/layer_0/self_attention/multihead_attention/dot_product_attention/Softmax = Softmax[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/body/parallel_0/body/encoder/layer_0/self_attention/multihead_attention/dot_product_attention/Reshape)]]
[[Node: transformer/body/parallel_0/body/encoder/layer_5/self_attention/multihead_attention/q/Tensordot/Shape/_1705 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1841_transformer/body/parallel_0/body/encoder/layer_5/self_attention/multihead_attention/q/Tensordot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Any other idea, what I could have a look at? Thank you!

[EDIT] I'm running version 1.5.5 of tensor2tensor; could that be part of the problem as well?

martinpopel commented 6 years ago

Now that I think about it, I am not sure whether eval_drop_long_sequences affects t2t-decoder at all. There are three regimes (with different sessions): training, evaluation, and decoding. During evaluation you also need to decode, but you have the reference translations, so you can cheat with the non-autoregressive fast mode, and it makes sense to allow skipping long sentences there if eval_drop_long_sequences=True. In real decoding you don't have the reference translations (at least t2t does not see them), and most users want to translate all the sentences.

T2T 1.5.5 is OK.

You can filter the text to be translated for sentences shorter than a given number of words with a simple shell/perl/python script. To filter based on the number of subwords instead, you can use something like this (from memory):

from tensor2tensor.data_generators import text_encoder

# Load the subword vocabulary used for training (FLAGS.vocab is the path to the vocab file).
vocab = text_encoder.SubwordTextEncoder(FLAGS.vocab)
# Number of subword tokens the sentence is split into.
n_subwords = len(vocab.encode(string))
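
Building on that snippet, a minimal filtering sketch (the vocab path, file names, and subword limit are assumptions; adjust them to your own data):

from tensor2tensor.data_generators import text_encoder

MAX_SUBWORDS = 60  # assumption: roughly match the max_length used for decoding
vocab = text_encoder.SubwordTextEncoder("data/vocab.translate.32768.subwords")  # hypothetical path

# Split the document into sentences to translate and sentences to skip,
# remembering the original line numbers of the skipped ones.
long_sentences = {}  # line number -> original sentence
with open("input.de") as src, open("input.short.de", "w") as keep:
    for i, line in enumerate(src):
        sentence = line.rstrip("\n")
        if len(vocab.encode(sentence)) <= MAX_SUBWORDS:
            keep.write(sentence + "\n")
        else:
            long_sentences[i] = sentence

input.short.de can then be passed to --decode_from_file, and the sentences held back in long_sentences handled separately (see below).
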
e-lectrix commented 6 years ago

Thank you. I will go ahead and write some logic prior to the decoding step to prevent these cases from happening and crashing my workflow. That should be no problem to implement.
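
For the add-back step, a rough continuation of the filtering sketch above (file names are assumptions; here the skipped long sentences are simply copied through untranslated):

# Assumes long_sentences (line number -> sentence) from the filtering sketch,
# and that t2t-decoder translated input.short.de into output.short.en.
with open("input.de") as src, \
     open("output.short.en") as translated, \
     open("output.en", "w") as out:
    translations = (l.rstrip("\n") for l in translated)
    for i, _ in enumerate(src):
        # Put each skipped sentence back at its original position, and fill
        # the remaining positions with the translations in order.
        out.write((long_sentences[i] if i in long_sentences else next(translations)) + "\n")
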

I do think, though, that this sentence-skipping option would be a nice feature to add to t2t-decoder.

Thanks for your help, much appreciated!

thammegowda commented 4 years ago

Skipping an eval sentence may be a bad idea: dropping a whole sentence hurts the final test score badly. Also, splitting long sentences at the preprocessing stage is good, but it isn't going to solve this perfectly either, since we don't know the true length until after BPE/subword segmentation. In my case, some fairly short sentences expanded into really long sequences after subword splitting.

At the very least, we should be able to truncate the sentence to a certain maximum length instead of skipping it completely.
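
Outside of t2t, that kind of subword-level truncation could be sketched as a preprocessing step like this (the vocab path and the limit are assumptions); the built-in equivalent in decoding.py is described below:

from tensor2tensor.data_generators import text_encoder

MAX_SUBWORDS = 190  # assumption: pick a limit your GPU can handle
vocab = text_encoder.SubwordTextEncoder("data/vocab.translate.32768.subwords")  # hypothetical path

def truncate_to_subwords(sentence, limit=MAX_SUBWORDS):
    # Encode to subword ids, cut at the limit, and decode back to text.
    ids = vocab.encode(sentence)
    return sentence if len(ids) <= limit else vocab.decode(ids[:limit])
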

Looking at the code, https://github.com/tensorflow/tensor2tensor/blob/a0bf3b90b13f75e77fdacf5da025d09309165b92/tensor2tensor/utils/decoding.py#L679-L682 is what we want. It gets the value from https://github.com/tensorflow/tensor2tensor/blob/a0bf3b90b13f75e77fdacf5da025d09309165b92/tensor2tensor/utils/decoding.py#L449-L453

But the default value is set to -1 (meaning: don't truncate): https://github.com/tensorflow/tensor2tensor/blob/a0bf3b90b13f75e77fdacf5da025d09309165b92/tensor2tensor/utils/decoding.py#L64

(How I figured out that decode_hp maps to the --decode_hparams CLI argument via the FLAGS machinery, I still don't know 🤣)

So, this is how we can pass a value from the CLI:

$ t2t-decoder --decode_hparams="max_input_size=190" <other args>

max_input_size=190 worked well for me. You may want to adjust this value depending on your RAM/GPU memory and beam size.