tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

*help* Adding features to a translation problem #783

Open francisr opened 6 years ago

francisr commented 6 years ago

Description

I'm trying to add extra features to a translation problem. I have a separate file with one line per input line, each line containing two integers.
I managed to store the values in the TFRecord by modifying generate_samples, but when I read the data back I don't get them.
I've tried to take inspiration from this pull request: https://github.com/tensorflow/tensor2tensor/pull/703, but I got lost with the input modalities; I'm not sure what I'm supposed to do there.
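
For reference, here's roughly what I'm doing in generate_samples (just a sketch; the problem class name and the file names are made up):

```python
import os

from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class MyMultilingualTranslate(text_problems.Text2TextProblem):
  """Sketch only: class name and file names are placeholders."""

  def generate_samples(self, data_dir, tmp_dir, dataset_split):
    # train.src / train.tgt hold the sentence pairs; train.langs has one
    # line per pair with two integers (source and target language IDs).
    with open(os.path.join(tmp_dir, "train.src")) as srcs, \
         open(os.path.join(tmp_dir, "train.tgt")) as tgts, \
         open(os.path.join(tmp_dir, "train.langs")) as langs:
      for src, tgt, ids in zip(srcs, tgts, langs):
        lang_src, lang_tgt = ids.split()
        yield {
            "inputs": src.strip(),
            "targets": tgt.strip(),
            "lang_src": [int(lang_src)],
            "lang_tgt": [int(lang_tgt)],
        }
```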

francisr commented 6 years ago

To give more context, I'm trying to train a multilingual model with separate encoders and decoders for each language, so I need to read the source and target language IDs for each example.

franckbrl commented 6 years ago

Hello,

In #703, I had to set input modalities for the source features, among other things, because they are represented by embeddings. It seems that this is not your case. If I understand correctly, when you generate a sentence pair, you need a third element saying which language the source/target is in. Once you've caught this element, you are able to select the right encoder/decoder.

Have you modified generate_encoded_samples as well? If you have created a new key for the dictionaries in generate_samples, you need to catch it in generate_encoded_samples; otherwise it will be ignored.
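
Something along these lines, mirroring what text_problems.text2text_generate_encoded does (just a sketch; the class name is a placeholder):

```python
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators import text_problems


class MyMultilingualTranslate(text_problems.Text2TextProblem):

  def generate_encoded_samples(self, data_dir, tmp_dir, dataset_split):
    vocab = self.get_or_create_vocab(data_dir, tmp_dir)
    for sample in self.generate_samples(data_dir, tmp_dir, dataset_split):
      # Encode the text fields as usual...
      sample["inputs"] = vocab.encode(sample["inputs"]) + [text_encoder.EOS_ID]
      sample["targets"] = vocab.encode(sample["targets"]) + [text_encoder.EOS_ID]
      # ...and leave the new integer keys ("lang_src", "lang_tgt") in the
      # dict, so they get written to the TFRecords along with the rest.
      yield sample
```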

As far as I know, there is no easy way to add a new dimension (a source feature, some meta-info...) to the data. You need to go through the whole pipeline your data flows through and see where this third dimension gets lost.

francisr commented 6 years ago

> If I understand correctly, when you generate a sentence pair, you need a third element saying which language the source/target is in. Once you've caught this element, you are able to select the right encoder/decoder.

Yes, exactly.

> Have you modified generate_encoded_samples as well? If you have created a new key for the dictionaries in generate_samples, you need to catch it in generate_encoded_samples; otherwise it will be ignored.

No, I haven't. In generate_samples I set the info like this: sample["lang_src"] = [int(lang_src)]. It gets passed through generate_encoded_samples, and I can indeed see these values when I inspect the data that gets stored on disk.

At train time the data seems to be loaded from Problem.dataset; when I print(dataset) in this function, I get something like:

<ParallelMapDataset shapes: {inputs: (?,), targets: (?,)}, types: {inputs: tf.int64, targets: tf.int64}>

franckbrl commented 6 years ago

I would also expect "lang_src" to be there. Theoretically, if the sample has been written to disk with your "lang_src" key, it should also be loaded during training. Are you positive that it is in text_problems.text2text_generate_encoded?

francisr commented 6 years ago

Yes, when I add a print before the yield in this function, I get:

{'inputs': [7, 8, 6, 106, 20, 79, 264, 23, 283, 154, 14, 256, 5, 275, 22, 33, 240, 85, 50, 75, 87, 89, 83, 53, 39, 315, 336, 5, 117, 50, 353, 43, 89, 84, 53, 2, 1], 'lang_src': [0], 'targets': [242, 21, 122, 13, 164, 18, 73, 21, 152, 31, 3, 37, 22, 11, 24, 85, 34, 32, 143, 87, 88, 83, 56, 10, 254, 25, 26, 29, 42, 3, 120, 11, 24, 43, 88, 84, 56, 2, 1], 'lang_tgt': [1]}

Also when I inspect the record I can see it:

python tensor2tensor/data_generators/inspect_tfrecord.py --input_filename=/exp0/mt/exp/multi_lang_problem-0/t2t_data/multi_lang_problem-train-00000-of-00100 --print_all 
targets: [242, 21, 122, 13, 164, 18, 73, 21, 152, 31, 3, 37, 22, 11, 24, 85, 34, 32, 143, 87, 88, 83, 56, 10, 254, 25, 26, 29, 42, 3, 120, 11, 24, 43, 88, 84, 56, 2, 1]
lang_src: [0]
inputs: [7, 8, 6, 106, 20, 79, 264, 23, 283, 154, 14, 256, 5, 275, 22, 33, 240, 85, 50, 75, 87, 89, 83, 53, 39, 315, 336, 5, 117, 50, 353, 43, 89, 84, 53, 2, 1]
lang_tgt: [1]
total_sequences: 1
total_input_tokens: 37
total_target_tokens: 39
nonpadding_input_tokens: 37
nonpadding_target_tokens: 39
max_input_length: 37
max_target_length: 39

franckbrl commented 6 years ago

OK, so maybe you do need to set an input_modality for the language IDs. You could try modifying hparams() and adding something like p.input_modality["lang_src"] = (registry.Modalities.SYMBOL, some_voc_size). The vocabulary size seems to be required.
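
I.e. something like this (a sketch only; the class name and NUM_LANGUAGES are placeholders, following the modality API as used in #703):

```python
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry

NUM_LANGUAGES = 2  # placeholder "vocabulary size" for the language IDs


class MyMultilingualTranslate(text_problems.Text2TextProblem):

  def hparams(self, defaults, unused_model_hparams):
    super(MyMultilingualTranslate, self).hparams(defaults, unused_model_hparams)
    p = defaults
    # Give each language-ID feature a symbol modality (i.e. an embedding).
    p.input_modality["lang_src"] = (registry.Modalities.SYMBOL, NUM_LANGUAGES)
    p.input_modality["lang_tgt"] = (registry.Modalities.SYMBOL, NUM_LANGUAGES)
```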

francisr commented 6 years ago

I found the issue: it's example_reading_spec that defines this. I had overridden it, but there was a mistake in my code, so lang_src and lang_tgt were not in data_fields.
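
For reference, the corrected override looks roughly like this (tf is TensorFlow 1.x; the class name is a placeholder):

```python
import tensorflow as tf

from tensor2tensor.data_generators import text_problems


class MyMultilingualTranslate(text_problems.Text2TextProblem):

  def example_reading_spec(self):
    data_fields, data_items_to_decoders = super(
        MyMultilingualTranslate, self).example_reading_spec()
    # This was the missing piece: declare the extra features so that
    # Problem.dataset parses them out of the TFRecords.
    data_fields["lang_src"] = tf.VarLenFeature(tf.int64)
    data_fields["lang_tgt"] = tf.VarLenFeature(tf.int64)
    return data_fields, data_items_to_decoders
```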
I have a follow-up question: is there an easy way to ensure that the dataset is shuffled so that all examples in the same batch have the same lang_src and lang_tgt?

martinpopel commented 6 years ago

see #741