francisr opened this issue 6 years ago
To give more context, I'm trying to train a multilingual model, with separate encoders and decoders for each language, so I need to read source and target language ID for each example.
Hello,
In #703, I had to set input modalities for the source features, among others, because they are represented by embeddings. It seems that this is not your case. If I understand correctly, when you generate a sentence pair, you need a third element saying which language the source/target is in. When you've caught this element, you are able to select the right encoder/decoder.
Have you modified `generate_encoded_samples` as well? If you have created a new key for the dictionaries in `generate_samples`, you need to catch it in `generate_encoded_samples`, otherwise it will be ignored.
As far as I know, there is no easy way to add a new dimension (source feature, some meta-info...) to the data. You need to go through the whole pipeline in which your data flows and see where this third dimension gets lost.
> If I understand correctly, when you generate a sentence pair, you need a third element saying which language the source/target is in. When you've caught this element, you are able to select the right encoder/decoder.
Yes exactly.
> Have you modified `generate_encoded_samples` as well? If you have created a new key for the dictionaries in `generate_samples`, you need to catch it in `generate_encoded_samples`, otherwise it will be ignored.
No I haven't. In `generate_samples` I set the info like this: `sample["lang_src"] = [int(lang_src)]`. It gets passed through `generate_encoded_samples`, and I can indeed see these values when I inspect the data that gets stored on disk.
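For reference, a minimal sketch of a generator that yields such samples (the function signature and variable names here are hypothetical; only the `lang_src`/`lang_tgt` keys mirror the actual code described above — the real `generate_samples` on a `Problem` subclass takes `(self, data_dir, tmp_dir, dataset_split)`):

```python
def generate_samples(src_lines, tgt_lines, lang_src, lang_tgt):
    """Hypothetical generator: one sample dict per sentence pair.

    Besides the standard "inputs"/"targets" keys, each dict carries the
    language IDs as single-element int lists, as in the comment above.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        sample = {"inputs": src, "targets": tgt}
        sample["lang_src"] = [int(lang_src)]
        sample["lang_tgt"] = [int(lang_tgt)]
        yield sample
```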
At train time the data seems to be loaded from `Problem.dataset`; when I `print(dataset)` in this function, I get something like:

```
<ParallelMapDataset shapes: {inputs: (?,), targets: (?,)}, types: {inputs: tf.int64, targets: tf.int64}>
```
I would also expect `"lang_src"` to be there.

Theoretically, if the sample has been written to disk with your `"lang_src"` key, it should also be loaded during training. Are you positive that it is in `text_problems.text2text_generate_encoded`?
Yes, when I add a `print` before the `yield` in this function, I get:

```
{'inputs': [7, 8, 6, 106, 20, 79, 264, 23, 283, 154, 14, 256, 5, 275, 22, 33, 240, 85, 50, 75, 87, 89, 83, 53, 39, 315, 336, 5, 117, 50, 353, 43, 89, 84, 53, 2, 1], 'lang_src': [0], 'targets': [242, 21, 122, 13, 164, 18, 73, 21, 152, 31, 3, 37, 22, 11, 24, 85, 34, 32, 143, 87, 88, 83, 56, 10, 254, 25, 26, 29, 42, 3, 120, 11, 24, 43, 88, 84, 56, 2, 1], 'lang_tgt': [1]}
```
Also, when I inspect the record I can see it:

```
python tensor2tensor/data_generators/inspect_tfrecord.py --input_filename=/exp0/mt/exp/multi_lang_problem-0/t2t_data/multi_lang_problem-train-00000-of-00100 --print_all
targets: [242, 21, 122, 13, 164, 18, 73, 21, 152, 31, 3, 37, 22, 11, 24, 85, 34, 32, 143, 87, 88, 83, 56, 10, 254, 25, 26, 29, 42, 3, 120, 11, 24, 43, 88, 84, 56, 2, 1]
lang_src: [0]
inputs: [7, 8, 6, 106, 20, 79, 264, 23, 283, 154, 14, 256, 5, 275, 22, 33, 240, 85, 50, 75, 87, 89, 83, 53, 39, 315, 336, 5, 117, 50, 353, 43, 89, 84, 53, 2, 1]
lang_tgt: [1]
total_sequences: 1
total_input_tokens: 37
total_target_tokens: 39
nonpadding_input_tokens: 37
nonpadding_target_tokens: 39
max_input_length: 37
max_target_length: 39
```
OK, so maybe you do need to set an `input_modality` for the language IDs. You could try modifying `hparams()` and adding something like `p.input_modality["lang_src"] = (registry.Modalities.SYMBOL, some_voc_size)`. The vocabulary size seems to be required.
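A sketch of what that suggestion could look like, assuming the older tensor2tensor modality API where `registry.Modalities.SYMBOL` is the string `"symbol"` and the second tuple element is the vocabulary size (here taken to be the number of languages — verify both against your T2T version). The `SimpleNamespace` stand-in just makes the sketch runnable without tensor2tensor:

```python
from types import SimpleNamespace

NUM_LANGUAGES = 2  # assumed vocabulary size for the language-ID feature

def hparams(defaults, model_hparams=None):
    # Register a symbol modality for each language-ID feature so the IDs
    # get embedded like any other symbol input.
    p = defaults
    p.input_modality["lang_src"] = ("symbol", NUM_LANGUAGES)
    p.input_modality["lang_tgt"] = ("symbol", NUM_LANGUAGES)
    return p

# Stand-in for the real defaults object, just to show the effect:
p = hparams(SimpleNamespace(input_modality={}))
```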
I found the issue: it was `example_reading_spec` that defines this. I had overridden it, but there was a mistake in my code, so `lang_src` and `lang_tgt` were not in `data_fields`.
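For anyone landing here, a sketch of a corrected `example_reading_spec` with all four keys listed. `tf.VarLenFeature` is the TF 1.x feature spec tensor2tensor uses; the fallback branch only keeps the snippet runnable without TensorFlow installed, and the real method is defined on the `Problem` subclass with a `self` argument:

```python
try:
    import tensorflow as tf
    # TF1 exposes VarLenFeature at top level; TF2 moved it to tf.io.
    _VarLen = tf.VarLenFeature if hasattr(tf, "VarLenFeature") else tf.io.VarLenFeature
    _INT64 = tf.int64
except ImportError:  # placeholder spec so the sketch runs without TensorFlow
    _VarLen = lambda dtype: ("VarLenFeature", dtype)
    _INT64 = "int64"

def example_reading_spec():
    data_fields = {
        "inputs": _VarLen(_INT64),
        "targets": _VarLen(_INT64),
        # The two keys that were missing, which made the reader drop them:
        "lang_src": _VarLen(_INT64),
        "lang_tgt": _VarLen(_INT64),
    }
    data_items_to_decoders = None  # use the default decoders
    return data_fields, data_items_to_decoders
```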
I have a follow-up question: is there an easy way to ensure that the dataset is shuffled so that all examples of the same batch have the same `lang_src` and `lang_tgt`?
see #741
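Independently of #741, the generic idea for language-homogeneous batches is to bucket examples by language pair before batching. A framework-free sketch (not tensor2tensor's own batching, and the function name is hypothetical):

```python
from collections import defaultdict

def batches_by_lang_pair(samples, batch_size):
    """Group samples into batches whose members share (lang_src, lang_tgt).

    `samples` is an iterable of dicts carrying "lang_src"/"lang_tgt" keys;
    leftover partial batches are flushed at the end.
    """
    buckets = defaultdict(list)
    for sample in samples:
        key = (tuple(sample["lang_src"]), tuple(sample["lang_tgt"]))
        buckets[key].append(sample)
        if len(buckets[key]) == batch_size:
            yield buckets.pop(key)
    for leftover in buckets.values():
        yield leftover
```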
Description
I'm trying to add features to a translation problem. I have a separate file with one line per input line, each containing two integers. I manage to store the values in the TFRecord by modifying `generate_samples`, but then when I read the data I don't get them. I've tried to get inspiration from this pull request: https://github.com/tensorflow/tensor2tensor/pull/703, but I got lost with input modalities; I'm not sure what I'm supposed to do there.