Interested in your paper

+1. I'm also working on the reproduction of the multimodal model with a parallel attention strategy, while there are some difficulties to make it done. I hope to hear from the authors.

Handling WordPieace, including how to extract/save subtokens (by using t2t?) and how to compose the INI files.
Feature extraction: I'm not sure how to write an INI file that uses the feature from theresnet_v2_50/block4/unit_3/bottleneck_v2/conv3 sublayer in ResNet_v2_50. An INI file mimicking the INI files in test examples gives OOM error.

Here is my INI file:

;; Multimodal Transformer with a parallel attention

[main]
name="transformer"
tf_manager=<tf_manager>
output="examples/output/parallel"
overwrite_output_dir=True
batch_size=32
epochs=1000
train_dataset=<train_data>
val_dataset=<val_data>
trainer=<trainer>
runners=[<runner>]
evaluation=[("target", evaluators.BLEU), ("target_greedy", "target", evaluators.BLEU)]
logging_period=100
validation_period=1000
random_seed=1234

[tf_manager]
class=tf_manager.TensorFlowManager
num_sessions=1
num_threads=4

[image_reader]
class=readers.image_reader.imagenet_reader
prefix="/flickr30k-images"
target_width=224
target_height=224
zero_one_normalization=True

[train_data]
class=dataset.load
series=["source", "target", "images", "source_bpe", "target_bpe"]
data=["examples/data/translation/train.en", "examples/data/translation/train.de", ("examples/data/translation/train_images.txt", <image_reader>), (<wp_preprocess>, "source"), (<wp_preprocess>, "target")]
[val_data]
class=dataset.load
series=["source", "target", "images", "source_bpe", "target_bpe"]
data=["examples/data/translation/val.en", "examples/data/translation/val.de", ("examples/data/translation/val_images.txt", <image_reader>), (<wp_preprocess>, "source"), (<wp_preprocess>, "target")]

[wp_preprocess]
class=processors.wordpiece.WordpiecePreprocessor
vocabulary=<vocabulary>

[vocabulary]
class=vocabulary.from_wordlist
path="examples/data/translation/wordpieces.clean"
contains_header=False
contains_frequencies=False

[inpseq]
class=model.sequence.EmbeddedSequence
name="input"
embedding_size=256
max_length=50
data_id="source_bpe"
vocabulary=<vocabulary>

[encoder]
class=encoders.transformer.TransformerEncoder
name="text_encoder"
input_sequence=<inpseq>
ff_hidden_size=2048
depth=6
n_heads=8
dropout_keep_prob=0.7

[imagenet]
class=encoders.imagenet_encoder.ImageNet
name="imagenet_resnet"
data_id="images"
network_type="resnet_v2_50"
spatial_layer="resnet_v2_50/block4/unit_3/bottleneck_v2/conv3"
slim_models_path="lib/models/research/slim"

[decoder]
class=decoders.transformer.TransformerDecoder
name="decoder"
encoders=[<encoder>,<imagenet>]
dropout_keep_prob=0.5
data_id="target_bpe"
max_output_len=50
vocabulary=<vocabulary>
embedding_size=256
ff_hidden_size=2048
depth=6
n_heads_self=8
n_heads_enc=8
attention_combination_strategy="parallel"

[trainer]
class=trainers.delayed_update_trainer.DelayedUpdateTrainer
batches_per_update=5
l2_weight=1.0e-8
clip_norm=1.0
objectives=[<obj>]
optimizer=<lazyadam_g>

[obj]
class=trainers.cross_entropy_trainer.CostObjective
decoder=<decoder>

[lazyadam_g]
class=tf.contrib.opt.LazyAdamOptimizer
beta1=0.9
beta2=0.98
epsilon=1.0e-9
learning_rate=<decayed_lr>

[decayed_lr]
class=functions.noam_decay
learning_rate=0.2
model_dimension=6
warmup_steps=111

[runner]
class=runners.GreedyRunner
decoder=<decoder>
postprocess=processors.wordpiece.WordpiecePostprocessor
output_series="target_greedy"

and the wordlist file of subtokens is like:

<pad>
<s>
</s>
<unk>
.
a
in
ein
einem
,
...

ufal / neuralmonkey

Interested in your paper #832