stefan-falk opened this issue 5 years ago
I have now implemented my own version of a multi-problem. I am able to run it, but I haven't had time to run enough experiments to see a positive effect yet (if I am able to at all). I simply wasn't able to get the MultiProblem class to work, so I went top-down in a "happy path"-like manner.

Datasets and sample_from_datasets for sampling

Essentially, what I did was create a list of `Dataset` objects, one per problem, and use `tf.data.experimental.sample_from_datasets` to randomly sample from them according to a constant distribution.
```python
import numpy as np
import tensorflow as tf

def dataset(self,
            mode,
            data_dir=None,
            # ..
            ):
    is_training = mode == tf.estimator.ModeKeys.TRAIN
    datasets = list()
    for task in self.tasks:
        task_dataset = task.dataset(
            mode,
            data_dir=data_dir,
            # ..
        )
        # Prepend the task id to every example so the model can tell tasks apart.
        task_dataset = task_dataset.map(
            lambda example: self.add_task_id(task.task_id, example))
        if is_training:
            # Repeat indefinitely so sampling never exhausts a single task.
            task_dataset = task_dataset.repeat()
        datasets.append(task_dataset)
    # Sample from the per-task datasets according to the normalized task weights.
    sampled_dataset = tf.data.experimental.sample_from_datasets(
        datasets,
        weights=np.asarray(self.task_weights, dtype=np.float64) / sum(self.task_weights)
    )
    return sampled_dataset
```
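To sanity-check the weighting logic without TensorFlow, here is a small framework-free sketch of what `sample_from_datasets` does conceptually: pick a source dataset i.i.d. according to the normalized weights, then take the next element from it. The function name, toy data, and seed handling are made up for illustration.

```python
import random
from itertools import cycle

def sample_from_datasets(datasets, weights, n, seed=0):
    """Draw n examples, choosing the source dataset i.i.d. by weight."""
    rng = random.Random(seed)
    iterators = [cycle(ds) for ds in datasets]      # analogue of .repeat()
    total = sum(weights)
    probs = [w / total for w in weights]            # normalize, as in the snippet above
    picks = rng.choices(range(len(datasets)), weights=probs, k=n)
    return [next(iterators[i]) for i in picks]

# Two toy "datasets" mixed with weights 3:1.
en_de = ["de_0", "de_1"]
en_fr = ["fr_0", "fr_1"]
mixed = sample_from_datasets([en_de, en_fr], weights=[3, 1], n=8)
```

On average three quarters of the sampled examples come from `en_de`; a weight of 0 excludes a dataset entirely.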
I cannot see a particular reason why a language model is required as the primary task. In fact, I think it's more of a burden, but please correct me if there is another reason for this constraint besides the requirement of a shared vocabulary among all tasks. To resolve the shared-vocab issue, I simply create the vocabularies of all sub-tasks (sub-problems) and merge them like so:
```python
def get_or_create_vocab(self, data_dir=None, tmp_dir=None, force_get=False) -> SubwordEncoder:
    logger = tf.logging
    vocabs_dir = os.path.join(data_dir, 'shared-vocabs')
    if not tf.gfile.Exists(vocabs_dir):
        tf.gfile.MakeDirs(vocabs_dir)
    vocab_fp = os.path.join(vocabs_dir, self.vocab_filename)
    # Reuse the merged vocabulary if it has already been written.
    if tf.gfile.Exists(vocab_fp):
        return SubwordEncoder(vocab_fp)
    # Merge the vocabularies of all sub-problems.
    datasets = self.get_datasets(data_dir)
    reserved_tokens = set()
    subword_tokens = set()
    for dataset in datasets:
        dataset.install()
        encoders = dataset.feature_encoders()
        for (k, encoder) in encoders.items():
            encoder = cast(SubwordEncoder, encoder)
            reserved_tokens.update(encoder.reserved_tokens)
            subword_tokens.update(encoder.subword_tokens)
    # Reserved tokens keep their fixed positions at the front; all other
    # subwords follow in sorted order.
    final_subwords = RESERVED_TOKENS + sorted(list(subword_tokens.difference(reserved_tokens)))
    logger.info('[%s] - Merged vocabularies; Final vocab size: %s' % (self.name, len(final_subwords)))
    with tf.gfile.Open(vocab_fp, mode='w') as vocab_f:
        for subword in final_subwords:
            vocab_f.write('\'%s\'\n' % subword)
    return SubwordEncoder(vocab_fp)
```
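The merge itself boils down to simple set operations. A minimal framework-free sketch, where the reserved tokens and toy subwords are invented for illustration (`<pad>` and `<EOS>` mirror tensor2tensor's usual reserved tokens, but real vocabularies are of course much larger):

```python
# Assumed reserved tokens; their order at the front of the vocab must be stable.
RESERVED_TOKENS = ["<pad>", "<EOS>"]

def merge_vocabs(vocabs):
    """Union all per-task subword sets, keeping reserved tokens fixed up front."""
    reserved = set(RESERVED_TOKENS)
    subwords = set()
    for vocab in vocabs:
        subwords.update(vocab)
    # Reserved tokens first (fixed positions), then the sorted non-reserved rest.
    return RESERVED_TOKENS + sorted(subwords - reserved)

merged = merge_vocabs([
    ["<pad>", "<EOS>", "hel", "lo_"],
    ["<pad>", "<EOS>", "wor", "lo_"],
])
# merged == ["<pad>", "<EOS>", "hel", "lo_", "wor"]
```

Sorting the non-reserved subwords makes the merged vocabulary deterministic regardless of the order in which sub-problems are visited.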
Currently, if a language model is present, the model input looks like this:

```
[<input-ids> <task-id> <target-ids> <eos>]
```

and if I got it right, the actual "input" to the decoder of the Transformer is then just `[<input-ids> <task-id>]`, and the model tries to predict the rest of the sequence. Because of the obligatory language model, the Transformer loses its encoder.
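To make the packing concrete, here is a tiny illustration. All of the ids below are invented, and `EOS = 1` is an assumption (it matches the common tensor2tensor convention):

```python
EOS = 1                # assumed end-of-sequence id
input_ids = [17, 23]   # <input-ids>  (made-up values)
task_id = 5            # <task-id>
target_ids = [42, 9]   # <target-ids>

# The packed language-model sequence: [<input-ids> <task-id> <target-ids> <eos>]
packed = input_ids + [task_id] + target_ids + [EOS]

# The decoder is conditioned on the prefix and must predict the remainder.
prefix = input_ids + [task_id]
to_predict = packed[len(prefix):]
# packed     == [17, 23, 5, 42, 9, 1]
# to_predict == [42, 9, 1]
```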
Since I removed the obligatory language model as explained above, the entire graph for the Transformer is created and the input is "back to normal". There is just one small addition, however: the task id now gets inserted at the beginning of the inputs sequence:
```python
def add_task_id(task_id, example):
    # Prepend the task id to the input token ids and also keep it
    # as a separate feature.
    concat_list = [[task_id], example['inputs']]
    example['inputs'] = tf.concat(concat_list, axis=0)
    example['task_id'] = tf.constant([task_id], dtype=tf.int64)
    return example
```
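A framework-free version of the same transformation, using plain lists instead of tensors, behaves like this (the example dict and ids are made up):

```python
def add_task_id(task_id, example):
    # Same idea as the tf.concat version: prepend the task id to the inputs
    # and keep it around as a separate feature as well.
    example['inputs'] = [task_id] + list(example['inputs'])
    example['task_id'] = [task_id]
    return example

ex = add_task_id(7, {'inputs': [11, 12, 13]})
# ex['inputs'] is now [7, 11, 12, 13] and ex['task_id'] is [7]
```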
There's probably room for improvement, but it works so far. Here is a small example of a multi-problem training on en->de and en->fr from my TensorBoard:

(Screenshot: Input, Target, Prediction)

I would like to know whether this looks okay and might have a future. If yes, I might consider creating a pull request to implement a new MultiProblem like the one above.

Any feedback would be great.
I was able to successfully apply my multi-problem implementation to a German-to-English translation task (`de2en`), mixing in the additional tasks `de2es` (to Spanish) and `de2fr` (to French).

It has to be said that the `de2en` task had only around 140k sentence pairs; the other datasets were significantly larger. The image shows different outcomes for different dataset weights passed to `tf.data.experimental.sample_from_datasets`.
Since I was not able to get MultiProblem to work with our in-house setup, I decided to implement a multi-problem on my own.
In case I succeed, I will update my progress here, mainly because I am hoping for some feedback (@urvashik, @afrozenator, @rsepassi), but also because I would like to see the above in t2t and can imagine making a pull request for this in the future.