tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Implementing multi-problem from scratch in t2t #1687

Open stefan-falk opened 5 years ago

stefan-falk commented 5 years ago

Since I was not able to get MultiProblem to work with our in-house setup, I decided to implement a multi-problem on my own.

For three reasons:

In case I succeed, I will update my progress here. Mainly because I am hoping for some feedback (@urvashik, @afrozenator, @rsepassi), but also because I would like to see the above in t2t and could imagine making a pull request for this in the future.

stefan-falk commented 5 years ago

Update

I have now implemented my own version of a multi-problem. I am able to run this but I haven't had time to run enough experiments to see a positive effect yet (if I am at all able to do that).

I simply wasn't able to get the MultiProblem to work so I went top-down in a "happy path"-like manner.

1. Create Datasets and use sample_from_datasets for sampling

Essentially, what I did was create a list of Datasets, one per problem, and use tf.data.experimental.sample_from_datasets to randomly sample from them according to a constant distribution.

def dataset(self,
            mode,
            data_dir=None,
            # ..
            ):

    is_training = mode == tf.estimator.ModeKeys.TRAIN

    # Build one tf.data.Dataset per sub-task.
    datasets = list()
    for task in self.tasks:
        task_dataset = task.dataset(
                mode,
                data_dir=data_dir,
                # ..
            )

        # Tag every example of this sub-task with its task id.
        task_dataset = task_dataset.map(lambda example: self.add_task_id(task.task_id, example))

        if is_training:
            # Repeat indefinitely so sampling never exhausts a sub-task.
            task_dataset = task_dataset.repeat()

        datasets.append(task_dataset)

    # Sample from the sub-task datasets according to the normalized task weights.
    sampled_dataset = tf.data.experimental.sample_from_datasets(
        datasets,
        weights=np.asarray(self.task_weights, dtype=np.float64) / sum(self.task_weights)
    )

    return sampled_dataset
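
For anyone who wants to try the sampling part in isolation: below is a minimal, self-contained sketch (not part of the problem class above) of how tf.data.experimental.sample_from_datasets mixes datasets according to fixed weights. The toy datasets and weights are made up for illustration.

import numpy as np
import tensorflow as tf

# Two toy "tasks": one dataset yields only 0s, the other only 1s.
task_a = tf.data.Dataset.from_tensors(0).repeat()
task_b = tf.data.Dataset.from_tensors(1).repeat()

# Weights don't have to sum to 1, but normalizing them (as above) makes the
# sampling distribution explicit: ~25% from task_a, ~75% from task_b.
weights = np.asarray([1.0, 3.0], dtype=np.float64)
mixed = tf.data.experimental.sample_from_datasets(
    [task_a, task_b],
    weights=weights / weights.sum()
)

# With eager execution enabled you can inspect a few samples directly:
# for x in mixed.take(12):
#     print(x.numpy())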

2. Remove obligatory primary task (language model)

I cannot see a particular reason why a language model is required as the primary task. In fact, I think it's more of a burden, but please correct me if there is another reason for this constraint besides the requirement of a shared vocabulary among all tasks. To resolve the shared-vocab issue, I just create the vocabularies of all sub-tasks (sub-problems) and merge them like so:

def get_or_create_vocab(self, data_dir=None, tmp_dir=None, force_get=False) -> SubwordEncoder:
    logger = tf.logging

    vocabs_dir = os.path.join(data_dir, 'shared-vocabs')
    if not tf.gfile.Exists(vocabs_dir):
        tf.gfile.MakeDirs(vocabs_dir)

    vocab_fp = os.path.join(vocabs_dir, self.vocab_filename)

    # Reuse the merged vocabulary if it already exists.
    if tf.gfile.Exists(vocab_fp):
        return SubwordEncoder(vocab_fp)

    # Merge vocabularies: collect the subword tokens of all sub-task encoders.

    datasets = self.get_datasets(data_dir)
    reserved_tokens = set()
    subword_tokens = set()
    for dataset in datasets:
        dataset.install()
        encoders = dataset.feature_encoders()
        for (k, encoder) in encoders.items():
            encoder = cast(SubwordEncoder, encoder)
            reserved_tokens.update(encoder.reserved_tokens)
            subword_tokens.update(encoder.subword_tokens)

    # Keep the reserved tokens in front, followed by the sorted union of subwords.
    final_subwords = RESERVED_TOKENS + sorted(list(subword_tokens.difference(reserved_tokens)))

    logger.info('[%s] - Merged vocabularies; Final vocab size: %s' % (self.name, len(final_subwords)))

    # Write one quoted subword per line, the format the subword vocab files use.
    with tf.gfile.Open(vocab_fp, mode='w') as vocab_f:
        for subword in final_subwords:
            vocab_f.write('\'%s\'\n' % subword)

    return SubwordEncoder(vocab_fp)
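
Just for reference: a vocab file written in this quoted one-token-per-line format should also be loadable with the stock tensor2tensor SubwordTextEncoder (SubwordEncoder above appears to come from the in-house code). A minimal sketch with a made-up path:

from tensor2tensor.data_generators import text_encoder

# Hypothetical location of the merged vocabulary written by get_or_create_vocab().
vocab_fp = '/path/to/data_dir/shared-vocabs/vocab.shared'

encoder = text_encoder.SubwordTextEncoder(vocab_fp)
ids = encoder.encode('ein kleines Beispiel')

print(encoder.vocab_size)
print(ids)
print(encoder.decode(ids))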

3. Change model input

Currently, if a language model is present, the model input looks like this:

[<input-ids> <task-id> <target-ids> <eos>]

and, if I got it right, the actual "input" is then just [<input-ids> <task-id>] for the decoder of the Transformer, which then tries to predict the rest of the sequence. Because of the obligatory language model, the Transformer loses its encoder.

Since I removed the obligatory language model as explained above, the entire graph for the Transformer is created and the input is "back to normal". There is just one small addition: the task id now gets prepended to the inputs sequence:

def add_task_id(task_id, example):
    # Prepend the task id to the input ids; note that [task_id] has to convert to
    # the same dtype as example['inputs'] for tf.concat to work.
    concat_list = [[task_id], example['inputs']]
    example['inputs'] = tf.concat(concat_list, axis=0)
    # Also keep the task id around as a separate feature.
    example['task_id'] = tf.constant([task_id], dtype=tf.int64)
    return example
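
To make the effect concrete, here is a toy illustration of add_task_id (all ids are made up; with eager execution the tensors can be inspected directly). The toy inputs are int32 because a plain Python task id also converts to int32, so the dtypes match for tf.concat; with int64 inputs the task id would need to convert to int64 as well.

import tensorflow as tf

# Made-up example features.
example = {'inputs': tf.constant([17, 42, 7, 1], dtype=tf.int32)}

example = add_task_id(task_id=2, example=example)
# example['inputs']  -> [ 2 17 42  7  1]   (task id prepended)
# example['task_id'] -> [2]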

There's probably room for improvement, but it works so far. Here is a small example from my TensorBoard for a multi-problem trained on en->de and en->fr:

[image: Input, Target, Prediction samples]


I would like to know whether this looks okay and might have a future. If so, I might consider creating a pull request to implement a new MultiProblem as described above.

Any feedback would be great.

stefan-falk commented 5 years ago

I was able to successfully apply my multi-problem implementation to a translation task for German to English (de2en), mixing in the additional tasks de2es (German to Spanish) and de2fr (German to French).

[image: results for different dataset weights]

It has to be said that the de2en task had only around 140k sentence pairs; the other datasets were significantly larger. The image shows different outcomes for different dataset weights passed to tf.data.experimental.sample_from_datasets.
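
For context: the weights in question are the per-task values in self.task_weights, which get normalized before being passed to tf.data.experimental.sample_from_datasets. The numbers below are hypothetical, just to show what such configurations could look like for the three tasks.

# Hypothetical weight configurations for [de2en, de2es, de2fr];
# after normalization these become the sampling probabilities.
uniform_weights  = [1.0, 1.0, 1.0]   # ~33% of the samples from each task
de2en_upweighted = [3.0, 1.0, 1.0]   # ~60% de2en, ~20% de2es, ~20% de2fr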