omoindrot / tensorflow-triplet-loss

Implementation of triplet loss in TensorFlow
https://omoindrot.github.io/triplet-loss
MIT License

Adding data pipeline for balanced batches #7

Open omoindrot opened 6 years ago

omoindrot commented 6 years ago

The current implementation uses tf.data for the input pipeline and only creates random batches.

The triplet loss implementation accepts the following inputs:

def batch_all_triplet_loss(labels, embeddings, margin, squared=False):

where the labels could be anything.


Why a high number of classes is an issue

In the case of MNIST, if the batch size is big enough you are assured to have a large number of useful triplets. For instance, if you have a batch of 100 random images, you will have on average 10 images of each digit and will easily be able to construct triplets.

However when there are a lot of classes, this approach breaks down. For instance if you have 10,000 classes and a batch size of 100, the probability that all images in the batch have distinct labels is around 60%. If this happens, then no triplet can be built and the loss will be useless.
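
As a quick sanity check of that number (a back-of-the-envelope computation, not code from the repo):

num_classes = 10000
batch_size = 100

# Probability that all labels drawn uniformly at random are distinct (birthday-problem style)
p_all_distinct = 1.0
for k in range(batch_size):
    p_all_distinct *= (num_classes - k) / num_classes

print(p_all_distinct)  # ~0.61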


Proposed solution

I'll work on building a data pipeline that automatically creates balanced batches of data. This should use the following arguments from the params.json file: num_classes_per_batch (the number of distinct classes sampled in each batch) and num_images_per_class (the number of images sampled for each of these classes).

The batch size will be the product of these two numbers.


Additionally, the previous signature of batch_all_triplet_loss(labels, embeddings, margin) can be replaced with balanced_batch_all_triplet_loss(num_classes, num_examples, embeddings, margin) since we don't need the labels in this case.

We don't need the labels because all the batches will always contain the same order of examples. The labels will always look like:

[1, 1, 1, 1, 5, 5, 5, 5, 3, 3, 3, 3]

(in this example, we have num_classes=3 and num_examples=4 for a total batch size of 12)

The reason to do this is that we don't need to compute the masks dynamically with _get_triplet_mask(labels) since the mask will always be the same, so we can hard-code it statically in the graph. This will lead to some performance improvements.
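
For illustration, here is a rough sketch (not code from the repo) of how such a static mask could be precomputed once, assuming the fixed label layout described above:

import numpy as np

def static_triplet_mask(num_classes, num_examples):
    """Boolean mask of shape (B, B, B): True for valid (anchor, positive, negative) triplets."""
    # Fixed label layout, e.g. [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2] for num_classes=3, num_examples=4
    labels = np.repeat(np.arange(num_classes), num_examples)
    batch_size = num_classes * num_examples

    indices_not_equal = ~np.eye(batch_size, dtype=bool)      # anchor and positive must be distinct indices
    label_equal = labels[:, None] == labels[None, :]

    anchor_positive = indices_not_equal & label_equal        # valid (anchor, positive) pairs
    anchor_negative = ~label_equal                           # valid (anchor, negative) pairs

    # mask[a, p, n] is True iff (a, p) and (a, n) are both valid
    return anchor_positive[:, :, None] & anchor_negative[:, None, :]

The resulting array could then be embedded in the graph as a constant (e.g. tf.constant(static_triplet_mask(num_classes, num_examples))) instead of being recomputed from the labels at every step.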

fursovia commented 6 years ago

This will be so helpful, thanks! Do you have any approximate timeline for when it could happen?

omoindrot commented 6 years ago

I'm starting a new job soon so I'm not sure how much time I'll have. You can maybe try to build a working solution in a fork to see how that works?

omoindrot commented 6 years ago

@fursovia : something like this would work:

import numpy as np
import tensorflow as tf

from tensorflow.contrib.data.python.ops.interleave_ops import DirectedInterleaveDataset

import model.mnist_dataset as mnist_dataset

# Define the data pipeline
mnist = mnist_dataset.train(args.data_dir)

datasets = [mnist.filter(lambda img, lab: tf.equal(lab, i)) for i in range(params.num_labels)]

def generator():
    while True:
        # Sample the labels that will compose the batch
        labels = np.random.choice(range(params.num_labels),
                                  params.num_classes_per_batch,
                                  replace=False)
        for label in labels:
            for _ in range(params.num_images_per_class):
                yield label

selector = tf.data.Dataset.from_generator(generator, tf.int64)
dataset = DirectedInterleaveDataset(selector, datasets)

batch_size = params.num_classes_per_batch * params.num_images_per_class
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(1)

You would need the nightly build from tensorflow:

pip install tf-nightly

This will contain the DirectedInterleaveDataset. However, it is not in the public interface, so we still need to import it directly with from tensorflow.contrib.data.python.ops.interleave_ops import DirectedInterleaveDataset.
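
As a hypothetical quick check (assuming TF 1.x graph mode, as in the rest of the repo), you can pull one batch and verify that the labels arrive in balanced groups:

iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(labels))  # e.g. [3 3 3 3 3 7 7 7 7 7 ...], groups of params.num_images_per_class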

TengliEd commented 6 years ago

Nice test, @omoindrot. But I still have no idea how to apply DirectedInterleaveDataset to the raw MS-Celeb-1M dataset. I asked and commented on your answer on Stack Overflow. By the way, I don't use metric learning but the arcface loss.

omoindrot commented 6 years ago

Hi @TengliEd, if you are using the arcface loss I think you don't need to have these balanced batches. Correct me if I'm wrong but you should be able to train on a normal random batch of data, like with softmax.

vzxxbacq commented 5 years ago

Hello @omoindrot. Have you tested which method is faster: sampling from multiple files, or from a single file with filter?

andropar commented 5 years ago

Hi @omoindrot, I implemented your proposed solution, but batch generation is extremely slow for a big number of classes. Any ideas why this is and how to circumvent it?

omoindrot commented 5 years ago

My code above is very slow because of the dataset.filter(...) used to build the datasets.

The filter method will go through all the examples until it finds one with the correct label, so if you have 1,000 labels this will be roughly 1,000 times slower.

The solution is to create the datasets (one per label) in a different way. For instance, if you have filenames (containing images) and labels, you can create one list of filenames per label:

num_labels = 1000
datasets = []
for label in range(num_labels):
    # Get the filenames for this label
    filenames_per_label = ...
    dataset = tf.data.Dataset.from_tensor_slices((filenames_per_label,
                                                  [label] * len(filenames_per_label)))
    datasets.append(dataset)
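
For example, if the images for each label happened to live under a directory named after the label (a hypothetical layout, data/<label>/*.jpg, adapt to your own storage), the filenames could be gathered with glob:

import glob

filenames_per_label = sorted(glob.glob("data/{}/*.jpg".format(label)))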

By the way, a better way to do what I did before is to use the new tf.contrib.data.choose_from_datasets (or tf.data.experimental.choose_from_datasets since v1.12):

num_labels = 10
num_classes_per_batch = 4
num_images_per_class = 8

# Create the list of datasets as you like
datasets = ...

def generator():
    while True:
        # Sample the labels that will compose the batch
        labels = np.random.choice(range(num_labels),
                                  num_classes_per_batch,
                                  replace=False)
        for label in labels:
            for _ in range(num_images_per_class):
                yield label

choice_dataset = tf.data.Dataset.from_generator(generator, tf.int64)
dataset = tf.contrib.data.choose_from_datasets(datasets, choice_dataset)

batch_size = num_classes_per_batch * num_images_per_class
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(None)

maffos commented 5 years ago

Hey @omoindrot, I have tried the solution using tf.data.experimental.choose_from_datasets. However, my process gets killed when I try to train my network. I think it might be because the list with all the datasets exceeds my working memory. I have ~4000 classes with ~200 instances per class in my dataset. Do you maybe know of any other way?

omoindrot commented 5 years ago

If you are working with images stored in jpg files for instance, you can apply tf.data.experimental.choose_from_datasets only on the filenames and labels (which should be very fast), and then load the images from these filenames.

This would be like:

num_labels = 4000
num_classes_per_batch = 4
num_images_per_class = 8

image_dirs = ["data/class_{:04d}".format(i) for i in range(num_labels)]

# Create the list of datasets of filenames (one dataset per label)
datasets = [tf.data.Dataset.list_files("{}/*.jpg".format(image_dir)) for image_dir in image_dirs]

def generator():
    while True:
        # Sample the labels that will compose the batch
        labels = np.random.choice(range(num_labels),
                                  num_classes_per_batch,
                                  replace=False)
        for label in labels:
            for _ in range(num_images_per_class):
                yield label

choice_dataset = tf.data.Dataset.from_generator(generator, tf.int64)
dataset = tf.contrib.data.choose_from_datasets(datasets, choice_dataset)

# Now you read the image content
def load_image(filename):
    ...
    return image, label

dataset = dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)

batch_size = num_classes_per_batch * num_images_per_class
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(None)
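
For illustration, load_image could look roughly like the sketch below. Note the assumptions: it takes (filename, label) pairs, so the per-label datasets would have to be built with from_tensor_slices((filenames, labels)) as in the earlier example rather than with list_files (which yields only filenames), and the 160x160 output size is arbitrary:

def load_image(filename, label):
    image = tf.io.read_file(filename)                         # raw JPEG bytes
    image = tf.image.decode_jpeg(image, channels=3)           # uint8 tensor of shape [H, W, 3]
    image = tf.image.convert_image_dtype(image, tf.float32)   # scale to [0, 1]
    image = tf.image.resize_images(image, [160, 160])         # fixed size so batching works (tf.image.resize in TF 2.x)
    return image, label
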
batrlatom commented 5 years ago

Will this also work with batch_hard? Or do you have any suggestions for how to make batch_hard work with thousands of classes?

omoindrot commented 5 years ago

@batrlatom : I would say yes, this is only the data pipeline so it should work the same for batch_hard and batch_all.

christk1 commented 5 years ago

@omoindrot if I have a list of labels, e.g. [1, 1, 1, 4, 5, 3, 2, 2], and then use choose_from_datasets like in your example, will it select random images from each label?

omoindrot commented 5 years ago

It will if the datasets are shuffled beforehand.

TengliEd commented 5 years ago

@omoindrot since choose_from_datasets can randomly select an element from a dataset in datasets, do we need to shuffle each dataset beforehand?

omoindrot commented 5 years ago

choose_from_datasets will pick the first element of datasets[idx], where idx is the next index returned by choice_dataset.

It's as if you had 10 piles of plates (one pile = one dataset). Someone (choice_dataset) tells you which pile to take a plate from, but you always take the plate at the top. So if you want shuffled plates, you need to shuffle each dataset beforehand.

For instance:

datasets = [tf.data.Dataset.list_files("{}/*.jpg".format(image_dir)) for image_dir in image_dirs]
datasets = [dataset.shuffle(buffer_size) for dataset in datasets]

TengliEd commented 5 years ago

@omoindrot As my experimental results show, it did not take the plate at the top but took a random one from each pile.

TengliEd commented 5 years ago

@omoindrot Your triplet preparation code worked with num_labels=20000. However, when num_labels=40000, the error in the attached screenshot occurred. Does this mean the method cannot make triplets for a very large number of classes?

sseveran commented 5 years ago

@TengliEd I hit the same issue with ~7300 datasets. I have opened an issue to track this in tensorflow https://github.com/tensorflow/tensorflow/issues/29753.

You can disable optimization for tf.data using options.experimental_optimization.apply_default_optimizations = False
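
For example (a minimal sketch, assuming dataset is the tf.data.Dataset built as above):

options = tf.data.Options()
options.experimental_optimization.apply_default_optimizations = False
dataset = dataset.with_options(options)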

cyrusvahidi commented 5 years ago

If you are working with images stored in jpg files for instance, you can apply tf.data.experimental.choose_from_datasets only on the filenames and labels (which should be very fast), and then load the images from these filenames.

This would be like:

num_labels = 4000
num_classes_per_batch = 4
num_images_per_class = 8

image_dirs = ["data/class_{:04d}".format(i) for i in range(num_labels)]

# Create the list of datasets of filenames (one dataset per label)
datasets = [tf.data.Dataset.list_files("{}/*.jpg".format(image_dir)) for image_dir in image_dirs]

def generator():
    while True:
        # Sample the labels that will compose the batch
        labels = np.random.choice(range(num_labels),
                                  num_classes_per_batch,
                                  replace=False)
        for label in labels:
            for _ in range(num_images_per_class):
                yield label

choice_dataset = tf.data.Dataset.from_generator(generator, tf.int64)
dataset = tf.contrib.data.choose_from_datasets(datasets, choice_dataset)

# Now you read the image content
def load_image(filename):
    ...
    return image, label

dataset = dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)

batch_size = num_classes_per_batch * num_images_per_class
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(None)

Would you place this code in train_input_fn? Would the selected batch be repeated throughout the epoch this way?

Interestingly, if I select one batch with this code and repeat it for the epoch my loss converges below the margin. However, if I generate balanced batches, using the whole dataset for an epoch, the loss converges at the margin.

omoindrot commented 5 years ago

Would you place this code in train_input_fn? Would the selected batch be repeated throughout the epoch this way?

Yes. The batches generated are random so there is no "repeat" needed.

Interestingly, if I select one batch with this code and repeat it for the epoch my loss converges below the margin. However, if I generate balanced batches, using the whole dataset for an epoch, the loss converges at the margin.

This is because overfitting on one batch is easy, and the loss will converge to 0. When working on the full dataset, you may have other convergence issues that could be solved by lowering the learning rate, changing other hyperparameters or pretraining the network first on a softmax loss.

connorlbark commented 5 years ago

So, I am currently creating balanced batches by building a tensorflow dataset from a generator that yields 20 examples from a single class at a time, shuffling those examples, unbatching, then batching again with a batch size of 64. It's been an effective (and simple) way to create a balanced dataset (with the rare edge case that it doesn't overlap favorably), but I still have been unable to train effectively. It always converges to the embeddings being zero.

I have tried changing my embedding size to equal the number of classes in my dataset and pre-training the model on softmax, which I can get to ~85% accuracy easily. This reduces the starting triplet loss significantly, but it will still ultimately fail to converge.

I've tried many different hyperparameters, including extremely small learning rates (1e-7), but that will just make it collapse slower. Perhaps I should try even lower?

Any ideas? I'm at a loss (no pun intended)

omoindrot commented 4 years ago

@porgull : you can try to overfit a very small dataset (one triplet to begin with, then a bit more) and make sure that the loss converges to 0 on the training set. This should help catch some bugs.

Otherwise I would check the data and make sure that the generated triplets look correct.

saravanabalagi commented 4 years ago

Yes. The batches generated are random so there is no "repeat" needed.

The batches are generated at random and the generator will keep giving random labels infinitely. However, the individual datasets (one dataset created per class) won't: they yield images sequentially, and since there's no repeat for them, they will eventually stop after their last image is yielded. This will generate data for exactly one epoch, until all images in each of these datasets are consumed; it won't run any further. Playground here

You will have to do the following for it to repeat forever. But then there's no guarantee that each epoch will not contain a particular image more than once.

# Create the list of filename datasets, each repeating forever
datasets = [tf.data.Dataset.list_files("{}/*.jpg".format(image_dir)).repeat() for image_dir in image_dirs]

omoindrot commented 4 years ago

@saravanabalagi : good point, the original datasets need to yield samples infinitely.

It's then up to you how you want to control the amount of data coming from each dataset, and whether you want to oversample some datasets.
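
For instance, one hypothetical way to oversample some classes is to bias the label sampling inside the generator (class_weights and rare_labels below are made-up names, not part of the code above):

rare_labels = [0, 1, 2]                  # hypothetical indices of under-represented classes
class_weights = np.ones(num_labels)
class_weights[rare_labels] *= 3.0        # sample these classes 3x more often
class_probs = class_weights / class_weights.sum()

def generator():
    while True:
        labels = np.random.choice(num_labels, num_classes_per_batch,
                                  replace=False, p=class_probs)
        for label in labels:
            for _ in range(num_images_per_class):
                yield label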

kasri-mids commented 4 years ago

So, I am currently creating balanced batches by building a tensorflow dataset from a generator that yields 20 examples from a single class at a time, shuffling those examples, unbatching, then batching again with a batch size of 64. It's been an effective (and simple) way to create a balanced dataset (with the rare edge case that it doesn't overlap favorably), but I still have been unable to train effectively. It always converges to the embeddings being zero.

I have tried changing my embedding size to equal the number of classes in my dataset and pre-training the model on softmax, which I can get to ~85% accuracy easily. This reduces the starting triplet loss significantly, but it will still ultimately fail to converge.

I've tried many different hyperparameters, including extremely small learning rates (1e-7), but that will just make it collapse slower. Perhaps I should try even lower?

Any ideas? I'm at a loss (no pun intended)

@porgull Did you get this resolved? I am facing the same issue...Thanks!

majdirabia commented 4 years ago

Hi, has anyone tried this and faced an issue when loading the file from the filename? I have .npy files and get this error:

TypeError: expected str, bytes or os.PathLike object, not Tensor

Quite lost here; I tried to solve it by creating a wrapper around tf.py_func.

Code :

    def get_data_from_filename(filename):
        npdata = np.load(filename)
        return npdata, int(filename.split('_')[1])

    def get_data_wrapper(filename):
        features, labels_in = tf.py_function(
            get_data_from_filename, [filename], (tf.float32, tf.int32))
        return tf.data.Dataset.from_tensor_slices((features, labels_in))
majdirabia commented 4 years ago

Hi,

Could anyone help me? I'm still stuck with file loading, as it expects path strings rather than Tensors. If someone could show me how they implemented their load_image() function, it would give me a better idea of how to adapt it to my use case of .npy files.

Cheers, Majdi

paweller commented 3 years ago

Hello everyone,

first of all thank you for the initial input on balanced batches @omoindrot.

Unfortunately, as I am working with Keras and NumPy input data, I was not able to use omoindrot's solution. So I dug deeper into the topic and found another GitHub repository by @soroushj showing how to implement "A Keras-compatible generator for creating balanced batches". However, it does not feature any num_classes_per_batch and/or num_samples_per_class functionality.

So I took it as a starting point and extended it with the mentioned functionalities. It became a Keras-compatible balanced batch generator suited for triplet loss applications. As it is built on the keras.utils.Sequence object, the generator is multiprocessing-aware and can be shuffled. It was tested on the Omniglot dataset with the Vinyals splits (according to this GitHub repository) and yielded a pretty well balanced class distribution (standard deviation of six) across the entirety of batches used during the training process (75 epochs with 42 batches per epoch). Further information and the source code can be found here. I am by no means a coding expert, so please do not hesitate to contribute.
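
The core idea, roughly, looks like the sketch below (a condensed, hypothetical version, not the exact code from the linked repository; x and y are assumed to be NumPy arrays holding all samples and labels):

import numpy as np
from tensorflow import keras

class BalancedBatchSequence(keras.utils.Sequence):
    """Each batch holds num_classes_per_batch classes with num_samples_per_class samples each."""

    def __init__(self, x, y, num_classes_per_batch=8, num_samples_per_class=4, shuffle=True):
        self.x, self.y = x, np.asarray(y)
        self.num_classes_per_batch = num_classes_per_batch
        self.num_samples_per_class = num_samples_per_class
        self.shuffle = shuffle
        self.classes = np.unique(self.y)
        # Indices of the examples belonging to each class
        # (assumes every class has at least num_samples_per_class examples)
        self.class_indices = {c: np.flatnonzero(self.y == c) for c in self.classes}
        self.batch_size = num_classes_per_batch * num_samples_per_class

    def __len__(self):
        return len(self.y) // self.batch_size

    def __getitem__(self, idx):
        # Sample the classes for this batch, then the examples within each class
        batch_classes = np.random.choice(self.classes, self.num_classes_per_batch, replace=False)
        indices = np.concatenate([
            np.random.choice(self.class_indices[c], self.num_samples_per_class, replace=False)
            for c in batch_classes])
        if self.shuffle:
            np.random.shuffle(indices)
        return self.x[indices], self.y[indices]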

Thank you!