keras-team / autokeras

AutoML library for deep learning
http://autokeras.com/
Apache License 2.0

AutoKeras + Deep Metric Learning (i.e. triplet loss) #1513

Open shun-lin opened 3 years ago

shun-lin commented 3 years ago

Feature Description

Hi,

In TensorFlow Addons, many Deep Metric Learning (DML) losses have been implemented (such as triplet loss (tutorial), contrastive loss, lifted structured loss, etc.) that are used to learn an embedding space as output. A classic example is FaceNet, which uses triplet loss to learn an embedding space for clustering. DML losses are also useful for few-shot learning (learning from very few examples).

Code Example


# Assumed imports; ImageEmbedder is the API proposed in this issue.
import autokeras as ak
from tensorflow.keras.datasets import cifar10

# Prepare the dataset.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
# Initialize the ImageEmbedder.
embedder = ak.ImageEmbedder(max_trials=3)
# Search for the best model.
embedder.fit(x_train, y_train, epochs=5)
# Get the embeddings for the test set.
test_embeddings = embedder.predict(x_test)

Here is the notebook that shows a quick implementation of ImageEmbedder and EmbeddingHead. It is very straightforward, with only a few small changes to ClassificationHead: instead of adding an output dense layer of num_classes size, we add a fully connected layer of embedding_size plus a normalization layer in build(). Similarly, we could have ak.TextEmbedder and ak.StructuredDataEmbedder using the same EmbeddingHead.
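To make that change concrete, here is a minimal, hypothetical sketch of the modified build(). It assumes the head_module.Head base class mentioned later in this thread and defaults to TripletSemiHardLoss from TensorFlow Addons; the notebook's actual implementation may differ.

import tensorflow as tf
import tensorflow_addons as tfa
from autokeras.engine import head as head_module  # assumed module path

class EmbeddingHead(head_module.Head):
    """Head that outputs an L2-normalized embedding instead of class logits."""

    def __init__(self, embedding_size=128, loss=None, **kwargs):
        # Default to triplet loss; any TFA DML loss could be swapped in.
        super().__init__(loss=loss or tfa.losses.TripletSemiHardLoss(), **kwargs)
        self.embedding_size = embedding_size

    def build(self, hp, inputs=None):
        input_node = tf.nest.flatten(inputs)[0]
        x = tf.keras.layers.Flatten()(input_node)
        # Project to embedding_size instead of num_classes.
        x = tf.keras.layers.Dense(self.embedding_size)(x)
        # Normalize so the embeddings lie on the unit hypersphere.
        return tf.keras.layers.Lambda(
            lambda t: tf.math.l2_normalize(t, axis=1))(x)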

Reason

This feature will be helpful for anyone who wants to build a DML model with AutoKeras. Since there are many different types of losses that all try to learn the embedding space, we could also make the type of loss a hyperparameter, to help users find the best DML model for their data/research.

Solution

I originally thought I could just use ak.ImageClassifier with triplet loss from TensorFlow Addons to train a DML model with AutoKeras, but it turns out that this doesn't work well.
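For reference, that attempt would look roughly like the sketch below (my reconstruction, not the exact code tried): the TFA loss is swapped in for the classifier's default, but the classification head still ends in a num_classes output layer rather than an embedding.

import autokeras as ak
import tensorflow_addons as tfa

# Swap a DML loss into the stock classifier; as noted above, this
# doesn't work well because the head's output is not an embedding.
clf = ak.ImageClassifier(
    loss=tfa.losses.TripletSemiHardLoss(),
    max_trials=3,
)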

This notebook shows a quick implementation of EmbeddingHead and ImageEmbedder. I would be very happy to contribute!

haifeng-jin commented 3 years ago

@shun-lin Thanks for the issue! We will see if more people are interested in this feature.

sidphbot commented 3 years ago

Great! It is definitely useful. I'm actually a student working on automatic product cataloguing. There are just too many classes for softmax-based classification, so deep metric learning is really essential for me.

However, I am encountering an issue and would very much like some help with it. Although the implementation of the image embedder works as a whole, the model is encountering feature collapse. I tested the loss with both Euclidean and cosine distance, but the loss was stuck at the margin (1.0002-1.0004); the model had collapsed in the initial trials itself.

I am running it for 20-30 trials before the usual too-many-oversized-models error is thrown.

I wanted to know if you can guide me on using the embedding head alone in AutoModel to get a more guided/restricted search space. The head_module.Head subclassing implementation seems perfect, yet it is not callable like the classification or regression heads or any other block:

output_node = ak.ClassificationHead()(processed_input)  # works

output_node = EmbeddingHead()(processed_input)  # throws "EmbeddingHead not a callable block"

output_node = EmbeddingHead()  # works, but does not serve the purpose, as it has no input features

Apologies for the trouble, but if you have any suggestions or ideas regarding the above error or avoiding feature collapse, it would be very helpful to my thesis.

Also, there is a very good way to tackle model collapse if everything is encapsulated with custom trials, mainly so that the triplet-loss margin can be tuned too. However, that seems to be a lot of work for tuning just one parameter. I would really like to know if you have any ideas on how to tune the loss margin as well.

shun-lin commented 3 years ago

Thanks for finding this useful!

Regarding the loss being stuck at the margin: this is (almost) the equivalent of the loss returning NaN in a classification/regression problem, since triplet loss is capped at the margin (explanation). A potential workaround may be an early-stopping callback for embedding models (maybe in EmbeddingHead) that stops the training and moves on to the next model when loss >= margin.
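As a rough illustration of that workaround (a hypothetical sketch, not tested code), a custom Keras callback could watch the training loss and abort the trial once it saturates at the margin; the default of 1.0 matches TFA's triplet losses:

import tensorflow as tf

class StopAtMargin(tf.keras.callbacks.Callback):
    """Stop training when the loss saturates at the margin (collapse)."""

    def __init__(self, margin=1.0, tolerance=1e-3):
        super().__init__()
        self.margin = margin
        self.tolerance = tolerance

    def on_epoch_end(self, epoch, logs=None):
        loss = (logs or {}).get("loss")
        # A loss pinned at (or above) the margin means the embeddings
        # have collapsed, so let the tuner move on to the next model.
        if loss is not None and loss >= self.margin - self.tolerance:
            self.model.stop_training = True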

And yes, I agree that the loss margin should be tunable, i.e. a hyperparameter. The problem is that the best model is currently picked by the lowest loss, and when we change the margin, the losses may no longer be meaningfully comparable. We may want to rely on other metrics to find the best model (e.g. clustering quality on a validation set). This is also needed if we want to make the deep metric learning loss itself a hyperparameter (e.g. tfa.losses.ContrastiveLoss, tfa.losses.TripletSemiHardLoss, and tfa.losses.LiftedStructLoss are all DML losses that try to accomplish the same thing): to find a good embedding model we should try/tune all of those losses. But again, the smallest loss != the best model, so some investigation needs to be done here.
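For illustration only, treating the loss as a hyperparameter could look like the hypothetical snippet below, using KerasTuner's hp.Choice; the name-to-loss mapping is my own, and the model-selection caveat above still applies:

import tensorflow_addons as tfa

DML_LOSSES = {
    "contrastive": tfa.losses.ContrastiveLoss,
    "triplet_semi_hard": tfa.losses.TripletSemiHardLoss,
    "lifted_struct": tfa.losses.LiftedStructLoss,
}

def pick_dml_loss(hp):
    # hp is the keras_tuner.HyperParameters passed into a head's build().
    name = hp.Choice("dml_loss", list(DML_LOSSES))
    return DML_LOSSES[name]()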

sidphbot commented 3 years ago

Hey, thanks for the explanation of how AutoKeras functions. Yes, other DML losses also need to be considered; in fact, that is why I asked about ways to tune the loss. I did not account for the fact that only the loss is compared to find the best model here.

However, I feel that even if we fix the loss, the head can be used with custom architectures in AutoModel to get a refined search space, instead of a pipeline that will introduce a lot of non-working architectures. Due to the incremental nature of the training, the penalties will probably keep accumulating until the model decides it cannot produce better embeddings and starts to cheat by producing identical embeddings, which cancel out in the loss function and yield the minimum loss, which is the margin here.

So please let me know about the error I posted.

shun-lin commented 3 years ago

You make a good point; the implementation of EmbeddingHead is very similar to ClassificationHead. Is it possible for you to share a notebook where the error with EmbeddingHead occurs?

Also, I will be making a lightweight package that contains the integration of AutoKeras with the DML losses from TensorFlow Addons (EmbeddingHead, ImageEmbedder, etc.) shown in the notebook above, until they are merged back into this repository.

sidphbot commented 3 years ago

The code is part of a distributed pipeline and is also protected by a non-disclosure agreement; however, you can find the relevant code excerpt with parameter values below. See if it helps.

Data

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv(csvfile)
df_copy = df[df[parameters.ycol].notnull()].copy()
df_copy[parameters.ycol] = LabelEncoder().fit_transform(df_copy[parameters.ycol])
x = np.array(df_copy[parameters.xcol].values.tolist())
y = np.array(df_copy[parameters.ycol].values.tolist())

x = get_chunk(x)  # loads a chunk of images with custom data preparation

x = tf.convert_to_tensor(x, dtype=tf.float32)
y = tf.convert_to_tensor(y, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((x, y))

# shape: <BatchDataset shapes: ((None, 180, 180, 3), (None,)), types: (tf.float32, tf.int32)>

train_data = dataset.batch(batch_size)

————

Hyper-Model 1

input_node = ak.ImageInput()
processed_input = ak.Normalization()(input_node)
processed_input = ak.ImageAugmentation()(processed_input)
output_1 = ak.ResNetBlock(version='v2', pretrained=False)(processed_input)
output_node = EmbeddingHead(embedding_size=512)(output_1)  # throws "not callable" error

clf = ak.AutoModel(
    inputs=input_node,
    outputs=output_node,
    overwrite=True,
    distribution_strategy=tf.distribute.MirroredStrategy(),
    max_trials=100,  # runs for 20-30 trials before failing - #1479 bug, #175 tuner bug
    max_model_size=100000000,
)

————

Hyper-Model 2

This works, but with feature collapse; also, when run longer, it sometimes ends up with many oversized models and BFC-allocator GPU OOM errors.

clf = ImageEmbedder(
    overwrite=True,
    max_model_size=100000000,
    max_trials=100,  # runs for 20-30 trials before failing - #1479 bug, #175 tuner bug
    distribution_strategy=tf.distribute.MirroredStrategy(),
    embedding_size=512,
)

————

tensorboard_callback = keras.callbacks.TensorBoard(log_dir=model_dir_path + '/automodel-log')

clf.fit(train_data, epochs=5, callbacks=[tensorboard_callback])

shun-lin commented 3 years ago

Hi, it looks like EmbeddingHead is callable (see the screenshot below). Does the EmbeddingHead in your implementation inherit from head_module.Head? That class has __call__ implemented, so it should be callable. Sorry for the low-quality screenshot, I am on my phone 😂

[screenshot attachment]

sidphbot commented 3 years ago

Hi, thanks for confirming. It is really weird; I am still getting the error in PyCharm's syntax highlighting:

[screenshot attachment]

The code still runs, though. Strangely, if the data is not shuffled, the model collapses (loss = margin), since some batches will contain samples from just one class and hence cannot be contrasted by any contrastive-style loss, triplet loss included.

When the data is shuffled, the loss is always NaN. I tried some TFDS datasets (MNIST and CIFAR-100) in as_supervised mode, along with the dataset I am working with; the results were always the same.

I could be doing something wrong. Can you provide a working notebook where there is any tangible model convergence recorded, no matter the data?

shun-lin commented 3 years ago

Hi,

I think the example notebook above for AutoKeras + Deep Metric Learning has an example with CIFAR-10, and it reaches a tangible loss of 0.4605 with only 2 epochs. Could you fork the notebook and test it? (It uses the stock implementation of EmbeddingHead and ImageEmbedder.) I haven't tried it on MNIST or CIFAR-100.

shun-lin commented 3 years ago

Also, @sidphbot, about how many classes does your dataset have? If you are not generating triplets from the input (and instead relying on the default sampling), you may want a larger batch size (don't forget to adjust your learning rate accordingly if needed) so that each batch contains at least 2 samples from the same class (one for the anchor and another for the positive); most of the DML losses implemented in TFA assume that. But if you can't increase your batch size, you may want to do something similar to https://stackoverflow.com/questions/55484923/how-to-make-dataset-for-triplet-loss for your input.

sidphbot commented 3 years ago

Thanks for your help, and apologies for the delay; it was a bit difficult to get the large datasets I am working with to run on the code because of limited resources.

However, now that it runs fine, my inference would be that it works great for smaller datasets (successfully tested with a ~0.87 top-5 retrieval accuracy score, on a custom validation loop, on a grocery-store dataset with 80 classes).

However, the performance decreases with the number of classes. With huge datasets like Products10K, with 9690 classes, the loss is pretty much always NaN. Also, the huge batch-size requirement is a bit difficult for huge datasets and/or limited resources, so for huge datasets I am moving on to implement ArcFace loss with a product-clustering multi-task model, which generalizes a bit better to dataset size.

shun-lin commented 3 years ago

I haven't played around with Products10K, but I will try to see if I can make the ImageEmbedder work with it. (I think custom batch generation may be needed: given a batch size, e.g. n=64, generate a random sample of n/2 classes with 2 samples each, as in the sketch below.)
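A rough sketch of that sampling scheme (hypothetical; images_by_class, mapping a class id to its list of image arrays, is an assumed helper structure):

import random
import numpy as np

def balanced_batch(images_by_class, batch_size=64):
    # Sample batch_size / 2 classes and take 2 images from each, so that
    # every anchor is guaranteed at least one positive in the batch.
    classes = random.sample(list(images_by_class), batch_size // 2)
    xs, ys = [], []
    for c in classes:
        for img in random.sample(images_by_class[c], 2):
            xs.append(img)
            ys.append(c)
    return np.stack(xs), np.array(ys)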

sidphbot commented 3 years ago

Hi, just an update: I made something for DML with AutoKeras based on ArcFace. Do give it a look if you like: https://github.com/sidphbot/AutoKeras-ArcFaceHead

shun-lin commented 3 years ago

Thanks! Quick question: why do you pass in y_train as a second input?

sidphbot commented 3 years ago

Thanks for asking that; I have been meaning to write about it and get opinions, but did not get the time. The labels being taken as input may seem ambiguous, but they are used only at the end for computing the loss and are not passed through any learnable layers; you can verify this inside the ArcFace layer class implementation.
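To illustrate the pattern (a sketch in the style of the keras-arcface implementation referenced below, not the repository's exact code): the labels arrive as a second model input, but they only gate the margin logic; no trainable layer consumes them.

import tensorflow as tf

class ArcFaceLayer(tf.keras.layers.Layer):
    """ArcFace logits from an embedding and its one-hot label."""

    def __init__(self, n_classes, s=30.0, m=0.50, **kwargs):
        super().__init__(**kwargs)
        self.n_classes, self.s, self.m = n_classes, s, m

    def build(self, input_shape):
        emb_shape, _ = input_shape  # (embeddings, labels)
        self.w = self.add_weight(name="W",
                                 shape=(emb_shape[-1], self.n_classes),
                                 initializer="glorot_uniform",
                                 trainable=True)

    def call(self, inputs):
        x, y = inputs  # y is one-hot and only selects where to add the margin
        x = tf.nn.l2_normalize(x, axis=1)
        w = tf.nn.l2_normalize(self.w, axis=0)
        cos = x @ w  # cosine similarity to each class center
        theta = tf.acos(tf.clip_by_value(cos, -1.0 + 1e-7, 1.0 - 1e-7))
        target = tf.cos(theta + self.m)  # angular margin on the true class
        logits = cos * (1 - y) + target * y
        return tf.nn.softmax(self.s * logits)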

If you use the extracted encoder for a separate image-retrieval validation, using a kNN or an index like NearPy (ANN), you can verify a test accuracy of about 85% top-1 and 88% top-5 on unseen query images (the evaluated hyper-model has some differences, so scores may vary slightly).
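That validation loop could look roughly like this (my sketch, with a brute-force scikit-learn kNN standing in for NearPy; function and variable names are placeholders):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def top_k_accuracy(gallery_emb, gallery_y, query_emb, query_y, k=5):
    # Index the gallery embeddings and look up each query's k neighbors.
    nn = NearestNeighbors(n_neighbors=k).fit(gallery_emb)
    _, idx = nn.kneighbors(query_emb)
    # A query counts as a hit if any neighbor shares its label.
    hits = (gallery_y[idx] == query_y[:, None]).any(axis=1)
    return hits.mean()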

It might be possible to separate out the loss-computation logic, but I have left it as in the keras-arcface implementation linked on my page; I will look into it. I will also shortly upload the validation example for the feature encoder.

Do let me know your thoughts on this.