Extending Graph regularization to images?

sayakpaul commented 3 years ago

Hi folks.

I am willing to work on a tutorial that shows how to extend graph regularization example in the same way it's done for text-based problems. Is there a scope for this tutorial inside this repo?

arjung commented 3 years ago

Hi Sayak, thanks for your interest! Can you say more about what you mean by extending graph regularization? What are you thinking of demonstrating and what dataset would it be for?

sayakpaul commented 3 years ago

Hi Arjun.

I should have been clearer. I meant to say "graph regularization example" that we have for text classification.

I am thinking of using similar methods on an image dataset (let's say the Flowers dataset). Some brief pointers:

For extracting the features, we could use a pre-trained model (ResNet50, InceptionV3, etc.) that comes with TensorFlow Hub or keras.applications. We could either use linear transfer (frozen features) or even do fine-tuning.
I think after extracting the features it would be pretty much similar to how the text classification example is demoed.

Let me know if anything is unclear.

arjung commented 3 years ago

Ah I see, that makes sense. We'll discuss this with the rest of our group and get back to you later this week.

arjung commented 3 years ago

Do you have a dataset in mind that encodes a natural/organic graph, perhaps something like a co-occurrence graph? We believe that using an orthogonal source of similarity and not just inferring it based on embeddings will be much more valuable for graph regularization, and so it'd be great to demonstrate that if possible. Another option might be to create 'perturbed' versions of images and use them as neighbors for graph regularization to improve the stability/robustness of the model. Let us know what you think.

sayakpaul commented 3 years ago

Do you have a dataset in mind that encodes a natural/organic graph, perhaps something like a co-occurrence graph? We believe that using an orthogonal source of similarity and not just inferring it based on embeddings will be much more valuable for graph regularization, and so it'd be great to demonstrate that if possible.

I don't have that kind of dataset in mind. Since in the text classification example, graph regularization was demonstrated using the IMDB dataset I was more inclined toward using a dataset of natural images.

Another option might be to create 'perturbed' versions of images and use them as neighbors for graph regularization to improve the stability/robustness of the model. Let us know what you think.

It's already kind of covered implicitly in the Adversarial Regularization example, isn't it?

arjung commented 3 years ago

Ideally, our goal should be to demonstrate some wins of using graph regularization in this tutorial. Here are two options for doing that; there could be more:

Option 1:

Use a complex pre-trained model to generate image embeddings and then use embedding similarity for graph building. Then, for the classification task, use a simple(r) model. The hope here is that using the graph will yield some improvements for the classification model because of the more powerful model used to generate embeddings. Let us know if you have other ideas here.

Option 2

Evaluate the robustness/stability of a model using image perturbations. The additional perturbed examples can be used as augmented training data or as neighbors for graph regularization. Note that this is different from the adversarial regularization example because here we'd be generating model-agnostic image perturbations -- for example, cropped, shifted, rotated, blurred, etc, versions of images. We have some work underway along this thread and could potentially collaborate on this if you're interested.

sayakpaul commented 3 years ago

I am interested to collaborate on both. Sounds really interesting. The first idea you mentioned, that is precisely what I had in mind, probably I did not convey that well enough.

Let me know the best possible way to start this off. I have the bandwidth to work on both the cases.

arjung commented 3 years ago

Sounds good. For option 1, feel free to put together what you had in mind and send us a PR. We can plan to have it under https://github.com/tensorflow/neural-structured-learning/tree/master/neural_structured_learning/examples/notebooks.

For option 2, will discuss with the rest of the team to see how we can go about this and circle back.

sayakpaul commented 3 years ago

Alright.

For option 1, here's what I have in mind:

Use a BiT variant on the Flowers dataset to generate the embeddings.
Construct the graph.
Construct a shallow model to train with the graph data.

For option 2, that sounds good to me. I have experience working with those kinds of perturbations and corruptions. I recently worked on assessing the robustness of Vision Transformers against these perturbations and corruptions.

sayakpaul commented 3 years ago

@arjung I started putting together a notebook for option 1.

After experimenting for a while, I am seeing that not all the images are having neighbors. This is likely because of the embeddings being generated by the pre-trained model I am using and also the hyperparameters I am using during constructing the graphs.

To elaborate, here's an example of an entry that does not have any neighbors:

features {
  feature {
    key: "NL_num_nbrs"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "id"
    value {
      bytes_list {
        value: "9"
      }
    }
  }
  feature {
    key: "image"
    value {
      bytes_list {
        value: "..."
      }
    }
  }
  feature {
    key: "label"
    value {
      int64_list {
        value: 1
      }
    }
  }

Here's one that does have neighbors:

features {
  feature {
    key: "NL_nbr_0_id"
    value {
      bytes_list {
        value: "1505"
      }
    }
  }
  feature {
    key: "NL_nbr_0_image"
    value {
      bytes_list {
        value: "..."
      }
    }
  }
  feature {
    key: "NL_nbr_0_label"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "NL_nbr_0_weight"
    value {
      float_list {
        value: 0.7551509737968445
      }
    }
  }
  feature {
    key: "NL_nbr_1_id"
    value {
      bytes_list {
        value: "2860"
      }
    }
  }
  feature {
    key: "NL_nbr_1_image"
    value {
      bytes_list {
        value: "..."
      }
    }
  }
  feature {
    key: "NL_nbr_1_label"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "NL_nbr_1_weight"
    value {
      float_list {
        value: 0.7100009918212891
      }
    }
  }
  feature {
    key: "NL_num_nbrs"
    value {
      int64_list {
        value: 2
      }
    }
  }
  feature {
    key: "id"
    value {
      bytes_list {
        value: "5"
      }
    }
  }
  feature {
    key: "image"
    value {
      bytes_list {
        value: "..."
      }
    }
  }
  feature {
    key: "label"
    value {
      int64_list {
        value: 0
      }
    }
  }
}

How should we handle this situation? Here's the Colab Notebook for full reproducibility. Note that the pre-trained model I used (BiT-m-r50x1) to extract the embeddings yield a vector of shape (1, 2048) (considering we have only one image). I further reduced this to a vector of shape (1, 128) with random projection.

The flowers dataset has only about 3600 examples in total categorized into 5 classes somewhat equally. The number of samples might be an issue. But I still wanted to know your thoughts.

Let me know if anything is unclear.

arjung commented 3 years ago

Thanks for putting together an initial version of the colab quickly. I took a quick peek at it and here are a couple of comments, which I think should address your question.

Since the dataset is quite small in your case, you really don't need to configure LSH for graph building. The 32 splits is likely the reason why the graph is a bit sparse and some of the nodes end up being isolated. If you really did want to configure LSH, I'd suggest reducing the # lsh_splits.
Since some nodes may not have neighbor features in general, the input layer construction code should be tolerant to this. In particular, when you're parsing the image feature for neighbors, you'd want to specify a default value for it in case it doesn't exist. So, change

feature_spec[nbr_feature_key] = tf.io.FixedLenFeature([], tf.string)

to

feature_spec[nbr_feature_key] = tf.io.FixedLenFeature([], tf.string, defalult_value="")

The actual default value of this neighbor feature doesn't matter because the corresponding neighbor weight is set to 0 -- this edge won't contribute to the graph regularization term. The shape has to be compatible with the value in the original example though.

See https://www.tensorflow.org/neural_structured_learning/tutorials/graph_keras_mlp_cora#load_train_and_test_data for how this is done in a different example.

sayakpaul commented 3 years ago

Thank you, @arjung!

I added the following:

feature_default_value = tf.zeros((IMG_SIZE, IMG_SIZE, 3))
feature_default_value = tf.strings.as_string(feature_default_value, precision=2)
feature_spec[nbr_feature_key] = tf.io.FixedLenFeature([], tf.string,
                                                    default_value=feature_default_value)

I also reduced the lsh_splits to 10 and as expected we have more neighbors - 681358.

With this, when I am trying to parse the TFRecords (built with NSL augmentation), I'm running into:

InvalidArgumentError: def_value[0].shape() == [384,384,3] is not compatible with dense_shapes_[0] == []
     [[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]] [Op:IteratorGetNext]

Now when prefixed the shapes like so (following the tutorial), I got another error.

feature_default_value = tf.zeros((IMG_SIZE, IMG_SIZE, 3))
feature_default_value = tf.strings.as_string(feature_default_value, precision=2)
feature_spec = {
        'image': tf.io.FixedLenFeature([IMG_SIZE, IMG_SIZE, 3], tf.string, 
                                       default_value=feature_default_value),
        'label': tf.io.FixedLenFeature((), tf.int64, default_value=-1),
}

...

feature_spec[nbr_feature_key] = tf.io.FixedLenFeature([IMG_SIZE, IMG_SIZE, 3], tf.string,
                                                    default_value=feature_default_value)

Issue:

ValueError: in user code:

    <ipython-input-32-21651c7f1f5a>:51 parse_example  *
        features['image'] = tf.image.decode_jpeg(features['image'], channels=3)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_image_ops.py:1202 decode_jpeg  **
        dct_method=dct_method, name=name)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/op_def_library.py:750 _apply_op_helper
        attrs=attr_protos, op_def=op_def)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py:601 _create_op_internal
        compute_device)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py:3565 _create_op_internal
        op_def=op_def)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py:2042 __init__
        control_input_ops, op_def)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py:1883 _create_c_op
        raise ValueError(str(e))

    ValueError: Shape must be rank 0 but is rank 3 for '{{node DecodeJpeg}} = DecodeJpeg[acceptable_fraction=1, channels=3, dct_method="", fancy_upscaling=true, ratio=1, try_recover_truncated=false](ParseSingleExample/ParseExample/ParseExampleV2:4)' with input shapes: [384,384,3].

You can refer to the same Colab Notebook mentioned here in case you want to take a look.

arjung commented 3 years ago

As the error indicates, tf.image.decode_jpeg() expects its input to have rank 0, so you'd have to take care of its shape requirements. See https://www.tensorflow.org/api_docs/python/tf/io/decode_jpeg for documentation on its arguments.

sayakpaul commented 3 years ago

Will look into it. But I guess I have already tried the next thing. When the feature_spec of the images are not specified with shapes ([]. rank-0) it results into:

InvalidArgumentError: def_value[0].shape() == [384,384,3] is not compatible with dense_shapes_[0] == []
     [[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]] [Op:IteratorGetNext]

I indicated this in the first part of my previous comment.

arjung commented 3 years ago

feature_default_value = tf.zeros((IMG_SIZE, IMG_SIZE, 3))
feature_default_value = tf.strings.as_string(feature_default_value, precision=2)
feature_spec[nbr_feature_key] = tf.io.FixedLenFeature([], tf.string, default_value=feature_default_value)

The above code will not work because for the neighbor feature, you're specifying the shape as rank 0 but then specifying a default value with shape [img_size, img_size, 3].

sayakpaul commented 3 years ago

Yes, that is why I tried the other one but then ran into the rank issue. Any approach you can think of to mitigate it? I understand we need to deal with the shape requirements but it's not immediately clear to me how we could do that here.

arjung commented 3 years ago

One option is to specify the default value as a JPEG-encoded string with rank 0. Your code already handles decoding from jpeg to integer tensors after parse_example().

Alternatively, if you don't need the back-and-forth JPEG conversion, you can serialize the examples (output of augmentation) to contain an int64_list for the 'image' feature -- this is the format of the feature in the dataset to begin with. If you do that, then you specify the default value for the 'image' feature as an integer tensor with shape [384, 384, 3].

sayakpaul commented 3 years ago

Thanks.

The code now works. Here's the Colab.

My base model is:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
image (InputLayer)           [(None, 384, 384, 3)]     0         
_________________________________________________________________
rescaling_3 (Rescaling)      (None, 384, 384, 3)       0         
_________________________________________________________________
global_average_pooling2d_3 ( (None, 3)                 0         
_________________________________________________________________
dense_9 (Dense)              (None, 64)                256       
_________________________________________________________________
dense_10 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_11 (Dense)             (None, 5)                 165       
=================================================================
Total params: 2,501
Trainable params: 2,501
Non-trainable params: 0
_________________________________________________________________

I think we can agree that this is way simple to deal with 224x224x3 images but I just wanted to get something up and running quickly. In this case, I saw graph regularization did play an important part. Over five runs, I was able to squeeze out at least 1-2% improvement over the base model.

I have also added a visualization snippet to allow folks to get a deeper insight into the neighbors being formed by NSL:

Let me know if I should proceed toward including the text pieces on this and any additional feedback you may have.

arjung commented 3 years ago

Thanks. I've been a bit busy over the past 2 weeks and haven't had a chance to look into this. Will get to it over the next few days.

sayakpaul commented 3 years ago

I appreciate that. Thanks.

arjung commented 3 years ago

Looks generally good, Sayak! I'll take another look once you send the PR. Please add sufficient documentation, doc strings, etc. A few comments/suggestions for now though regarding the results:

the graph regularized model doesn't seem to have converged during training
the scaled graph loss value seems to be quite low and so I am not sure if it's contributing much.

Can you try increasing the graph regularization multiplier and try increasing the # epochs? Another thing to potentially experiment with is the similarity threshold for the graph. If you increase it to > 0.65, is there a difference in the final model quality?

sayakpaul commented 3 years ago

the graph regularized model doesn't seem to have converged during training

Yes, I didn't train it to completion. I wanted to just briefly run the models to ensure they are working.

the scaled graph loss value seems to be quite low and so I am not sure if it's contributing much.

What we should expect to see there? This would give me a good idea as I experiment further.

Can you try increasing the graph regularization multiplier and try increasing the # epochs? Another thing to potentially experiment with is the similarity threshold for the graph. If you increase it to > 0.65, is there a difference in the final model quality?

I'll experiment with all of these and report back.

Thanks, Arjun!

sayakpaul commented 3 years ago

@arjung here are some observations from the recent set of experiments I conducted:

When I swapped the shallow neural network (comprising only two fully connected layers) with a shallow CNN the performance improved drastically.
With 384x384 shaped images, the graph regularized model was throwing ResourceExhasutedError during training. So, I reduced it down to 224x224. I also increased the similarity threshold and regularization multiplier. Both of them lead to better performance.
Without graph regularization, the model underfits when compared to the model trained with it. The validation performance also improves significantly with graph-reg. Similar observations also hold when the pre-trained model is changed to a DenseNet121. Here's the Colab Notebook.

Some points to note:

The pre-trained model (BiT) I am using is known to perform well across a number of different regimes. BiT has many variants. I used the smallest one. So, chances are likely that with a bigger one we might be able to improve the scores further.
I am using random projections to reduce the dimensionality of the pre-trained embeddings since they are either 2048-d (for BiT) or 1024-d (for DenseNet121).

Here's the Colab Notebook (BiT) where all these are reflected. Let me know how you would want me to proceed or if anything is unclear.

arjung commented 3 years ago

I don't know what caused the ResourceExhaustedError but batch size could also be one reason.
I am not able to see the underfitting problem. It looks like the performance on the training set is good for both models. However, the validation loss starts increasing even after 5 epochs in both cases even though the validation accuracy mostly remains flat thereafter.
Is there a reason you want to reduced the dimensionality of the embeddings for the purpose of graph building? Did you try using the higher dimensional embeddings when building the graph?
Feel free to experiment with a larger pre-trained model if you think it'll give us better results.

sayakpaul commented 3 years ago

Without graph regularization, the model underfits when compared to the model trained with it. The validation performance also improves significantly with graph-reg.

What I meant by this is the training accuracy is not on par with what we get with graph regularization. Without graph regularization, the training accuracy stays at 89%. With graph regularization incorporated, it reaches a substantially higher training as well validation accuracy. This is evident in both the notebooks I mentioned in my previous comment.

Is there a reason you want to reduced the dimensionality of the embeddings for the purpose of graph building? Did you try using the higher dimensional embeddings when building the graph?

1024 (DenseNet121) and 2048 (BiT-ResNet) dimensions did not seem very practical when serializing the embeddings. I will do a small ablation with the embedding dimensionality and will update the results in the next iteration.

sayakpaul commented 3 years ago

Updates.

Looks like the discrepancy between the training and validation performance can be easily fixed by reducing the batch size.

^{No graph-reg}

^Graph-reg

Also, here is a short ablation study:

Embedding Dim	With Graph-reg	Without Graph-reg
128	58.36%	56.55%
256	57.82%	55.45%
384	58%	60.18%
512	57.45%	57.64%

The table above reports the validation top-1 accuracies with and without graph regularization under different reduced embedding dimensionalities. Notice that after 256-d the performance trade-off reverses. Maybe a bit more tinkering with hyperparameters like similarity_threshold and graph_regularization_multiplier is needed to fully study this behavior.

arjung commented 3 years ago

Thanks. Are these the embedding dimensionality values used to build the graph, used in the input layer of the classifier, or both?

You mentioned you were going to try a larger BiT model too? Is that still part of your plan? In general, I think all of these experiments are useful and it would be great to have the findings summarized at the end of the colab.

sayakpaul commented 3 years ago

Thanks. Are these the embedding dimensionality values used to build the graph, used in the input layer of the classifier, or both?

For building the graph. Images go directly to the subsequent classifier.

You mentioned you were going to try a larger BiT model too? Is that still part of your plan?

Sure I will do that.

Do you think now we have a good ground to start working on the tutorial based on the notebook? If so, I can work on it and have a PR ready.

arjung commented 3 years ago

Yeah definitely, please go ahead. Thanks!

tensorflow / neural-structured-learning