Open sayakpaul opened 3 years ago
This looks good but we need to validate two things:
For the first one, we need to add a metric, the most popular being edit distance.
I can look into incorporating the edit distance metric. Funnily enough, it goes by different names across the literature.
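For concreteness, edit (Levenshtein) distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal pure-Python sketch (just to illustrate the metric itself, independent of the model):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string `a` into string `b`."""
    # prev[j] holds the distance between the current prefix of `a`
    # and b[:j]; initialized for the empty prefix of `a`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


# Classic example: kitten -> sitting takes 3 edits.
print(levenshtein("kitten", "sitting"))
```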
After that, would you like to experiment with 2.? To keep the example brief, I suggest making the one with edit distance and self-attention a separate example. But I am open to your thoughts.
Yes, that sounds good to me. Let's push that first, we can always make another PR for the addons
@AakashKumarNain since the prediction model is different from the main training model, here's how I am envisioning the evaluation with edit distance.
We train the model as is and then extract the prediction model. After that, we run the edit distance evaluation. Sample code I have in mind (just for a single batch):
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Get a single batch and convert its labels to sparse tensors.
test_batch = next(iter(test_ds))
sparse_labels = tf.cast(
    tf.sparse.from_dense(test_batch["label"]), dtype=tf.int64
)

# Make predictions and convert them to sparse tensors.
predictions = prediction_model.predict(test_batch)
input_len = np.ones(predictions.shape[0]) * predictions.shape[1]
predictions_decoded = keras.backend.ctc_decode(
    predictions, input_length=input_len, greedy=True
)[0][0][:, :max_len]
sparse_predictions = tf.cast(
    tf.sparse.from_dense(predictions_decoded), dtype=tf.int64
)

# Compute individual edit distances and average them out.
edit_distances = tf.edit_distance(
    sparse_predictions, sparse_labels, normalize=False
)
mean_edit_distance = tf.reduce_mean(edit_distances)
@sayakpaul why not add a callback for metric evaluation during training as well?
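Such a callback could be sketched roughly like this. This is a non-authoritative sketch, not the PR's final code: `prediction_model`, `max_len`, `validation_images`, and `validation_labels` are all assumptions carried over from the batch-evaluation snippet above.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras


class EditDistanceCallback(keras.callbacks.Callback):
    """Reports mean edit distance on a validation set after each epoch.

    Assumes `prediction_model` is the inference-only model (no CTC loss
    layer) and `max_len` is the maximum label length, as in the example.
    """

    def __init__(self, prediction_model, validation_images, validation_labels, max_len):
        super().__init__()
        self.prediction_model = prediction_model
        self.validation_images = validation_images
        self.validation_labels = validation_labels
        self.max_len = max_len

    def on_epoch_end(self, epoch, logs=None):
        predictions = self.prediction_model.predict(self.validation_images, verbose=0)
        input_len = np.ones(predictions.shape[0]) * predictions.shape[1]
        decoded = keras.backend.ctc_decode(
            predictions, input_length=input_len, greedy=True
        )[0][0][:, : self.max_len]
        sparse_predictions = tf.cast(tf.sparse.from_dense(decoded), dtype=tf.int64)
        sparse_labels = tf.cast(
            tf.sparse.from_dense(self.validation_labels), dtype=tf.int64
        )
        edit_distances = tf.edit_distance(
            sparse_predictions, sparse_labels, normalize=False
        )
        print(
            f"Mean edit distance for epoch {epoch + 1}: "
            f"{tf.reduce_mean(edit_distances).numpy():.4f}"
        )
```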
@AakashKumarNain does this work?
@sayakpaul I think this can be made simpler. I will try refining it today
@AakashKumarNain please share what you have in mind. I can also work on further simplifying it from there. But simplification should not hamper readability IMO.
Instead of defining a prediction model every time the callback is hit, can't we just make a shallow copy of the model weights and reuse that?
Do you mean initialize a prediction model class and load the updated weights every time the callback is hit? But that would still require subclassing the main model (that contains the CTC layer) and then extracting the weights, no?
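To illustrate why no weight copying should be needed: in Keras, a model built from another model's existing layers shares those layers (and hence their weights), so an inference-only model extracted once stays in sync with training automatically. A toy sketch of this idea; the layer names `image`, `dense2`, and `ctc_loss` are hypothetical stand-ins, not the example's actual architecture:

```python
import tensorflow as tf
from tensorflow import keras

# Toy stand-in for the training model: an image input, a softmax head,
# and an extra layer playing the role of the CTC loss layer that we
# want to drop at inference time.
inputs = keras.Input(shape=(16,), name="image")
x = keras.layers.Dense(8, activation="softmax", name="dense2")(inputs)
outputs = keras.layers.Activation("linear", name="ctc_loss")(x)
model = keras.Model(inputs, outputs)

# Extract the prediction model once, by wiring the image input directly
# to the softmax output. The layers (and weights) are shared with
# `model`, so training updates are visible here with no copying.
prediction_model = keras.Model(
    model.get_layer(name="image").input,
    model.get_layer(name="dense2").output,
)
```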
Yes, that is true!
Yeah, so that does not introduce much of an improvement over the current setup IMO. But please correct me if I am missing something.
Agreed. I will review it once more in the evening and will let you know and then we can proceed with it
@sayakpaul check this out: https://drive.google.com/file/d/1_aixkxpDKlDQe2FICtPzmKqnHUYxnTNd/view?usp=sharing
Very well done.
@AakashKumarNain feel free to incorporate it in the PR directly. Neat.
Also, WDYT about the self-attention part? Should we cover that in a follow-up example?
Yes, we can push this one for now. For self-attention, we will make another example
SGTM. Curious to see the results with SA 🧐
@sayakpaul can you push the changes with the edit distance callback? It will be good if only one of us pushes the changes to that PR. Less cluttered IMO
Okay sir, will do. 💻
Thanks a lot :beers:
@AakashKumarNain just wanted to circle back to this part of the blog post. Happy to help with anything you might need.
@sayakpaul I got busy with work. Will get back to this soon
@sayakpaul I ran some experiments today. Although attention did provide some improvements, they aren't that big. I will try to showcase it side-by-side in a Colab soon
Okay. Maybe we need to reformulate how it's being used currently.
@AakashKumarNain I have been thinking more about this lately.
Since the characters inside the images are presented in a fairly tight-knit manner, I doubt that incorporating self-attention would provide that extra boost, or that it would help the model learn contextual dependencies beyond what the CNN part of our model is already capturing.
Happy to brainstorm more and design experiments, though.
@sayakpaul yes, in my experiments both models (with and without self-attention) perform almost the same. Let's take this offline and discuss the next steps
I also wanted to incorporate self-attention into the model to make the example a bit more interesting and fun.
Here's how I am doing it currently:
The visualization looks like so:
Here's the Colab. Note that the results are from 10 epochs of training.
Wanted to get your thoughts. @AakashKumarNain