Open simrankaurjolly16 opened 3 years ago
I believe that, depending on your task, you first need to pass the query table and the target table to the blocking function.
While doing so, you will reduce the number of candidates in each query and target partition. Then, within each partition, you need to arrange the data so that every query is paired with every target. So if you have 10 queries and 10 targets in a single partition, you will create a dataset of 100 pairs. After that you apply pre-processing (inserting the special tokens, etc.) and run inference. You then get a score for each pair and select the pair with the highest score.
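The block-then-pair pipeline above can be sketched as follows. This is a toy stand-in, not Ditto's actual code: the blocking function here is a simple shared-token inverted index, and `score` is a token-overlap (Jaccard) placeholder for the real model call, which would serialize each pair with special tokens and run the matcher.

```python
# Toy records: (id, text) rows for the query and target tables.
queries = [("q1", "apple iphone 12 64gb"), ("q2", "dell xps 13 laptop")]
targets = [("t1", "iphone 12 apple 64 gb"), ("t2", "xps 13 dell notebook"),
           ("t3", "samsung galaxy s21")]

def make_candidate_pairs(queries, targets):
    # Inverted-index blocking: a target is a candidate for a query
    # iff the two records share at least one token.
    index = {}
    for tid, text in targets:
        for tok in set(text.split()):
            index.setdefault(tok, set()).add(tid)
    tmap = dict(targets)
    pairs = []
    for qid, qtext in queries:
        cands = set()
        for tok in set(qtext.split()):
            cands |= index.get(tok, set())
        # Cross product within the partition: every surviving target
        # is paired with this query.
        for tid in sorted(cands):
            pairs.append((qid, tid, qtext, tmap[tid]))
    return pairs

def score(qtext, ttext):
    # Hypothetical scorer: Jaccard token overlap in place of Ditto inference.
    a, b = set(qtext.split()), set(ttext.split())
    return len(a & b) / len(a | b)

pairs = make_candidate_pairs(queries, targets)
# For each query, keep the target pair with the highest score.
best = {}
for qid, tid, qtext, ttext in pairs:
    s = score(qtext, ttext)
    if qid not in best or s > best[qid][1]:
        best[qid] = (tid, s)
```

Note how blocking shrinks the workload: without it, 2 queries x 3 targets would give 6 pairs to score; the inverted index leaves only the plausible ones.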
Ideally, we would like to extract the final-layer embeddings from the Ditto model and index the target embeddings with an ANN library like FAISS. Then, for each query, you only have to compute its embedding and do a fast lookup. This is much more scalable.
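A minimal sketch of the embed-and-look-up idea, with assumptions made explicit: the vectors below are made-up 3-d placeholders for final-layer embeddings (in practice you would extract them from the model's encoder), and a brute-force list scan stands in for a FAISS index (e.g. `IndexFlatIP` over L2-normalized vectors would play this role at scale).

```python
import math

# Hypothetical target embeddings; real ones would come from the encoder.
target_vecs = {"t1": [0.9, 0.1, 0.0],
               "t2": [0.0, 0.8, 0.2],
               "t3": [0.1, 0.1, 0.9]}

def normalize(v):
    # L2-normalize so that inner product equals cosine similarity.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Build the "index" once over all targets (FAISS stand-in).
index = [(tid, normalize(v)) for tid, v in target_vecs.items()]

def lookup(query_vec, k=1):
    # One embedding per query plus a similarity search, instead of one
    # full model forward pass per query-target pair.
    q = normalize(query_vec)
    scored = [(tid, sum(a * b for a, b in zip(q, v))) for tid, v in index]
    return sorted(scored, key=lambda t: -t[1])[:k]

top = lookup([0.85, 0.15, 0.05])  # nearest target for one query embedding
```

The key design point is the cost model: pairwise scoring is O(queries x targets) model calls, while the indexed approach is O(queries + targets) embedding calls plus a fast (sub-linear, with a real ANN index) lookup.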
The original author of BERT commented that BERT is not pretrained for semantic similarity (https://github.com/google-research/bert/issues/164#issuecomment-441324222), so you might get poor results, even worse than with simple GloVe embeddings.
I am running a small experiment with the Ditto model on both approaches and will update this thread when I have results.
How to run inference on unseen data after training?