megagonlabs / ditto

Code for the paper "Deep Entity Matching with Pre-trained Language Models"
Apache License 2.0

Inferencing #18

Open simrankaurjolly16 opened 3 years ago

simrankaurjolly16 commented 3 years ago

How do I run inference on unseen data after training?

ajaybabu20 commented 2 years ago

I believe that, depending on your task, you first need to pass the query table and the target table to the blocking function. Doing so reduces the number of candidates in each query and target partition. Then, for each partition, you need to create the data so that each query is paired with all the targets.
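To make that concrete, here is a minimal sketch of a blocking step. This is not Ditto's actual blocker; the token-overlap heuristic and the function names are my own assumptions:

```python
# Hypothetical blocking sketch: keep only (query, target) pairs that
# share at least one token, so the matcher scores far fewer candidates.

def tokenize(record: str) -> set:
    """Lowercase whitespace tokenization; swap in whatever your blocker expects."""
    return set(record.lower().split())

def block(queries: list, targets: list) -> list:
    """Return candidate (query_index, target_index) pairs with token overlap."""
    target_tokens = [tokenize(t) for t in targets]
    candidates = []
    for qi, q in enumerate(queries):
        q_tokens = tokenize(q)
        for ti, t_tokens in enumerate(target_tokens):
            if q_tokens & t_tokens:  # cheap overlap test
                candidates.append((qi, ti))
    return candidates
```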

So if you have 10 queries and 10 targets in a single partition, you will create a dataset of size 100. After that, you apply pre-processing (adding the special tokens, etc.) and run inference. That gives you a score for each pair, and for each query you select the target with the highest score.
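As a sketch of that pairing step, assuming Ditto's tab-separated input format where each entry is serialized as `COL <attr> VAL <value> ...` and each line ends with a label (a dummy 0 at inference time); the field names below are made up:

```python
# Build the all-pairs inference file for one blocked partition.

def serialize(record: dict) -> str:
    """Flatten a record dict into a COL/VAL string."""
    return " ".join(f"COL {k} VAL {v}" for k, v in record.items())

def make_pairs(queries: list, targets: list, out_path: str) -> None:
    """Write every query-target pair, one line per candidate."""
    with open(out_path, "w") as f:
        for q in queries:
            for t in targets:
                f.write(f"{serialize(q)}\t{serialize(t)}\t0\n")

# With 10 queries and 10 targets this writes the 100-line file
# mentioned above; run the matcher over it and, for each query,
# keep the target with the highest match score.
```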

Ideally, we would like to extract the final-layer embeddings from the Ditto model and index the target embeddings with an ANN library such as FAISS. Then, for each query, you just compute its embedding and do a fast lookup. This is much more scalable. Note, however, that one of the original BERT authors commented that BERT is not pretrained for semantic similarity (https://github.com/google-research/bert/issues/164#issuecomment-441324222), so you might get poor results, even worse than simple GloVe embeddings.
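Here is a rough sketch of that embedding-plus-ANN idea. The encoder loaded below is a stand-in for the LM backing your fine-tuned Ditto checkpoint, and the records are toy data; only the FAISS and Transformers calls are real library API:

```python
import numpy as np
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

# Stand-in encoder; in practice, load the LM backing your Ditto checkpoint.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased").eval()

def get_embedding(text: str) -> np.ndarray:
    """Final-layer vector at the first token position, as float32."""
    with torch.no_grad():
        out = model(**tokenizer(text, return_tensors="pt", truncation=True))
    return out.last_hidden_state[0, 0].numpy().astype("float32")

targets = ["COL name VAL apple iphone 11", "COL name VAL galaxy s10"]
target_vecs = np.stack([get_embedding(t) for t in targets])
faiss.normalize_L2(target_vecs)                  # cosine similarity via inner product

index = faiss.IndexFlatIP(target_vecs.shape[1])  # exact search; use IVF/HNSW at scale
index.add(target_vecs)

query_vec = get_embedding("COL name VAL iphone 11").reshape(1, -1)
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, k=1)       # index of the nearest target
```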

I am running a small experiment with the Ditto model on both approaches and will update this thread when I have results.