qqaatw / pytorch-realm-orqa

PyTorch reimplementation of REALM and ORQA
Apache License 2.0

How to use this repository with a custom knowledge base? #5

Open shamanez opened 2 years ago

shamanez commented 2 years ago

Similar to RAG in the Transformers library, can we use a custom KB?

qqaatw commented 2 years ago

Did you mean adding custom document entries to the index for retrieval during fine-tuning?

shamanez commented 2 years ago

Yes exactly.

qqaatw commented 2 years ago

Although the paper mentions that

While asynchronous refreshes can be used for both pre-training and fine-tuning, in our experiments we only use it for pre-training. For fine-tuning, we just build the MIPS index once (using the pre-trained θ) for simplicity and do not update Embed_doc. Note that we still fine-tune Embed_input, so the retrieval function is still updated from the query side.

The TF implementation of REALM, i.e. the one used in their experiments, freezes the evidence blocks and therefore doesn't update the index during fine-tuning, meaning that adding custom documents is not possible.

Therefore, if any custom documents are needed for retrieval, we have to either:

  1. run the pre-training process to generate new evidence blocks that embed the custom documents. This approach follows their experiments.
  2. implement asynchronous refreshes for fine-tuning ourselves. This approach is not guaranteed to work, and I haven't worked out the details of how best to integrate it with the existing models in transformers.
  3. (edited) use the query embedder to embed custom document entries and then concatenate them with the existing block embeddings for retrieval. This is probably the simplest way to add custom documents, though it's not guaranteed to work either. I'll add this to the repo first; a rough sketch follows below.
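
To illustrate option 3, here is a minimal sketch using the Hugging Face transformers REALM port. The checkpoint name and the `block_emb.pt` file are assumptions made for the example; in practice the pre-computed block embeddings would come from wherever the pre-trained checkpoint stores them.

```python
# Sketch of option 3: embed custom documents with the query embedder
# (Embed_input) and append them to the pre-computed block embeddings.
# The checkpoint name and block_emb.pt are assumptions for illustration.
import torch
from transformers import RealmEmbedder, RealmTokenizer

tokenizer = RealmTokenizer.from_pretrained("google/realm-cc-news-pretrained-embedder")
embedder = RealmEmbedder.from_pretrained("google/realm-cc-news-pretrained-embedder")

custom_docs = [
    "REALM augments language model pre-training with a latent knowledge retriever.",
    "ORQA learns to retrieve evidence using only question-answer pairs.",
]

# Embed the custom entries with the query embedder, since Embed_doc is frozen.
doc_inputs = tokenizer(custom_docs, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    custom_emb = embedder(**doc_inputs).projected_score  # (num_docs, proj_size)

# Append to the existing evidence-block embeddings so that inner-product
# retrieval also scores the custom documents.
block_emb = torch.load("block_emb.pt")  # hypothetical saved (num_blocks, proj_size) tensor
block_emb = torch.cat([block_emb, custom_emb], dim=0)

# Retrieval: the relevance score is the inner product between the query
# embedding and every block embedding, custom entries included.
question = tokenizer(["Who proposed REALM?"], return_tensors="pt")
with torch.no_grad():
    query_emb = embedder(**question).projected_score  # (1, proj_size)
scores = query_emb @ block_emb.T          # (1, num_blocks + num_docs)
top_k = torch.topk(scores, k=5, dim=-1)   # indices into the extended block list
```

Note that because the custom entries are embedded with Embed_input rather than Embed_doc, their scores are not necessarily calibrated against the original blocks, which is presumably part of why this option comes with no guarantee.
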
qqaatw commented 2 years ago

Option 3 has been implemented. Please check out the README and see if it matches your needs :-)