qqaatw / pytorch-realm-orqa

PyTorch reimplementation of REALM and ORQA
Apache License 2.0

Can you please explain about the wiki data format #3

Open shamanez opened 2 years ago

shamanez commented 2 years ago

Are they TF records?

qqaatw commented 2 years ago

Thanks for your interest in this repo.

Yes, they're TF records. However, we convert them to .npy format to avoid a TensorFlow dependency. See the data section of the README and the conversion script.
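For illustration only, here is a minimal sketch of what storing text blocks in a .npy file and reading them back could look like without TensorFlow. It is not the repo's conversion script: the helper names `save_block_records` and `load_block_records`, the file name, and the sample blocks are all hypothetical, and it assumes each record is a raw byte string.

```python
import os
import tempfile

import numpy as np


def save_block_records(blocks, path):
    """Store a list of byte-string text blocks as a 1-D NumPy object array."""
    # Hypothetical helper: an object array keeps variable-length byte strings as-is.
    records = np.array(blocks, dtype=object)
    np.save(path, records, allow_pickle=True)


def load_block_records(path):
    """Load the records back; object arrays require allow_pickle=True."""
    return np.load(path, allow_pickle=True)


# Made-up sample blocks standing in for Wikipedia passages.
blocks = [
    b"REALM augments language model pre-training with a retriever.",
    b"ORQA learns to retrieve evidence from question-answer pairs.",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "block_records.npy")  # hypothetical file name
    save_block_records(blocks, path)
    loaded = load_block_records(path)

print(loaded.shape)
print(loaded[0].decode("utf-8"))
```

Only load .npy files with `allow_pickle=True` when you trust their source, since unpickling can execute arbitrary code.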

I also noticed that you asked in another issue which vector similarity search library this project uses. That is definitely worth asking and is not spam. The current implementation in transformers uses brute-force matrix search, which consists purely of PyTorch operations and doesn't rely on other libraries. I'm considering adding FAISS support rather than ScaNN, since FAISS is more compatible with PyTorch and transformers, whereas ScaNN requires a specific TF version to be installed, which is what I want to avoid.
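As a rough sketch of what brute-force matrix search in pure PyTorch means, the snippet below scores every block against every query with one matrix multiplication and keeps the top-k results. It is not the actual transformers code: the function name `brute_force_search`, the tensor sizes, and the random embeddings are assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def brute_force_search(query_emb, block_emb, k):
    """Return the top-k inner-product scores and block ids for each query."""
    # (num_queries, dim) @ (dim, num_blocks) -> (num_queries, num_blocks)
    scores = query_emb @ block_emb.T
    top_scores, top_ids = torch.topk(scores, k=k, dim=-1)
    return top_scores, top_ids


torch.manual_seed(0)

# Random stand-ins for block embeddings. They are L2-normalized only so that
# the demo below has an unambiguous best match; this is not a claim about
# how the real embeddings are produced.
block_emb = F.normalize(torch.randn(1000, 128), dim=-1)

# Use two of the blocks as queries, so each query's best match is itself.
query_emb = block_emb[[3, 42]].clone()

top_scores, top_ids = brute_force_search(query_emb, block_emb, k=5)
print(top_ids[:, 0].tolist())
```

This exhaustive approach costs time and memory linear in the number of blocks per query, which is the trade-off an approximate index such as FAISS or ScaNN is meant to address.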