swj0419 / in-context-pretraining


Where are the files in 'data/b3g' #1

Open ChaofanTao opened 8 months ago

ChaofanTao commented 8 months ago

Hi, thanks for your time!

The README says 'We provide an example corpus in data/b3g to demonstrate our pipeline.' Where can I find the files in 'data/b3g'?

In addition, step 4 (Run the search distributed job) lists two commands.

Command 1:
python run.py --command search --config configs/config_test.yaml --xb ccnet_new --cluster_run --partition learnlab

Command 2:
python run.py --command search --config configs/config_test.yaml --xb ccnet_new --xq edouard_val

I am confused by the remark attached to these commands. If I have just one database containing multiple documents, should I run both commands one after the other, or only command 1?

mlomeli1 commented 7 months ago

The faiss OIVFBBS code is more general than what in-context pretraining requires; see https://github.com/facebookresearch/faiss/tree/main/demos/offline_ivf. For example, it lets you use a different dataset for the query vectors than for the database vectors. In this case you are right: we only use it to search the document embeddings against themselves, so I have removed the remark from the README to avoid confusion.
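
To illustrate the difference between the general case (separate query set) and the self-search case described above, here is a minimal sketch. It is not the repository's or faiss's code: it swaps in a brute-force NumPy search for the offline IVF search, and the names `knn_search`, `doc_emb`, and `other_queries` are hypothetical, with random vectors standing in for real document embeddings.

```python
import numpy as np


def knn_search(xb, xq, k):
    """Brute-force k nearest neighbours (squared L2) of each row of xq among rows of xb.

    Illustrative stand-in for an index search; not the faiss OIVFBBS implementation.
    """
    # Squared distances via ||q||^2 - 2 q.b + ||b||^2, shape (n_queries, n_database).
    d2 = (xq ** 2).sum(axis=1, keepdims=True) - 2.0 * (xq @ xb.T) + (xb ** 2).sum(axis=1)
    idx = np.argsort(d2, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(d2, idx, axis=1), idx


rng = np.random.default_rng(0)
doc_emb = rng.standard_normal((100, 16))       # hypothetical document embeddings
other_queries = rng.standard_normal((5, 16))   # hypothetical separate query set

# General case: the queries come from a different dataset than the database.
_, nn_general = knn_search(doc_emb, other_queries, k=3)

# Self-search case: the database is searched against itself.
# Each document's closest hit is the document itself, so ask for k + 1
# results and drop the first column to keep only the other documents.
_, nn_self = knn_search(doc_emb, doc_emb, k=4)
neighbours = nn_self[:, 1:]
```

In the self-search case only one set of vectors is needed, which matches the answer above that a single database of documents does not require a separate query dataset.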