oneal2000 / DRAGIN

Source code of DRAGIN, ACL 2024 main conference Long Paper

Get SGPT retriever working #4

Open drdsgvo opened 5 months ago

drdsgvo commented 5 months ago

In the `__init__` method of class SGPT (retriever.py), the code appears to look for .pt files to load from the directory given by the config parameter "sgpt_encode_file_path":

    filename = f"{i}_{j}.pt"
    pbar.update(1)
    tp = torch.load(os.path.join(encode_file_path, filename))

But your documentation says that the parameter "sgpt_encode_file_path" is an output directory.

The directory is empty after running with retriever SGPT. In method retrieve (retriever.py) this leads to an error as self.p_reps is empty.

Can you give me a hint as to what is wrong, or what needs to be put into the given directory?

Thank you

LittleDinoC commented 5 months ago

Before using SGPT as a retriever, you need to download the SGPT model from Hugging Face and encode the entire corpus using [the code from the SGPT repository](https://github.com/Muennighoff/sgpt) that matches the model you are using. Store the encoding results in the folder you have set.

The parameter sgpt_encode_file_path should be set to the path of the folder where you store the passage embeddings.

filename = f"{i}_{j}.pt" is the file naming scheme I used when splitting the corpus into segments and encoding them. You can change the file reading here to match your own layout and keep the subsequent processing of p_reps.

I hope this helps!
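For readers adapting the file reading, here is a minimal sketch of the lookup described above. It only assumes that the folder contains files named `<i>_<j>.pt`; the helper names `list_shards` and `load_p_reps` are hypothetical and not part of DRAGIN, and the thread does not say what `i` and `j` index. It also fails early with a clear message when the folder is empty, which is the situation reported in this issue.

```python
import re
from pathlib import Path

# Matches shard names such as "0_3.pt"; the two integers are the i and j
# from the f"{i}_{j}.pt" pattern quoted in this thread.
SHARD_RE = re.compile(r"^(\d+)_(\d+)\.pt$")


def list_shards(encode_file_path):
    """Return the shard paths under encode_file_path, ordered by (i, j)."""
    found = []
    for path in Path(encode_file_path).iterdir():
        match = SHARD_RE.match(path.name)
        if match:
            found.append((int(match.group(1)), int(match.group(2)), path))
    if not found:
        # An empty folder would otherwise surface later as an empty p_reps.
        raise FileNotFoundError(
            f"No '<i>_<j>.pt' files in {str(encode_file_path)!r}; "
            "encode the corpus first and store the embeddings there."
        )
    found.sort(key=lambda item: (item[0], item[1]))
    return [path for _, _, path in found]


def load_p_reps(encode_file_path):
    """Load every shard with torch.load and return them as a list (hypothetical helper)."""
    import torch  # imported lazily so list_shards works without PyTorch

    return [torch.load(path, map_location="cpu") for path in list_shards(encode_file_path)]
```

Sorting numerically rather than alphabetically keeps `0_10.pt` after `0_2.pt`, so the embeddings stay in the same order as the passages they were computed from.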

drdsgvo commented 5 months ago

Thank you very much. I had already found the repository you mentioned (by searching for SGPT). However, two things hold me back from programming the SGPT index creation: 1) I see some problems with the DRAGIN approach, which all (?) other RAG approaches seem to share. I will open another issue to point that out. 2) It is doable but takes some time (hours?). Am I the only person looking for that implementation?

QuentinLoriaux commented 4 months ago

I am trying to understand how to encode the corpus so that it matches the way loading is implemented in retriever.py. Could you explain in detail how you proceeded? First, when you mention the corpus, do you mean "psgs_w100.tsv", or am I completely lost?

LittleDinoC commented 3 months ago

Yes. To keep the same experimental setup, the corpus is psgs_w100.tsv.

Please use this code to encode the corpus: https://github.com/Muennighoff/sgpt?tab=readme-ov-file#asymmetric-semantic-search-be. I used the model Muennighoff/SGPT-1.3B-weightedmean-msmarco-specb-bitfit.
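As a starting point for the encoding step, here is a rough sketch of reading the corpus in batches. It assumes psgs_w100.tsv is a tab-separated file with a header row and the columns `id`, `text`, and `title`; the function name `iter_passage_batches` and the batch size are hypothetical. The encoding itself is deliberately left out and should follow the linked SGPT README.

```python
import csv


def iter_passage_batches(tsv_path, batch_size=1000):
    """Yield lists of (passage_id, title, text) tuples read from a passage TSV."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # Assumption: header row with the columns "id", "text", "title".
        reader = csv.DictReader(handle, delimiter="\t")
        batch = []
        for row in reader:
            batch.append((row["id"], row["title"], row["text"]))
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            # Emit the final, possibly shorter, batch.
            yield batch
```

Each batch can then be passed to the document-side encoding code from the SGPT README and the resulting tensor written with `torch.save` into the folder named by sgpt_encode_file_path. How batches should map onto the `i` and `j` in `{i}_{j}.pt` is not stated in this thread, so check the loading loop in retriever.py before choosing file names.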