Open drdsgvo opened 7 months ago
Before using SGPT as a retriever, you need to download the SGPT model from Hugging Face and encode the entire corpus using [the code from the SGPT repository](https://github.com/Muennighoff/sgpt) that corresponds to the model you are using. Store the encoding results in a folder of your choice.

The parameter `sgpt_encode_file_path` should be set to the path of the folder where you store the passage embeddings. `filename = f"{i}_{j}.pt"` is the naming scheme I use when segmenting and encoding the corpus; you can change the file reading here to match your own layout, as long as you keep the subsequent processing of `p_reps` unchanged.
I hope this helps!
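For anyone following along, here is a minimal sketch of the storage layout described above. Only the folder pointed to by `sgpt_encode_file_path` and the `f"{i}_{j}.pt"` naming come from the answer; the two-level `i`/`j` indexing, the helper name, and everything else are assumptions for illustration.

```python
import os
import torch

def save_passage_shard(p_reps: torch.Tensor, encode_file_path: str, i: int, j: int):
    """Save one shard of passage embeddings (a 2-D tensor) under the expected name.

    The f"{i}_{j}.pt" naming matches what the retriever later loads; how you split
    the corpus into the outer index i and inner index j is up to you.
    """
    os.makedirs(encode_file_path, exist_ok=True)
    torch.save(p_reps.cpu(), os.path.join(encode_file_path, f"{i}_{j}.pt"))
```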
Thank you very much. I had already found the repository you mentioned (by searching for SGPT). However, two things hinder me from programming the SGPT index creation: 1) I see some problems with the DRAGIN approach, which all (?) other RAG approaches seem to have as well; I will open another issue to point that out. 2) It is doable, but it takes some time (hours?). Am I the only person looking for this implementation?
I am trying to understand how to encode the corpus so that it matches the way it is implemented in `retriever.py`. Could you explain in detail how you proceeded? First, when you say corpus, do you mean `psgs_w100.tsv`, or am I completely lost?
To ensure the same experimental setup, the corpus is `psgs_w100.tsv`.

Please use this code to encode the corpus: https://github.com/Muennighoff/sgpt?tab=readme-ov-file#asymmetric-semantic-search-be. I used the model `Muennighoff/SGPT-1.3B-weightedmean-msmarco-specb-bitfit`.
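In case a concrete end-to-end example helps, here is a condensed sketch adapted from the asymmetric-semantic-search snippet in the SGPT README (document brackets `{`/`}` plus position-weighted mean pooling), writing shards in the `{i}_{j}.pt` layout mentioned above. The TSV column order (`id`, `text`, `title`), the truncation length, the batch and shard sizes, and the output folder name are assumptions, not part of the original setup.

```python
import csv
import os
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "Muennighoff/SGPT-1.3B-weightedmean-msmarco-specb-bitfit"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval().to(device)

# specb brackets for documents, as in the SGPT README ("{" ... "}")
DOC_BOS = tokenizer.encode("{", add_special_tokens=False)[0]
DOC_EOS = tokenizer.encode("}", add_special_tokens=False)[0]

def encode_docs(texts):
    # tokenize without padding so the document brackets can be inserted per sequence
    tokens = tokenizer(texts, padding=False, truncation=True, max_length=300)
    for ids, att in zip(tokens["input_ids"], tokens["attention_mask"]):
        ids.insert(0, DOC_BOS); ids.append(DOC_EOS)
        att.insert(0, 1); att.append(1)
    batch = tokenizer.pad(tokens, padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        last_hidden = model(**batch, output_hidden_states=True,
                            return_dict=True).last_hidden_state
    # position-weighted mean pooling used by the "weightedmean" SGPT models
    weights = torch.arange(1, last_hidden.size(1) + 1, device=device).float()
    weights = weights.unsqueeze(0).unsqueeze(-1).expand(last_hidden.size())
    mask = batch["attention_mask"].unsqueeze(-1).expand(last_hidden.size()).float()
    return (last_hidden * mask * weights).sum(1) / (mask * weights).sum(1)

# assumed psgs_w100.tsv layout: header row, then columns id, text, title
passages = []
with open("psgs_w100.tsv", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)  # skip header
    for row in reader:
        passages.append(row[1])

out_dir = "sgpt_embeddings"                 # folder that sgpt_encode_file_path points to
os.makedirs(out_dir, exist_ok=True)
i, shard_size, batch_size = 0, 100_000, 8   # assumed sharding scheme
for j, start in enumerate(range(0, len(passages), shard_size)):
    chunk = passages[start:start + shard_size]
    reps = torch.cat([encode_docs(chunk[k:k + batch_size]).cpu()
                      for k in range(0, len(chunk), batch_size)])
    torch.save(reps, os.path.join(out_dir, f"{i}_{j}.pt"))
```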
In the `SGPT` class `__init__` (`retriever.py`), the code appears to look for `.pt` files to load from the directory given by the config parameter `sgpt_encode_file_path`:

```python
filename = f"{i}_{j}.pt"
pbar.update(1)
tp = torch.load(os.path.join(encode_file_path, filename))
```

But your documentation says that the parameter `sgpt_encode_file_path` is an output directory. The directory is empty after running with the SGPT retriever, and in the `retrieve` method (`retriever.py`) this leads to an error because `self.p_reps` is empty.

Can you give me a hint about what is wrong, or what needs to be put into the given directory?

Thank you
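For reference, the loading step being asked about presumably amounts to something like the sketch below. This is a reconstruction from the quoted lines, not the actual DRAGIN code; the directory scan and the final concatenation into `self.p_reps` are assumptions. The point is that `SGPT.__init__` only *reads* pre-computed shards, so `sgpt_encode_file_path` has to be filled with the encoded `.pt` files before running, rather than being populated by the retriever itself.

```python
import os
import torch

def load_p_reps(encode_file_path: str) -> torch.Tensor:
    """Load every pre-computed passage-embedding shard from the folder.

    Rough reconstruction of the pattern quoted above: the retriever never writes
    to this folder, it only reads the "<i>_<j>.pt" files that the encoding step
    must have produced beforehand.
    """
    shards = []
    for filename in sorted(os.listdir(encode_file_path)):
        if filename.endswith(".pt"):
            shards.append(torch.load(os.path.join(encode_file_path, filename)))
    if not shards:
        raise FileNotFoundError(
            f"No passage embeddings (*.pt) found in {encode_file_path}; "
            "encode the corpus first and place the shards there.")
    return torch.cat(shards)  # becomes self.p_reps, consumed by retrieve()
```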