Linux, Python 3.9.19, PyTorch 2.0.1
See requirements.txt for the detailed environment.
cd model_finetune/rag-retrieval/embedding
bash train_embedding.sh
Note that you need to set the relevant parameters (e.g., model name, number of negative examples) in the .sh file before running it.
cd model_finetune/rag-retrieval/reranker
bash train_rerank.sh
Note that you need to set the relevant parameters (e.g., model name, number of negative examples) in the .sh file before running it.
python3 src/build_retriever.py
Note that you need to set the relevant parameters (e.g., embedding model path, save_path) in build_retriever.py before running it.
python3 src/get_related_doc.py
See the code comments for usage details. You need to specify the model name or model path in the code to produce results for each model.
python3 src/rerank.py
See the code comments for usage details.
python src/preprocess.py
python src/hard_negative_mining.py
See the code comments for usage details. We use this code to mine hard negative examples based on the similarity between docs and queries.
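The idea of similarity-based hard negative mining can be sketched as follows. This is only an illustration, not the code in src/hard_negative_mining.py: the function names, the use of cosine similarity, and the toy vectors are all assumptions. For a query, it keeps the non-positive docs whose embeddings are most similar to the query embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_hard_negatives(query_vec, doc_vecs, positive_ids, num_negatives=2):
    """Return ids of the non-positive docs most similar to the query.

    All names here are hypothetical; the real script may differ.
    """
    scored = [
        (cosine(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id not in positive_ids  # never pick a labeled positive
    ]
    # Highest similarity first; ties broken by doc id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:num_negatives]]


# Toy 2-d embeddings (made up for illustration).
docs = {
    "pos": [1.0, 0.0],
    "near": [0.9, 0.1],
    "mid": [0.5, 0.5],
    "far": [0.0, 1.0],
}
hard = mine_hard_negatives([1.0, 0.0], docs, positive_ids={"pos"}, num_negatives=2)
print(hard)
```

In practice the embeddings would come from the retriever model, and the number of negatives would match the value set in the training .sh file.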
python3 src/RRF.PY
See the code comments for usage details. This code merges the retrieval results produced by different models using reciprocal rank fusion (RRF).
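For readers unfamiliar with RRF, here is a minimal sketch of the technique. It is not the code in src/RRF.PY: the function name, the constant k=60 (a commonly used default), and the toy rankings are assumptions. Each doc receives a score of 1 / (k + rank) from every ranked list it appears in, and the scores are summed across lists.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=20):
    """Fuse several ranked lists of doc ids (best first) into one list.

    k is the RRF smoothing constant; 60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


# Three toy result lists, one per hypothetical retriever.
runs = [
    ["d1", "d2", "d3"],
    ["d2", "d3", "d1"],
    ["d2", "d1", "d4"],
]
fused = reciprocal_rank_fusion(runs, k=60, top_n=3)
print(fused)
```

A doc that several retrievers rank highly accumulates the largest score, so the fusion behaves like a rank-weighted vote across models.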
We explored many ways to improve the effectiveness of our model, but some of them (for example, the reranker and hard negative mining) proved useful only on the validation set, so our best-result model applies only some of the functions described above.
cd model_finetune/rag-retrieval/embedding
bash train_embedding.sh
We finetune gte-large-en-v1.5 using contrastive learning.
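As a rough illustration of contrastive learning for embeddings, the sketch below computes an InfoNCE-style loss for one query from precomputed similarity scores. Whether train_embedding.sh uses exactly this objective is not stated here; the function name and the temperature value are assumptions.

```python
import math


def info_nce_loss(pos_score, neg_scores, temperature=0.05):
    """InfoNCE-style loss for one query.

    pos_score: similarity between the query and its positive doc.
    neg_scores: similarities between the query and negative docs.
    temperature: assumed value, not taken from the training script.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    # Negative log-probability of picking the positive doc.
    return log_sum_exp - logits[0]


loss_good = info_nce_loss(0.9, [0.2, 0.1])  # positive clearly ahead
loss_bad = info_nce_loss(0.5, [0.2, 0.1])   # positive less separated
print(loss_good < loss_bad)
```

The loss shrinks as the positive doc's similarity rises above the negatives, which is what pushes the fine-tuned model to rank relevant docs first.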
python3 src/build_retriever.py
python3 src/get_related_doc.py
Here we construct retrievers for five models: gte-large-en-v1.5 (finetuned), GritLM-7B, SFR-Embedding-Mistral, NV-Embed-v1, and Linq-Embed-Mistral. We retrieve 100 docs for each query with each of the five retrievers, which yields five result files in the result folder.
python3 src/RRF.PY
Finally, we combine the results of the five models above using RRF; after this voting step, the top-20 results for each query are selected as the final results.