microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License
3.6k stars 274 forks

[llm_retriever] Questions about the dataset #178

Open OStars opened 6 months ago

OStars commented 6 months ago

Hi, thanks for your great work. I ran the download_data.sh script and obtained the dataset successfully, but I have some questions about what exactly each file contains:

  1. What is the difference between passages.jsonl.gz and train.jsonl.gz?
  2. Which BM25 implementation was used to produce bm25_train.jsonl? Could you share the code, or a link to the specific implementation?
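
For anyone comparing these files in the meantime, here is a minimal sketch for peeking at the first few records of a gzipped JSON Lines file and listing their top-level fields. It assumes the files are gzip-compressed with one JSON object per line, as the .jsonl.gz extension suggests. The helper names (`peek_jsonl_gz`, `field_names`) and the paths in the usage part are hypothetical and not taken from the repository, so adjust the paths to wherever download_data.sh placed the files.

```python
import gzip
import json
import os
from itertools import islice


def peek_jsonl_gz(path, n=3):
    """Return the first n records of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not break json.loads.
        lines = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(lines, n)]


def field_names(records):
    """Collect the sorted union of top-level keys across the sampled records."""
    keys = set()
    for record in records:
        keys.update(record)
    return sorted(keys)


if __name__ == "__main__":
    # Assumed paths: change these to match your local download directory.
    for path in ("passages.jsonl.gz", "train.jsonl.gz"):
        if os.path.exists(path):
            sample = peek_jsonl_gz(path)
            print(path, field_names(sample))
```

Comparing the field lists of the two files side by side should at least show how their record layouts differ, although it will not answer how the BM25 candidates were generated.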