Add Wikipedia-Cohere and MS MARCO Web Search Datasets

magdalendobson commented 3 months ago

This PR adds two new datasets to the benchmark suite.

The first dataset is Wikipedia-Cohere. Its base vectors are 35 million Cohere embeddings of the title and text of Wikipedia English articles, and its 5000 query vectors are Cohere embeddings of the title and text of Wikipedia Simple articles. The embeddings are licensed under an Apache 2.0 license, and we confirmed with the authors that we have permission to host and contribute these datasets to Big ANN Benchmarks. Ground truth is provided for the first 100K, 1M, and 35M vectors.

The second dataset is MS MARCO Web Search. Its 100,924,960 base vectors are embeddings of web documents from the ClueWeb22 document dataset, while its 9,374 queries correspond to web queries collected from the Microsoft Bing search engine. The authors state that "The MS MARCO Web Search are intended for non-commercial research purposes only to promote advancement in the field of artificial intelligence and related areas, and is made available free of charge without extending any license or other intellectual property rights." We confirmed with the authors that we have permission to contribute these datasets. Ground truth is provided for the first 1M, 10M, and 100M vectors.
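
For reviewers unfamiliar with prefix ground truth, here is a minimal sketch of what "ground truth for the first N vectors" means: the exact top-k neighbors of each query, computed only against the first N base vectors. This is illustrative and not the code used in this PR. The function name `brute_force_ground_truth`, the inner-product metric, and the random toy data are all assumptions; the real datasets would need a chunked or GPU implementation at these sizes.

```python
import numpy as np

def brute_force_ground_truth(base, queries, k, prefix_sizes):
    """Exact top-k neighbor ids for each prefix of `base`.

    Assumes maximum inner product as the similarity (an assumption,
    not taken from the PR). Returns {prefix_size: (num_queries, k) ids}.
    """
    results = {}
    for n in prefix_sizes:
        # Score every query against only the first n base vectors.
        scores = queries @ base[:n].T  # shape: (num_queries, n)
        # Sort by descending score and keep the k best ids per query.
        results[n] = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    return results

# Toy stand-in data; the actual datasets are far larger.
rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 16)).astype(np.float32)
queries = rng.standard_normal((5, 16)).astype(np.float32)

gt = brute_force_ground_truth(base, queries, k=10, prefix_sizes=[100, 1000])
```

Each entry of `gt` only contains ids below its prefix size, which is why a separate ground truth file is needed per slice.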

harsha-simhadri commented 3 months ago

Could you please add a link to their license terms? Other than that, OK to merge.

magdalendobson commented 3 months ago

Done, moving this out of draft mode.

harsha-simhadri / big-ann-benchmarks

Add Wikipedia-Cohere and MS MARCO Web Search Datasets #297