harsha-simhadri / big-ann-benchmarks

Framework for evaluating ANNS algorithms on billion scale datasets.
https://big-ann-benchmarks.com
MIT License
356 stars 118 forks

Add OpenAI ArXiv Dataset #299

Closed magdalendobson closed 3 months ago

magdalendobson commented 3 months ago

This PR adds a dataset of roughly 2 million 1536-dimensional OpenAI ada-002 embeddings of ArXiv paper abstracts. The original ArXiv dataset was released by Cornell University on Kaggle under a CC0 license. We also provide a set of 20000 queries, likewise embedded from the abstracts of ArXiv articles, as well as ground truth for both the first 100000 vectors and the full 2321096 vectors.
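For readers unfamiliar with what "ground truth" means here, the following is a minimal sketch of computing exact nearest neighbors by brute force with NumPy. It is not the code used in this PR: the function name `brute_force_ground_truth`, the choice of squared Euclidean distance, and the small synthetic arrays are all assumptions made for illustration. The real dataset has 1536 dimensions and would need batching to fit in memory.

```python
import numpy as np

def brute_force_ground_truth(base, queries, k):
    """Exact k nearest base vectors for each query (hypothetical helper).

    Uses squared Euclidean distance, which is an assumption; the metric
    configured for the actual dataset may differ.
    """
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, computed for all pairs at once
    q_norms = (queries ** 2).sum(axis=1, keepdims=True)   # shape (nq, 1)
    b_norms = (base ** 2).sum(axis=1)                     # shape (nb,)
    dists = q_norms - 2.0 * (queries @ base.T) + b_norms  # shape (nq, nb)
    np.maximum(dists, 0.0, out=dists)  # clamp tiny negatives from rounding

    # Sort each row by distance and keep the k closest ids
    ids = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return ids, np.take_along_axis(dists, ids, axis=1)

# Synthetic stand-in data (16 dimensions instead of 1536 to keep it small)
rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 16))
# Each query is a slightly perturbed copy of one of the first five base vectors
queries = base[:5] + 1e-3 * rng.standard_normal((5, 16))

ids, dists = brute_force_ground_truth(base, queries, k=10)
```

Ground truth for a prefix of the dataset (such as the first 100000 vectors) would follow the same pattern with `base[:100000]` in place of `base`.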

magdalendobson commented 3 months ago

Added a comment in datasets.py describing the dataset; marking as ready for review.