Open irthomasthomas opened 8 months ago
# Time LLM Embed Multi Commands
This markdown document contains the output of various `llm embed-multi` commands used for embedding data into an embeddings database. Each command is separated by a horizontal rule for easy reading.
time llm embed-multi prompts-cpu-b10-run1 -d embeddings.db --attach logs $(llm logs path) --sql 'SELECT id, prompt FROM logs.responses LIMIT 1000' -m jina-embeddings-v2-base-en --prefix prompt/ --batch-size 10
Output:
llm embed-multi prompts-cpu-b10-run1 -d embeddings.db --attach logs --sql -m 7142.22s user 3449.01s system 1320% cpu 13:21.98 total
time llm embed-multi prompts-gpu-b1-run2 -d embeddings.db --attach logs --sql -m jina-embeddings-v2-base-en --prefix prompt/ --batch-size 1
Output:
llm embed-multi prompts-gpu-b1-run2 -d embeddings.db --attach logs --sql -m 25.44s user 11.39s system 130% cpu 28.310 total
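To compare the two runs, the wall-clock totals can be pulled out of the `time` lines above and divided. The following is a minimal sketch, assuming the lines follow the zsh `time` format (`…s user …s system …% cpu … total`). The helper names `to_seconds` and `parse_time_line` are illustrative and not part of `llm`.

```python
import re

# Matches the tail of a zsh `time` report, e.g.
# "... 25.44s user 11.39s system 130% cpu 28.310 total"
TIME_RE = re.compile(
    r"(?P<user>[\d.]+)s user (?P<system>[\d.]+)s system "
    r"(?P<cpu>\d+)% cpu (?P<total>[\d:.]+) total"
)


def to_seconds(clock):
    """Convert "28.310", "13:21.98" or "1:02:03.4" to seconds."""
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_time_line(line):
    """Return user, system and wall-clock seconds plus the CPU percentage."""
    match = TIME_RE.search(line)
    if match is None:
        raise ValueError(f"not a zsh time line: {line!r}")
    return {
        "user": float(match["user"]),
        "system": float(match["system"]),
        "cpu_percent": int(match["cpu"]),
        "total": to_seconds(match["total"]),
    }


# The timing tails are copied from the two outputs above.
run1 = parse_time_line("7142.22s user 3449.01s system 1320% cpu 13:21.98 total")
run2 = parse_time_line("25.44s user 11.39s system 130% cpu 28.310 total")

print(f"run1 wall clock: {run1['total']:.2f}s")
print(f"run2 wall clock: {run2['total']:.2f}s")
print(f"wall-clock ratio: {run1['total'] / run2['total']:.1f}x")
```

The first run took 801.98 s of wall-clock time against 28.31 s for the second, a ratio of roughly 28x. The run names suggest that the two differ in both device (`cpu` vs `gpu`) and batch size (`b10` vs `b1`), so the ratio cannot be attributed to either factor alone.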
## Getting Started
The AgentSearch-V1 dataset is a collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over one billion passages, drawn from sources such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions, so please feel free to reach out with your ideas!
To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:
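A minimal sketch of streaming access is given here, assuming the `datasets` package is installed. The repository id `SciPhi/AgentSearch-V1` and the `train` split are assumptions not stated in this document, and the helper names `take` and `stream_agent_search` are illustrative. The dataset may also need extra arguments such as `data_files`, which can be passed through `**kwargs`.

```python
from itertools import islice


def take(records, n):
    """Return the first n items of a (possibly very long) iterable as a list."""
    return list(islice(records, n))


def stream_agent_search(repo_id="SciPhi/AgentSearch-V1", split="train", **kwargs):
    """Open the dataset in streaming mode so nothing is downloaded up front.

    Assumption: repo_id and split are guesses; adjust them to the dataset card.
    """
    # Imported lazily so the helpers above work without the package installed.
    from datasets import load_dataset  # pip install datasets

    return load_dataset(repo_id, split=split, streaming=True, **kwargs)
```

Calling `take(stream_agent_search(), 3)` would then fetch only the first three records. That call needs network access, so it is left out of the block.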
A full set of scripts to recreate the dataset from scratch can be found here. Further, you may check the docs for details on how to perform RAG over AgentSearch.
## Languages
English.
## Dataset Structure
The raw dataset structure is as follows:
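One way to see the fields of a record directly is to map each field name to the type of its value. The following is a minimal sketch, assuming records arrive as plain Python dictionaries, as rows streamed from HuggingFace do. The sample record is made up purely to exercise the helper and does not reflect the dataset's real field names.

```python
def describe_record(record):
    """Map each field name to the type name of its value, without assuming a schema."""
    return {name: type(value).__name__ for name, value in record.items()}


# Hypothetical record for illustration only; these are NOT the dataset's fields.
sample = {"example_text": "hello", "example_vector": [0.1, 0.2]}
print(describe_record(sample))
```

Running the same helper on the first streamed row would list the actual fields and their Python types.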
## Dataset Creation
This dataset was created as a step towards making humanity's most important knowledge openly searchable and LLM-optimal. It was created by locally filtering, cleaning, and augmenting publicly available datasets.
To cite our work, please use the following:
@software{SciPhi2023AgentSearch, author = {SciPhi}, title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search}, year = {2023}, url = {https://github.com/SciPhi-AI/agent-search} }
## Source Data
@ONLINE{wikidump, author = "Wikimedia Foundation", title = "Wikimedia Downloads", url = "https://dumps.wikimedia.org" }
@misc{paster2023openwebmath, title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text}, author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba}, year={2023}, eprint={2310.06786}, archivePrefix={arXiv}, primaryClass={cs.AI} }
@software{together2023redpajama, author = {Together Computer}, title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset}, month = April, year = 2023, url = {https://github.com/togethercomputer/RedPajama-Data} }
## License
Please refer to the licenses of the data subsets you use.
## Suggested labels
{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }