irthomasthomas / undecidability

6 stars 2 forks source link

SciPhi/AgentSearch-V1 · Datasets at Hugging Face #386

Open irthomasthomas opened 8 months ago

irthomasthomas commented 8 months ago

Getting Started

The AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a vast range of content from sources such as Arxiv, Wikipedia, Project Gutenberg, and includes carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions – please feel free to reach out with your ideas!

To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:

from datasets import load_dataset
import json
import numpy as np

# To stream the entire dataset:
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", streaming=True)

# Optional, stream just the "arxiv" dataset
# ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", data_files="arxiv/*", streaming=True)

# To process the entries:
for entry in ds:
    embeddings = np.frombuffer(
        entry['embeddings'], dtype=np.float32
    ).reshape(-1, 768)
    text_chunks = json.loads(entry['text_chunks'])
    metadata = json.loads(entry['metadata'])
    print(f'Embeddings:\n{embeddings}\n\nChunks:\n{text_chunks}\n\nMetadata:\n{metadata}')
    break

A full set of scripts to recreate the dataset from scratch can be found here. Further, you may check the docs for details on how to perform RAG over AgentSearch.

Languages

English.

Dataset Structure

The raw dataset structure is as follows:

{
    "url": ...,
    "title": ...,
    "metadata": {"url": "...", "timestamp": "...", "source": "...", "language": "..."},
    "text_chunks": ...,
    "embeddings": ...,
    "dataset": "book" | "arxiv" | "wikipedia" | "stack-exchange" | "open-math" | "RedPajama-Data-V2"
}

Dataset Creation

This dataset was created as a step towards making humanities most important knowledge openly searchable and LLM optimal. It was created by filtering, cleaning, and augmenting locally publicly available datasets.

To cite our work, please use the following:

@software{SciPhi2023AgentSearch, author = {SciPhi}, title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search}, year = {2023}, url = {https://github.com/SciPhi-AI/agent-search} }

Source Data

@ONLINE{wikidump, author = "Wikimedia Foundation", title = "Wikimedia Downloads", url = "https://dumps.wikimedia.org" }

@misc{paster2023openwebmath, title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text}, author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba}, year={2023}, eprint={2310.06786}, archivePrefix={arXiv}, primaryClass={cs.AI} }

@software{together2023redpajama, author = {Together Computer}, title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset}, month = April, year = 2023, url = {https://github.com/togethercomputer/RedPajama-Data} }

License

Please refer to the licenses of the data subsets you use.

Suggested labels

{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }

irthomasthomas commented 8 months ago

/# Time LLM Embed Multi Commands

This markdown document contains the output of various llm embed-multi commands used for embedding data into an embeddings database. Each command is separated by a horizontal rule for easy reading.

CPU Batch 10 (The maximum I can run in 64GB)

time llm embed-multi prompts-cpu-b10-run1 -d embeddings.db --attach logs $(llm logs path) --sql 'SELECT id, prompt FROM logs.responses LIMIT 1000' -m jina-embeddings-v2-base-en --prefix prompt/ --batch-size 10

Output:

llm embed-multi prompts-cpu-b10-run1 -d embeddings.db --attach logs  --sql  -m  7142.22s user 3449.01s system 1320% cpu 13:21.98 total

GPU

time llm embed-multi prompts-gpu-b1-run2 -d embeddings.db --attach logs  --sql  -m jina-embeddings-v2-base-en --prefix prompt/ --batch-size 1

Output:

llm embed-multi prompts-gpu-b1-run2 -d embeddings.db --attach logs  --sql  -m  25.44s user 11.39s system 130% cpu 28.310 total