
unsloth/README.md at main · unslothai/unsloth #625

Open · irthomasthomas opened 6 months ago

irthomasthomas commented 6 months ago

unsloth/README.md at main · unslothai/unsloth

### Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory!

![](https://i.ibb.co/sJ7RhGG/image-41.png)

✨ Finetune for Free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Gemma 7b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| Llama-2 7b | ▶️ Start on Colab | 2.2x faster | 43% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| CodeLlama 34b A100 | ▶️ Start on Colab | 1.9x faster | 27% less |
| Mistral 7b 1xT4 | ▶️ Start on Kaggle | 5x faster\* | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |
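
Outside the notebooks, the same flow can be reproduced locally. Below is a minimal sketch following Unsloth's documented QLoRA pattern; the model name, dataset, LoRA config, and training arguments are illustrative placeholders, not the notebooks' exact settings.

```python
# A minimal local QLoRA run following the notebook pattern (sketch).
# Model name, LoRA config, and training args are illustrative, not the
# notebooks' exact settings; your dataset needs a "text" column.
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # 4-bit base model
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(  # attach LoRA adapters
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder dataset: any corpus with a "text" column works here.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        max_steps=60,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()

# Optional: export to GGUF for llama.cpp-style runtimes.
model.save_pretrained_gguf("model", tokenizer)
```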

🦥 Unsloth.ai News

🔗 Links and Resources

| Type | Links |
|---|---|
| 📚 Wiki & FAQ | Read Our Wiki |
| 📜 Documentation | Read The Doc |
| 💾 Installation | unsloth/README.md |
| 🐦 Twitter (aka X) | Follow us on X |
| 🥇 Benchmarking | Performance Tables |
| 🌐 Released Models | Unsloth Releases |
| ✍️ Blog | Read our Blogs |

⭐ Key Features

🥇 Performance Benchmarking

| 1 A100 40GB | 🤗 Hugging Face | Flash Attention | 🦥 Unsloth Open Source | 🦥 Unsloth Pro |
|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 15.64x |
| LAION Chip2 | 1x | 0.92x | 1.61x | 20.73x |
| OASST | 1x | 1.19x | 2.17x | 14.83x |
| Slim Orca | 1x | 1.18x | 2.22x | 14.82x |

| Free Colab T4 | Dataset | 🤗 Hugging Face | PyTorch 2.1.1 | 🦥 Unsloth | 🦥 VRAM reduction |
|---|---|---|---|---|---|
| Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% |
| Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% |
| Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% |
| DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% |

View on GitHub

Suggested labels

irthomasthomas commented 6 months ago

Related issues

### #134: marker: Convert PDF to markdown quickly with high accuracy

### Details (similarity score: 0.9)

- [ ] [https://github.com/VikParuchuri/marker#readme](https://github.com/VikParuchuri/marker#readme)

## Marker

Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.

- Support for a range of PDF documents (optimized for books and scientific papers)
- Removes headers/footers/other artifacts
- Converts most equations to LaTeX
- Formats code blocks and tables
- Support for multiple languages (although most testing is done in English). See `settings.py` for a language list.
- Works on GPU, CPU, or MPS
**More Details**

## How it works

Marker is a pipeline of deep learning models:

- Extract text, OCR if necessary (heuristics, tesseract)
- Detect page layout ([layout segmenter](https://huggingface.co/vikp/layout_segmenter), [column detector](https://huggingface.co/vikp/column_detector))
- Clean and format each block (heuristics, [nougat](https://huggingface.co/facebook/nougat-base))
- Combine blocks and postprocess complete text (heuristics, [pdf_postprocessor](https://huggingface.co/vikp/pdf_postprocessor_t5))

Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper: `We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents.` In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages. Nougat is an amazing model, but I wanted a faster and more general purpose solution. Marker is 10x faster and has low hallucination risk because it only passes equation blocks through an LLM forward pass.

## Examples

| PDF | Type | Marker | Nougat |
|-----|------|--------|--------|
| [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/thinkpython.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/thinkpython.md) |
| [Think OS](https://greenteapress.com/thinkos/thinkos.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/thinkos.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/thinkos.md) |
| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/switch_transformers.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/switch_transformers.md) |
| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/multicolcnn.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/multicolcnn.md) |

## Performance

![Benchmark overall](data/images/overall.png)

The above results are with marker and nougat set up so they each take ~3GB of VRAM on an A6000. See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

# Limitations

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

- Marker will convert fewer equations to LaTeX than nougat. This is because it has to first detect equations, then convert them without hallucination.
- Whitespace and indentations are not always respected.
- Not all lines/spans will be joined properly.
- Only languages similar to English (Spanish, French, German, Russian, etc.) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc.) are not.
- This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.

# Installation

This has been tested on Mac and Linux (Ubuntu and Debian).
You'll need Python 3.9+ and [poetry](https://python-poetry.org/docs/#installing-with-the-official-installer). First, clone the repo:

- `git clone https://github.com/VikParuchuri/marker.git`
- `cd marker`

## Linux

- Install system requirements
  - Optional: Install tesseract 5 by following [these instructions](https://notesalexp.org/tesseract-ocr/html/) or running `scripts/install/tesseract_5_install.sh`.
  - Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `scripts/install/ghostscript_install.sh`.
  - Install other requirements with `cat scripts/install/apt-requirements.txt | xargs sudo apt-get install -y`
- Set the tesseract data folder path
  - Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple.
  - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
- Install python requirements
  - `poetry install`
  - `poetry shell` to activate your poetry venv
- Update pytorch, since poetry doesn't play nicely with it
  - GPU only: run `pip install torch` to install other torch dependencies.
  - CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.

## Mac

- Install system requirements from `scripts/install/brew-requirements.txt`
- Set the tesseract data folder path
  - Find the tesseract data folder `tessdata` with `brew list tesseract`
  - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
- Install python requirements
  - `poetry install`
  - `poetry shell` to activate your poetry venv

# Usage

First, some configuration (an example `local.env` follows this list):

- Set your torch device in the `local.env` file. For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. `cpu` is the default.
- If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
- Depending on your document types, marker's average memory usage per task can vary slightly. You can configure `VRAM_PER_TASK` to adjust this if you notice tasks failing with GPU out-of-memory errors.
- Inspect the other settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.
- By default, the final editor model is off. Turn it on with `ENABLE_EDITOR_MODEL`.
- By default, marker will use ocrmypdf for OCR, which is slower than base tesseract, but higher quality. You can change this with the `OCR_ENGINE` setting.
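
A minimal `local.env` sketch for a single-GPU setup, using only the settings described above; the values are illustrative and the tessdata path is a placeholder you must adjust:

```
# local.env — minimal single-GPU sketch (values illustrative)
TORCH_DEVICE=cuda
INFERENCE_RAM=16                    # GPU VRAM in GB
TESSDATA_PREFIX=/path/to/tessdata   # from `find / -name tessdata`
DEFAULT_LANG=English
```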
## Convert a single file

Run `convert_single.py`, like this:

```
python convert_single.py /path/to/file.pdf /path/to/output.md --parallel_factor 2 --max_pages 10
```

- `--parallel_factor` is how much to increase batch size and parallel OCR workers by. Higher numbers will take more VRAM and CPU, but process faster. Set to 1 by default.
- `--max_pages` is the maximum number of pages to process. Omit this to convert the entire document.

Make sure the `DEFAULT_LANG` setting is set appropriately for your document.

## Convert multiple files

Run `convert.py`, like this:

```
python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000
```

- `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK` if you're using GPU.
- `--max` is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
- `--metadata_file` is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. If not, `DEFAULT_LANG` will be used. The format is:

```
{
  "pdf1.pdf": {"language": "English"},
  "pdf2.pdf": {"language": "Spanish"},
  ...
}
```

- `--min_length` is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images, which slows everything down.

## Convert multiple files on multiple GPUs

Run `chunk_convert.sh`, like this:

```
MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bash chunk_convert.sh ../pdf_in ../md_out
```

- `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
- `NUM_DEVICES` is the number of GPUs to use. Should be `2` or greater.
- `NUM_WORKERS` is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK`.
- `MIN_LENGTH` is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing, as above.

# Benchmarks

Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods.

Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.

**Speed**

| Method | Average Score | Time per page | Time per document |
|--------|---------------|---------------|-------------------|
| naive  | 0.350727 | 0.00152378 | 0.326524 |
| marker | 0.641062 | 0.360622 | 77.2762 |
| nougat | 0.629211 | 3.77259 | 808.413 |

**Accuracy**

First 3 are non-arXiv books, last 3 are arXiv papers.

| Method | switch_trans.pdf | crowd.pdf | multicolcnn.pdf | thinkos.pdf | thinkdsp.pdf | thinkpython.pdf |
|--------|------------------|-----------|-----------------|-------------|--------------|-----------------|
| naive  | 0.244114 | 0.140669 | 0.0868221 | 0.366856 | 0.412521 | 0.468281 |
| marker | 0.482091 | 0.466882 | 0.537062 | 0.754347 | 0.78825 | 0.779536 |
| nougat | 0.696458 | 0.552337 | 0.735099 | 0.655002 | 0.645704 | 0.650282 |

Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.1GB` for marker. Benchmarks were run on an A6000.

**Throughput**

Marker takes about 2GB of VRAM on average per task, so you can convert 24 documents in parallel on an A6000.

![Benchmark results](data/images/per_doc.png)

## Running your own benchmarks

You can benchmark the performance of marker on your machine. First, download the benchmark data [here](https://drive.google.com/file/d/1WiN4K2-jQfwyQMe4wSSurbpz3hxo2fG9/view?usp=drive_link) and unzip. Then run `benchmark.py` like this:

```
python benchmark.py data/pdfs data/references report.json --nougat
```

This will benchmark marker against other text extraction methods.
It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each. Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.

# Commercial usage

Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage. I'm building a version that can be used commercially, by stripping out the dependencies below. If you would like to get early access, email me at marker@vikas.sh.

Here are the non-commercial/restrictive dependencies:

- LayoutLMv3: CC BY-NC-SA 4.0. [Source](https://huggingface.co/microsoft/layoutlmv3-base)
- Nougat: CC-BY-NC. [Source](https://github.com/facebookresearch/nougat)
- PyMuPDF: GPL. [Source](https://pymupdf.readthedocs.io/en/latest/about.html#license-and-copyright)

Other dependencies/datasets are openly licensed (doclaynet, byt5), or used in a way that is compatible with commercial usage (ghostscript).

# Thanks

This work would not have been possible without amazing open source models and datasets, including (but not limited to):

- Nougat from Meta
- Layoutlmv3 from Microsoft
- DocLayNet from IBM
- ByT5 from Google

Thank you to the authors of these models and datasets for making them available to the community!
### #456: Baseline benchmark for 17 coding models : r/LocalLLaMA
### Details (similarity score: 0.89)

- [ ] [Baseline benchmark for 17 coding models : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/19fc4uf/baseline_benchmark_for_17_coding_models/)

Baseline Benchmark for 17 Coding Models
=======================================

Discussion
----------

I am currently working on implementing some ideas for coding-model inference strategies (prompting, control, context exploration, CoT, ToT, etc.) and I needed a baseline benchmark on a bunch of models. Since I work on a 3060 12GB, I was limited in what I can test, so I went for every model that is 7/13B and has an AWQ quant available, since that is what the inference library that I use supports. I thought I'd share some numbers.

**Notes:**

- This is a benchmark for getting a local baseline. I'm interested in improvement from here, so the absolute values are less important for me. Don't take the absolute values too seriously (well, maybe except deepseek-coder-1.3b, which is a bit suspect).
- I used the HumanEval dataset. This is superseded by HumanEval+ and other more recent benchmarks. I chose this because it was the first one I tried. Again, with my tests I'm looking for improvements over the baseline, so this is mostly fine.
- AWQ quant is not the best out there, but all my tests will be done with this quant, so for me it is OK.
- Temp tests were done in only one generation. In general you'd want to average the score over many generations at a given temp.
- Each model was prompted according to the model card template. Here's an example for the codellama series:

```python
f"""You are a helpful and respectful assistant. Answer the following question: {question}"""
```

Results
-------

I've plotted the results (with horrendous contrasting colors, but alas) to look for any interesting patterns in problem solving. You can find the plots [here](https://imgur.com/a/autpnfK).

| Model | Temp | Correct / 164 | Percentage |
| --- | --- | --- | --- |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.0 | 67 | 40.85% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.1 | 63 | 38.41% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.2 | 68 | 41.46% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.3 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.4 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.5 | 63 | 38.41% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.6 | 54 | 32.93% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.7 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.8 | 60 | 36.59% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.9 | 59 | 35.98% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 1.0 | 65 | 39.63% |

#### Suggested labels

{ "label-name": "coding-models", "description": "Discussion and benchmark of coding models implementation strategies.", "confidence": 96.82 }
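Not from the post, but for context: a baseline like the one above is typically scored with OpenAI's `human-eval` harness. A minimal sketch of that loop, with `generate_one` as a hypothetical stand-in for whatever AWQ inference call you use:

```python
# Minimal HumanEval baseline loop (sketch, assuming the openai/human-eval package).
from human_eval.data import read_problems, write_jsonl

def generate_one(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in: call your AWQ inference backend and return only the completion."""
    raise NotImplementedError

problems = read_problems()  # 164 problems keyed by task_id
samples = [
    dict(task_id=task_id, completion=generate_one(p["prompt"], temperature=0.0))
    for task_id, p in problems.items()
]
write_jsonl("samples.jsonl", samples)
# Then score with the harness CLI:  evaluate_functional_correctness samples.jsonl
```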
### #160: sid321axn/tinyllama-text2sql-finetuned at main
### Details (similarity score: 0.89)

## tiny-llama-text2sql

**safetensors**

- [ ] [sid321axn/tinyllama-text2sql-finetuned at main](https://huggingface.co/sid321axn/tinyllama-text2sql-finetuned/tree/main)

**Adapter:** https://huggingface.co/sid321axn/tiny-llama-text2sql

This model is a fine-tuned version of [PY007/TinyLlama-1.1B-Chat-v0.3](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.3) on the None dataset.

```json
{
  "_name_or_path": "PY007/TinyLlama-1.1B-Chat-v0.3",
  "architectures": ["LlamaForCausalLM"],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 22,
  "num_key_value_heads": 4,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.37.0.dev0",
  "use_cache": false,
  "vocab_size": 32003
}
```
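Since the repo above is published as an adapter, loading it presumably follows the standard PEFT pattern. A minimal sketch under that assumption; the prompt format is a guess:

```python
# Sketch: load the LoRA adapter on top of its TinyLlama base with PEFT.
# Assumes the repo above is a standard PEFT adapter.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "PY007/TinyLlama-1.1B-Chat-v0.3"
base = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, "sid321axn/tiny-llama-text2sql")
tokenizer = AutoTokenizer.from_pretrained(base_id)

prompt = "Translate to SQL: list the names of all customers from Spain."  # format is a guess
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```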
### #499: marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.
### Details (similarity score: 0.89)

- [ ] [marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.](https://github.com/marella/ctransformers?tab=readme-ov-file#gptq)

# CTransformers

[![PyPI version](https://badge.fury.io/py/ctransformers.svg)](https://badge.fury.io/py/ctransformers) [Documentation](https://ctransformers.readthedocs.io/) [![Build and Test](https://github.com/marella/ctransformers/actions/workflows/build.yml/badge.svg)](https://github.com/marella/ctransformers/actions/workflows/build.yml) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

Python bindings for the Transformer models implemented in C/C++ using GGML library. Also see [ChatDocs](https://github.com/marella/chatdocs).

## Supported Models

| Model | Model Type | CUDA | Metal |
| ------ | --------- | :--: | :---: |
| GPT-2 | gpt2 | | |
| GPT-J, GPT4All-J | gptj | | |
| GPT-NeoX, StableLM | gpt_neox | | |
| Falcon | falcon | ✅ | |
| LLaMA, LLaMA 2 | llama | ✅ | ✅ |
| MPT | mpt | ✅ | |
| StarCoder, StarChat | gpt_bigcode | ✅ | |
| Dolly V2 | dolly-v2 | | |
| Replit | replit | | |

## Installation

To install via `pip`, simply run:

```
pip install ctransformers
```

## Usage

It provides a unified interface for all models:

```python
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))
```

Run in Google Colab

To stream the output:

```python
for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)
```

You can load models from Hugging Face Hub directly:

```python
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")
```

If a model repo has multiple model files (`.bin` or `.gguf` files), specify a model file using:

```python
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")
```

### 🤗 Transformers

Note: This is an experimental feature and may change in the future.

To use with 🤗 Transformers, create the model and tokenizer using:

```python
from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)
```

Run in Google Colab

You can use 🤗 Transformers text generation pipeline:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))
```

You can use 🤗 Transformers generation parameters:

```python
pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)
```

You can use 🤗 Transformers tokenizers:

```python
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.
```

### LangChain

It is integrated into LangChain. See LangChain [docs](https://github.com/LangChainAI/langchain#using-ctransformers-backed-models).
### GPU

To run some of the model layers on GPU, set the `gpu_layers` parameter:

```python
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)
```

Run in Google Colab

#### CUDA

Install CUDA libraries using:

```bash
pip install ctransformers[cuda]
```

#### ROCm

To enable ROCm support, install the `ctransformers` package using:

```bash
CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers
```

#### Metal

To enable Metal support, install the `ctransformers` package using:

```bash
CT_METAL=1 pip install ctransformers --no-binary ctransformers
```

### GPTQ

Note: This is an experimental feature and only LLaMA models are supported using [ExLlama](https://github.com/TheLastBen/exllama).

Install additional dependencies using:

```bash
pip install ctransformers[gptq]
```

Load a GPTQ model using:

```python
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
```

Run in Google Colab

If the model name or path doesn't contain the word `gptq`, specify `model_type="gptq"`.

It can also be used with LangChain. Low-level APIs are not fully supported.

## Documentation

Find the documentation on [Read the Docs](https://ctransformers.readthedocs.io/).

#### Config

| Parameter | Type | Description | Default |
| --------- | ------ | ------------------------------------------------------------ | ------- |
| `top_k` | `int` | The top-k value to use for sampling | `40` |
| `top_p` | `float` | The top-p value to use for sampling | `0.95` |
| `temperature` | `float` | The temperature to use for sampling | `0.8` |
| `repetition_penalty` | `float` | The repetition penalty to use for sampling | `1.1` |
| `last_n_tokens` | `int` | The number of last tokens to use for repetition penalty | `64` |
| `seed` | `int` | The seed value to use for sampling tokens | `-1` |
| `max_new_tokens` | `int` | The maximum number of new tokens to generate | `256` |
| `stop` | `List` | A list of sequences to stop generation when encountered | `None` |
| `stream` | `bool` | Whether to stream the generated text | `False` |
| `reset` | `bool` | Whether to reset the model state before generating text | `True` |
| `batch_size` | `int` | The batch size to use for evaluating tokens in a single prompt | `8` |
| `threads` | `int` | The number of threads to use for evaluating tokens | `-1` |
| `context_length` | `int` | The maximum context length to use | `-1` |
| `gpu_layers` | `int` | The number of layers to run on GPU | `0` |

Find the URL for the model card for GPTQ [here](https://github.com/marella/ctransformers?tab=readme-ov-file#gptq).

---

Made with ❤️ by [marella](https://github.com/marella)

#### Suggested labels

null
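Tying the Config table above back to the API: model-level settings go to `from_pretrained`, generation settings to the call. A small sketch, with illustrative values and the model repo from the earlier examples:

```python
from ctransformers import AutoModelForCausalLM

# Model-level settings (context window, CPU threads) at load time.
llm = AutoModelForCausalLM.from_pretrained(
    "marella/gpt-2-ggml",
    context_length=1024,
    threads=8,
)

# Generation-time settings from the Config table, passed per call.
text = llm(
    "AI is going to",
    max_new_tokens=128,
    temperature=0.8,
    top_k=40,
    top_p=0.95,
    repetition_penalty=1.1,
    stop=["\n\n"],
)
print(text)
```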
### #386: SciPhi/AgentSearch-V1 · Datasets at Hugging Face
### Details (similarity score: 0.89)

- [ ] [SciPhi/AgentSearch-V1 · Datasets at Hugging Face](https://huggingface.co/datasets/SciPhi/AgentSearch-V1)

#### Getting Started

The AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a vast range of content from sources such as Arxiv, Wikipedia, Project Gutenberg, and includes carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions – please feel free to reach out with your ideas!

To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:

```python
from datasets import load_dataset
import json
import numpy as np

# To stream the entire dataset:
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", streaming=True)

# Optional: stream just the "arxiv" subset
# ds = load_dataset("SciPhi/AgentSearch-V1", data_files="arxiv/*", split="train", streaming=True)

# To process the entries:
for entry in ds:
    embeddings = np.frombuffer(entry['embeddings'], dtype=np.float32).reshape(-1, 768)
    text_chunks = json.loads(entry['text_chunks'])
    metadata = json.loads(entry['metadata'])
    print(f'Embeddings:\n{embeddings}\n\nChunks:\n{text_chunks}\n\nMetadata:\n{metadata}')
    break
```

A full set of scripts to recreate the dataset from scratch can be found [here](https://github.com/SciPhi-AI/agent-search/tree/main/scripts). Further, you may check the docs for details on how to perform RAG over AgentSearch.

#### Languages

English.

#### Dataset Structure

The raw dataset structure is as follows:

```json
{
  "url": ...,
  "title": ...,
  "metadata": {"url": "...", "timestamp": "...", "source": "...", "language": "..."},
  "text_chunks": ...,
  "embeddings": ...,
  "dataset": "book" | "arxiv" | "wikipedia" | "stack-exchange" | "open-math" | "RedPajama-Data-V2"
}
```

#### Dataset Creation

This dataset was created as a step towards making humanity's most important knowledge openly searchable and LLM-optimal. It was created by filtering, cleaning, and augmenting locally publicly available datasets.

To cite our work, please use the following:

    @software{SciPhi2023AgentSearch,
      author = {SciPhi},
      title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search},
      year = {2023},
      url = {https://github.com/SciPhi-AI/agent-search}
    }

#### Source Data

    @ONLINE{wikidump,
      author = "Wikimedia Foundation",
      title = "Wikimedia Downloads",
      url = "https://dumps.wikimedia.org"
    }

    @misc{paster2023openwebmath,
      title = {OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
      author = {Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
      year = {2023},
      eprint = {2310.06786},
      archivePrefix = {arXiv},
      primaryClass = {cs.AI}
    }

    @software{together2023redpajama,
      author = {Together Computer},
      title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
      month = April,
      year = 2023,
      url = {https://github.com/togethercomputer/RedPajama-Data}
    }

#### License

Please refer to the licenses of the data subsets you use.
- Open-Web ([Common Crawl Foundation Terms of Use](https://commoncrawl.org/terms-of-use/))
- Books: the_pile_books3 license and pg19 license
- ArXiv Terms of Use
- Wikipedia License
- StackExchange license on the Internet Archive

#### Suggested labels

{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }
### #383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face
### Details (similarity score: 0.88)

- [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base)

Deepseek Coder Introduction
---------------------------

Deepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

### Key Features

- **Massive Training Data:** Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages.
- **Highly Flexible & Scalable:** Offered in model sizes of 1.3B, 5.7B, 6.7B, and 33B, enabling users to choose the setup most suitable for their requirements.
- **Superior Model Performance:** State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.
- **Advanced Code Completion Capabilities:** A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks.

### Model Summary

- **deepseek-coder-5.7bmqa-base:** A 5.7B parameter model with Multi Query Attention, trained on 2 trillion tokens.
- **Home Page:** [DeepSeek](http://deepseek.com)
- **Repository:** [deepseek-ai/deepseek-coder](https://github.com/deepseek-ai/deepseek-coder)
- **Chat With DeepSeek Coder:** [DeepSeek-Coder](https://github.com/deepseek-ai/deepseek-coder/discussions)

### How to Use

This section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks.
#### Code Completion

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

#### Code Insertion

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """<|begin|>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<|hole|>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<|end|>"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])
```

#### Repository Level Code Completion

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """#utils.py
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

def load_data():
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Standardize the data
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Convert numpy data to PyTorch tensors
    X_train = torch.tensor(X_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.int64)
    y_test = torch.tensor(y_test, dtype=torch.int64)

    return X_train, X_test, y_train, y_test

def evaluate_predictions(y_test, y_pred):
    return accuracy_score(y_test, y_pred)

#model.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class IrisClassifier(nn.Module):
    def __init__(self):
        super(IrisClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 3)
        )

    def forward(self, x):
        return self.fc(x)

    def train_model(self, X_train, y_train, epochs, lr, batch_size):
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(self.parameters(), lr=lr)

        # Create DataLoader for batches
        dataset = TensorDataset(X_train, y_train)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        for epoch in range(epochs):
            for batch_X, batch_y in dataloader:
                optimizer.zero_grad()
                outputs = self(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()

    def predict(self, X_test):
        with torch.no_grad():
            outputs = self(X_test)
            _, predicted = outputs.max(1)
        return predicted.numpy()

#main.py
from utils import load_data, evaluate_predictions
from model import IrisClassifier as Classifier

def main():
    # Model training and evaluation
"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=140)
print(tokenizer.decode(outputs[0]))
```

License
-------

This code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use. See the [LICENSE-MODEL](https://github.com/deepseek-ai/deepseek-coder/blob/main/LICENSE-MODEL) for more details.

Contact
-------

If you have any questions, please raise an issue or contact us at [agi_code@deepseek.com](mailto:agi_code@deepseek.com).

#### Suggested labels

{ "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" }
{ "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }