I finally got perfect labels (classification task) via prompting : r/LocalLLaMA
DESCRIPTION
"I finally got perfect labels (classification task) via prompting
Tutorial | Guide
It took me weeks of trial and error, but here are my biggest lessons:
Alpaca works REALLY well, even for Mistral/Mixtral instructs
Mixtral8x7b-instruct is the best (in my experience) at in-context learning
For some reason, fine-tuned/DPO models tend to HELLA over-fit for false positives and improvements were marginal compared to Mixtral
Split your prompt into 3 sections:
Instructions: Explains the task
Hint: Explains likely mislabeling reasons
Few-shot: Examples w/ reasoning
Below is the plug-n-play template I finalized/am using
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
Instruction:
Label the text based on this question: "{task}" Below are example labeled comments, w/ the reason behind their labels as context. Learn from the examples and think step by step before responding. Start your response by printing a "Yes/No" statement first as the label.
(Hint: {common mistakes you see after trial and error})
Text: {few-shot example}
Reason for Label: {explanation}
Label: {correct label}
Input:
Text: {Text for it to label}
Label (Print Yes/No Only):
Response:
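For context, here is a minimal sketch of how that template could be filled programmatically; the function and variable names are illustrative, not from the post:

```python
# Minimal sketch of assembling the template above; names and values are illustrative.
def build_prompt(task: str, hint: str, examples: list, text: str) -> str:
    header = (
        "Below is an instruction that describes a task, paired with an input that "
        "provides further context. Write a response that appropriately completes the request.\n\n"
    )
    instruction = (
        "Instruction:\n"
        f'Label the text based on this question: "{task}" Below are example labeled comments, '
        "w/ the reason behind their labels as context. Learn from the examples and think step "
        'by step before responding. Start your response by printing a "Yes/No" statement first '
        "as the label.\n"
        f"(Hint: {hint})\n\n"
    )
    shots = "".join(
        f"Text: {ex['text']}\nReason for Label: {ex['reason']}\nLabel: {ex['label']}\n\n"
        for ex in examples
    )
    footer = f"Input:\nText: {text}\nLabel (Print Yes/No Only):\n\nResponse:\n"
    return header + instruction + shots + footer
```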
For experimentation, I found that discrepancies are your best friend. My setup was:
Create baseline labels; at this point you don't care how accurate they are. I think few-shot w/ 5 examples and no hints is the way to go here, because you want the model to fail
If you use Mixtral8x7b with the prompt format above, you will 100% get Yes/No labels + its justification, so you can quickly sample 10 outputs to see how it did and note the common mistakes for your hint
Run the model again with a hint added to your prompt, then look specifically at the discrepancies -- you should be able to tell instantly whether the baseline is overfitting toward false positives or false negatives; that's kind of your goal
As you iterate on your instruction, hints, and few-shot examples, keep looking at the discrepancies; your goal is to get their number to decrease little by little, so that by the time you're done, your prompt corrects all the mislabels.
Adding MORE few-shot examples will exaggerate the overfitting; you want this so you can quickly see whether your model leans toward false positives or false negatives
I wrote a script that outputs something like this:
```
Comparison between M8x7b-t0-s1000.csv and M8x7b-t1-s1000.csv:
Same: 900, Different: 100
Number of times M8x7b-t0 said "Yes" and M8x7b-t1 said "No": 100
Number of times M8x7b-t0 said "No" and M8x7b-t1 said "Yes": 0
```
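The post doesn't include the script itself; a sketch that produces this kind of comparison, assuming each CSV has a `label` column with Yes/No values and rows aligned by index (both assumptions, not from the post), might look like:

```python
import csv
import sys

def read_labels(path: str, column: str = "label") -> list:
    """Read the Yes/No label column from a results CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[column].strip() for row in csv.DictReader(f)]

def compare(path_a: str, path_b: str) -> None:
    a, b = read_labels(path_a), read_labels(path_b)
    same = sum(x == y for x, y in zip(a, b))
    yes_to_no = sum(x == "Yes" and y == "No" for x, y in zip(a, b))
    no_to_yes = sum(x == "No" and y == "Yes" for x, y in zip(a, b))
    print(f"Comparison between {path_a} and {path_b}:")
    print(f"Same: {same}, Different: {len(a) - same}")
    print(f'Number of times {path_a} said "Yes" and {path_b} said "No": {yes_to_no}')
    print(f'Number of times {path_a} said "No" and {path_b} said "Yes": {no_to_yes}')

if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])
```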
That was actually the result of my first test, where I increased the number of few-shot examples from 5 to 19. Looking at this, I could tell that the update led to more negative labels. After checking, there were some correct labels but mostly just false negatives. This was super helpful because it's more feasible to examine 100 outputs than 1000... or 1 million...
Eventually I got it down to this:
```
Comparison between M8x7b-t1-s1000.csv and M8x7b-t2-s1000.csv:
Same: 972, Different: 28
Number of times M8x7b-t1 said "Yes" and M8x7b-t2 said "No": 2
Number of times M8x7b-t1 said "No" and M8x7b-t2 said "Yes": 26
```
When I reviewed the output, filtering for these cases, it turned out that the second round of testing had corrected all of the mislabels.
Now is this perfect? After sampling instances where they agreed, it seems to be in order. I think there is something really special about this approach - by forcing overfitting, we can turn it into a feature instead of a bug. Working with the flaws of a model is a lot easier than trying to blindly iterate. At least here, we have a way to measure outputs against each other.
aichiusagi
• 19d ago • Edited 18d ago
> For some reason, fine-tuned/DPO models tend to HELLA over-fit for false positives and improvements were marginal compared to Mixtral

I ran into this too. When fine-tuning, you need to provide some subset of training data where you explicitly return nothing for false positives. In my data, I set this to about 10% of the total and the problem disappeared.
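A rough illustration of that kind of training mix, assuming an Alpaca-style instruction/input/output JSONL (field names and example texts are placeholders, not from the thread):

```python
import json
import random

instruction = "Label the comment: does it share personal details? Answer Yes or No."

positives = ["I just moved to Tokyo for a new job and I'm nervous."]   # placeholder texts
negatives = ["Wow, you are so beautiful."]
fp_prone = ["Love this!! Reminds me of my own channel."]               # inputs the model tends to mislabel as Yes

rows = (
    [{"instruction": instruction, "input": t, "output": "Yes"} for t in positives]
    + [{"instruction": instruction, "input": t, "output": "No"} for t in negatives]
    # roughly 10% of the final set: explicitly return nothing for false-positive-prone inputs
    + [{"instruction": instruction, "input": t, "output": ""} for t in fp_prone]
)
random.shuffle(rows)

with open("train.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```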
GeeBrain
• 19d ago
Oh very interesting, what did this look like exactly? Could you give me an example? I’m thinking about fine-tuning BERT for classification after this round, since using Mixtral takes forever and is unrealistic when I want to process millions of data points
Can you please provide an example of an actual prompt?
GeeBrain
• 18d ago
It's literally the template + whatever you want in the {}. But here ya go...
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
Instruction: Label the comment based on this question: "Does this comment share personal details, like how friends might talk to each other, and share from little to big things in their lives?" Below are example labeled comments, w/ the reason behind their labels as context. Learn from the examples and think step by step before responding. Start your response by printing a "Yes/No" statement first as the label.
(Hint: If a comment merely expresses an opinion or admiration without any personal context or experience, label it as ‘No’. But if the comment shares additional context about the commenter’s life, it should be labeled as ‘Yes’. The level of detail matters!)
Comment: Wow, you are so beautiful.
Reason for Label: Sharing simple statements of admiration or opinions does not count as disclosing personal details; the commenter needs to express something about their personal life, habits, or experiences.
Label: No
.... (More examples)
Input:
Comment: "When he comes up?"
Label (Print Yes/No Only):
Response:
trapping_rainwater
• 18d ago
What's your production use case for something like this?
GeeBrain
• 18d ago
My project is around building an ML model that measures trust — kinda like a fandom score.
But in general, I can see this type of setup being really helpful when you have a lot of unlabeled data and wanna get really close with it.
Even though I’ll likely end up fine-tuning BERT models in the future for production, this has helped me understand so much about the data space. Pretty fun"
### #456: Baseline benchmark for 17 coding models : r/LocalLLaMA
### Details (Similarity score: 0.88)
- [ ] [Baseline benchmark for 17 coding models : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/19fc4uf/baseline_benchmark_for_17_coding_models/)
Baseline Benchmark for 17 Coding Models
=========================================
Discussion
----------
I am currently working on implementing some ideas for coding model inference strategies (prompting, control, context exploration, CoT, ToT, etc.) and I needed a baseline benchmark on a bunch of models. Since I work on a 3060 12GB, I was limited in what I could test, so I went for every model that is 7/13B and has an AWQ quant available, since that is what the inference library I use supports. I thought I'd share some numbers.
**Notes:**
* This is a benchmark for getting a local baseline. I'm interested in improvement from here, so the absolute values are less important for me. Don't take the absolute values too seriously. (well, maybe except deepseek-coder-1.3b, that is a bit suspect).
* I used the HumanEval dataset. This is superseded by HumanEval+ and other more recent benchmarks. I chose this because it was the first one I tried. Again, with my tests I'm looking for improvements over the baseline, so this is mostly fine.
* AWQ quant is not the best out there, but all my tests will be done with this quant, so for me it is OK.
* Temp tests were done in only one generation. In general you'd want to average the score over many generations at a given temp.
* Each model was prompted according to the model card template. Here's an example for the codellama series -
```python
f"""You are a helpful and respectful assistant. Answer the following question: {question}"""
```
Results
-------
I've plotted the results (with horrendous contrasting colors, but alas) to look for any interesting patterns in problem solving. You can find the plots [here](https://imgur.com/a/autpnfK).
| Model | Temp | Correct / 164 | Percentage |
| --- | --- | --- | --- |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.0 | 67 | 40.85% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.1 | 63 | 38.41% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.2 | 68 | 41.46% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.3 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.4 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.5 | 63 | 38.41% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.6 | 54 | 32.93% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.7 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.8 | 60 | 36.59% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.9 | 59 | 35.98% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 1.0 | 65 | 39.63% |
#### Suggested labels
#### { "label-name": "coding-models", "description": "Discussion and benchmark of coding models implementation strategies.", "confidence": 96.82 }
### #309: openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"
### Details (Similarity score: 0.86)
- [ ] [openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"](https://github.com/openai/human-eval)
HumanEval: Hand-Written Evaluation Set
This is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained on Code".
Installation
Make sure to use python 3.7 or later:
```
$ conda create -n codex python=3.7
$ conda activate codex
```
Check out and install this repository:
```
$ git clone https://github.com/openai/human-eval
$ pip install -e human-eval
```
Usage
This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.
After following the above instructions to enable execution, generate samples and save them in the following JSON Lines (jsonl) format, where each sample is formatted into a single line like so:
{"task_id": "Corresponding HumanEval task ID", "completion": "Completion only without the prompt"}
We provide example_problem.jsonl and example_solutions.jsonl under data to illustrate the format and help with debugging.
Here is nearly functional example code (you just have to provide generate_one_completion to make it work) that saves generated completions to samples.jsonl.
```python
from human_eval.data import write_jsonl, read_problems

problems = read_problems()

num_samples_per_task = 200
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
```
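The harness leaves `generate_one_completion` entirely up to you; one hypothetical stub using the Hugging Face `transformers` pipeline (purely an assumption, any generation backend works) could be:

```python
from transformers import pipeline

# Placeholder model; substitute whatever code model you are actually evaluating.
generator = pipeline("text-generation", model="gpt2")

def generate_one_completion(prompt: str) -> str:
    """Return only the completion, with the prompt stripped off, as the harness expects."""
    out = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.8)
    return out[0]["generated_text"][len(prompt):]
```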
To evaluate the samples, run
```
$ evaluate_functional_correctness samples.jsonl
Reading samples...
32800it [00:01, 23787.50it/s]
Running test suites...
100%|...| 32800/32800 [16:11<00:00, 33.76it/s]
Writing results to samples.jsonl_results.jsonl...
100%|...| 32800/32800 [00:00<00:00, 42876.84it/s]
{'pass@1': ..., 'pass@10': ..., 'pass@100': ...}
```
This script provides more fine-grained information in a new file ending in `_results.jsonl`. Each row now contains whether the completion passed, along with the execution result, which is one of "passed", "timed out", or "failed".
As a quick sanity-check, the example samples should yield 0.5 pass@1.
```
$ evaluate_functional_correctness data/example_samples.jsonl --problem_file=data/example_problem.jsonl
Reading samples...
6it [00:00, 3397.11it/s]
Running example suites...
100%|...| 6/6 [00:03<00:00, 1.96it/s]
Writing results to data/example_samples.jsonl_results.jsonl...
100%|...| 6/6 [00:00<00:00, 6148.50it/s]
{'pass@1': 0.4999999999999999}
```
Because there is no unbiased way of estimating pass@k when there are fewer samples than k, the script does not evaluate pass@k for these cases. To evaluate with other k values, pass --k=. For other options, see
```
$ evaluate_functional_correctness --help
```
However, we recommend that you use the default values for the rest.
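For reference, the unbiased per-problem estimator described in the paper is 1 - C(n-c, k)/C(n, k); a small standalone sketch (not the repo's own code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```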
Known Issues
While evaluation uses very little memory, you might see the following error message when the system is running out of RAM. Since this may cause some correct programs to fail, we recommend that you free some memory and try again.
```
malloc: can't allocate region
```
Citation
Please cite using the following bibtex entry:
```
@article{chen2021codex,
  title={Evaluating Large Language Models Trained on Code},
  author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser and Mohammad Bavarian and Clemens Winter and Philippe Tillet and Felipe Petroski Such and Dave Cummings and Matthias Plappert and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain and William Saunders and Christopher Hesse and Andrew N. Carr and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
  year={2021},
  eprint={2107.03374},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```
#### Suggested labels
#### { "key": "llm-evaluation", "value": "Evaluating Large Language Models performance and behavior through human-written evaluation sets" }
### #134: marker: Convert PDF to markdown quickly with high accuracy
### Details (Similarity score: 0.86)
- [ ] [https://github.com/VikParuchuri/marker#readme](https://github.com/VikParuchuri/marker#readme)
## Marker
Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.
- Support for a range of PDF documents (optimized for books and scientific papers)
- Removes headers/footers/other artifacts
- Converts most equations to latex
- Formats code blocks and tables
- Support for multiple languages (although most testing is done in English). See `settings.py` for a language list.
- Works on GPU, CPU, or MPS
## How it works
Marker is a pipeline of deep learning models:
- Extract text, OCR if necessary (heuristics, tesseract)
- Detect page layout ([layout segmenter](https://huggingface.co/vikp/layout_segmenter), [column detector](https://huggingface.co/vikp/column_detector))
- Clean and format each block (heuristics, [nougat](https://huggingface.co/facebook/nougat-base))
- Combine blocks and postprocess complete text (heuristics, [pdf_postprocessor](https://huggingface.co/vikp/pdf_postprocessor_t5))
Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper: `We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents.` In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages.
Nougat is an amazing model, but I wanted a faster and more general purpose solution. Marker is 10x faster and has low hallucination risk because it only passes equation blocks through an LLM forward pass.
## Examples
| PDF | Type | Marker | Nougat |
|-----------------------------------------------------------------------|-------------|--------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
| [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/thinkpython.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/thinkpython.md) |
| [Think OS](https://greenteapress.com/thinkos/thinkos.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/thinkos.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/thinkos.md) |
| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/switch_transformers.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/switch_transformers.md) |
| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/multicolcnn.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/multicolcnn.md) |
## Performance
![Benchmark overall](data/images/overall.png)
The above results are with marker and nougat setup so they each take ~3GB of VRAM on an A6000.
See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
# Limitations
PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
- Marker will convert fewer equations to latex than nougat. This is because it has to first detect equations, then convert them without hallucination.
- Whitespace and indentations are not always respected.
- Not all lines/spans will be joined properly.
- Only languages similar to English (Spanish, French, German, Russian, etc) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc) are not.
- This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.
# Installation
This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and [poetry](https://python-poetry.org/docs/#installing-with-the-official-installer).
First, clone the repo:
- `git clone https://github.com/VikParuchuri/marker.git`
- `cd marker`
## Linux
- Install system requirements
- Optional: Install tesseract 5 by following [these instructions](https://notesalexp.org/tesseract-ocr/html/) or running `scripts/install/tesseract_5_install.sh`.
- Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `scripts/install/ghostscript_install.sh`.
- Install other requirements with `cat scripts/install/apt-requirements.txt | xargs sudo apt-get install -y`
- Set the tesseract data folder path
- Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple.
- Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
- Install python requirements
- `poetry install`
- `poetry shell` to activate your poetry venv
- Update pytorch since poetry doesn't play nicely with it
- GPU only: run `pip install torch` to install other torch dependencies.
- CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.
## Mac
- Install system requirements from `scripts/install/brew-requirements.txt`
- Set the tesseract data folder path
- Find the tesseract data folder `tessdata` with `brew list tesseract`
- Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
- Install python requirements
- `poetry install`
- `poetry shell` to activate your poetry venv
# Usage
First, some configuration:
- Set your torch device in the `local.env` file. For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. `cpu` is the default.
- If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
- Depending on your document types, marker's average memory usage per task can vary slightly. You can configure `VRAM_PER_TASK` to adjust this if you notice tasks failing with GPU out of memory errors.
- Inspect the other settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.
- By default, the final editor model is off. Turn it on with `ENABLE_EDITOR_MODEL`.
- By default, marker will use ocrmypdf for OCR, which is slower than base tesseract, but higher quality. You can change this with the `OCR_ENGINE` setting.
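For example, a `local.env` pulling together the settings mentioned above might look like this (values are illustrative; check `marker/settings.py` for the authoritative names and defaults):

```
TORCH_DEVICE=cuda
INFERENCE_RAM=16
TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata
DEFAULT_LANG=English
OCR_ENGINE=ocrmypdf
```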
## Convert a single file
Run `convert_single.py`, like this:
```
python convert_single.py /path/to/file.pdf /path/to/output.md --parallel_factor 2 --max_pages 10
```
- `--parallel_factor` is how much to increase batch size and parallel OCR workers by. Higher numbers will take more VRAM and CPU, but process faster. Set to 1 by default.
- `--max_pages` is the maximum number of pages to process. Omit this to convert the entire document.
Make sure the `DEFAULT_LANG` setting is set appropriately for your document.
## Convert multiple files
Run `convert.py`, like this:
```
python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000
```
- `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK` if you're using GPU.
- `--max` is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
- `--metadata_file` is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. If not, `DEFAULT_LANG` will be used. The format is:
```
{
  "pdf1.pdf": {"language": "English"},
  "pdf2.pdf": {"language": "Spanish"},
  ...
}
```
- `--min_length` is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)
## Convert multiple files on multiple GPUs
Run `chunk_convert.sh`, like this:
```
MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bash chunk_convert.sh ../pdf_in ../md_out
```
- `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
- `NUM_DEVICES` is the number of GPUs to use. Should be `2` or greater.
- `NUM_WORKERS` is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK`.
- `MIN_LENGTH` is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)
# Benchmarks
Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods.
Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.
**Speed**
| Method | Average Score | Time per page | Time per document |
|--------|---------------|---------------|-------------------|
| naive | 0.350727 | 0.00152378 | 0.326524 |
| marker | 0.641062 | 0.360622 | 77.2762 |
| nougat | 0.629211 | 3.77259 | 808.413 |
**Accuracy**
First 3 are non-arXiv books, last 3 are arXiv papers.
| Method | switch_trans.pdf | crowd.pdf | multicolcnn.pdf | thinkos.pdf | thinkdsp.pdf | thinkpython.pdf |
|--------|------------------|-----------|-----------------|-------------|--------------|-----------------|
| naive | 0.244114 | 0.140669 | 0.0868221 | 0.366856 | 0.412521 | 0.468281 |
| marker | 0.482091 | 0.466882 | 0.537062 | 0.754347 | 0.78825 | 0.779536 |
| nougat | 0.696458 | 0.552337 | 0.735099 | 0.655002 | 0.645704 | 0.650282 |
Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.1GB` for marker. Benchmarks were run on an A6000.
**Throughput**
Marker takes about 2GB of VRAM on average per task, so you can convert 24 documents in parallel on an A6000.
![Benchmark results](data/images/per_doc.png)
## Running your own benchmarks
You can benchmark the performance of marker on your machine. First, download the benchmark data [here](https://drive.google.com/file/d/1WiN4K2-jQfwyQMe4wSSurbpz3hxo2fG9/view?usp=drive_link) and unzip.
Then run `benchmark.py` like this:
```
python benchmark.py data/pdfs data/references report.json --nougat
```
This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.
Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.
# Commercial usage
Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.
I'm building a version that can be used commercially, by stripping out the dependencies below. If you would like to get early access, email me at marker@vikas.sh.
Here are the non-commercial/restrictive dependencies:
- LayoutLMv3: CC BY-NC-SA 4.0 . [Source](https://huggingface.co/microsoft/layoutlmv3-base)
- Nougat: CC-BY-NC . [Source](https://github.com/facebookresearch/nougat)
- PyMuPDF - GPL . [Source](https://pymupdf.readthedocs.io/en/latest/about.html#license-and-copyright)
Other dependencies/datasets are openly licensed (doclaynet, byt5), or used in a way that is compatible with commercial usage (ghostscript).
# Thanks
This work would not have been possible without amazing open source models and datasets, including (but not limited to):
- Nougat from Meta
- Layoutlmv3 from Microsoft
- DocLayNet from IBM
- ByT5 from Google
Thank you to the authors of these models and datasets for making them available to the community!
### #369: "You are a helpful AI assistant" : r/LocalLLaMA
### Details (Similarity score: 0.86)
- [ ] ["You are a helpful AI assistant" : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/18j59g1/you_are_a_helpful_ai_assistant/?share_id=g_M0-7C_zvS88BCd6M_sI&utm_content=1&utm_medium=android_app&utm_name=androidcss&utm_source=share&utm_term=1)
"You are a helpful AI assistant"
Discussion
I've been stumbling around this sub for awhile, testing all the small models and preaching the good word of the omnipotent OpenHermes. Here's some system prompt tips I've picked up:
Don't say "don't": this confuses them, which makes sense when you understand how they "think". They do their best to string concepts together, but they simply generate the next word in the sequence from the context available. Saying "don't" will put everything following that word into the equation for the following words. This can cause it to use the words and concepts you're telling it not to.
Alternative: try to use "Only" statements. Instead of "Don't talk about any other baseball team besides the New York Yankees" say "Only talk about the New York Yankees".
CAPITALIZING INSTRUCTIONS: For some reason, this works when used sparingly; it even makes some models pay attention to "don't". Surprisingly, this seems to work even with ChatGPT. It can quickly devolve your system prompt into confused yelling if you don't limit it, and can even cause your model to match the format and respond with confused yelling, so use it only once or twice on important concepts.
\n: A well-formatted system prompt goes a long way. Splitting up different sections with a line break makes a noticeable improvement in the model's comprehension of the system prompt. For example, here is my format for LMStudio:
" Here is some information about the user: (My bio)
(system prompts)
Here is some context for the conversation: (Paste in relevant info such as web pages, documentation, etc., as well as bits of the convo you want to keep in context. When you hit the context limit, you can restart the chat and continue with the same context)."
"You are a helpful AI assistant" : this is the demo system prompt to just get agreeable answers from any model. The issue with this is, once again, how they "think". The models can't conceptualize what is helpful beyond agreeing with and encouraging you. This kind of statement can lead to them making up data and concepts in order to agree with you. This is extra fun because you may not realize the problem until you discover for yourself the falacy of your own logic.
Think it through/Go over your work: This works, but I think it works because it directs attention to the prompt and response. Personally, I think there's better ways to do this.
Role assignment: telling it to act as this character or in that role is obviously necessary in some or even most instances, but this can also be limiting. It will act as that character, with all the limits and fallacies of that character. If your waifu can't code, neither will your AI.
Telling it to be confident: This is a great way to circumvent the above problem, but also runs the risk of confident hallucinations. Here's a 2 prompt trick I use:
Tell one assistant not to answer the user prompt, but to simply generate a list of facts, libraries, or research points from its own data that could help answer the prompt. The prompt will be answered by the same LLM, so have it write the list with that LLM as the intended audience rather than a human.
Then pass the list to the assistant you intend to chat with, with something like "you can confidently answer in these subjects that you are an expert in: (the list)".
The point of this ^ is to limit its responses to what it actually knows, but have it answer confidently with the information it's sure about. This has been incredibly useful in my cases, but absolutely check their work.
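A rough sketch of that two-pass flow, with a placeholder `generate()` standing in for whatever chat backend you use (nothing here is a real API):

```python
def generate(system_prompt: str, user_prompt: str) -> str:
    # Placeholder: wire this up to your local model (LM Studio, llama.cpp server, etc.).
    # Returning a canned string keeps the sketch runnable without a backend.
    return "(model output would go here)"

question = "How do I parse tables out of a PDF in Python?"

# Pass 1: ask for a list of facts/libraries instead of an answer.
list_system = (
    "Only generate a list of facts, libraries, and research points from your own "
    "knowledge that would help answer the user's question. The list will be read "
    "by the same LLM, not a human. Do not answer the question."
)
fact_list = generate(list_system, question)

# Pass 2: chat with an assistant constrained to that list.
chat_system = "You can confidently answer in these subjects that you are an expert in:\n" + fact_list
answer = generate(chat_system, question)
print(answer)
```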
#### Suggested labels
#### { "key": "sparse-computation", "value": "Optimizing large language models using sparse computation techniques" }
### #499: marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.
### Details (Similarity score: 0.85)
- [ ] [marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.](https://github.com/marella/ctransformers?tab=readme-ov-file#gptq)
# CTransformers
[![PyPI version](https://badge.fury.io/py/ctransformers.svg)](https://badge.fury.io/py/ctransformers)
[![Documentation](https://readthedocs.org/images/button/readthedocs-ci.svg)](https://ctransformers.readthedocs.io/)
[![Build and Test](https://github.com/marella/ctransformers/actions/workflows/build.yml/badge.svg)](https://github.com/marella/ctransformers/actions/workflows/build.yml)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
Python bindings for the Transformer models implemented in C/C++ using GGML library. Also see [ChatDocs](https://github.com/marella/chatdocs)
## Supported Models
| Model | Model Type | CUDA | Metal |
| ------ | --------- | :--: | :--: |
| GPT-2 | gpt2 | | |
| GPT-J, GPT4All-J | gptj | | |
| GPT-NeoX, StableLM | gpt_neox | | |
| Falcon | falcon | ✅ | |
| LLaMA, LLaMA 2 | llama | ✅ | ✅ |
| MPT | mpt | ✅ | |
| StarCoder, StarChat | gpt_bigcode | ✅ | |
| Dolly V2 | dolly-v2 | | |
| Replit | replit | | |
## Installation
To install via `pip`, simply run:
```
pip install ctransformers
```
## Usage
It provides a unified interface for all models:
```python
from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")
print(llm("AI is going to"))
```
Run in Google Colab
To stream the output:
```python
for text in llm("AI is going to", stream=True):
print(text, end="", flush=True)
```
You can load models from Hugging Face Hub directly:
```python
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")
```
If a model repo has multiple model files (`.bin` or `.gguf` files), specify a model file using:
```python
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")
```
### 🤗 Transformers
Note: This is an experimental feature and may change in the future.
To use with 🤗 Transformers, create the model and tokenizer using:
```python
from ctransformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)
```
Run in Google Colab
You can use 🤗 Transformers text generation pipeline:
```python
from transformers import pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))
```
You can use 🤗 Transformers generation parameters:
```python
pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)
```
You can use 🤗 Transformers tokenizers:
```python
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True) # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2") # Load tokenizer from original model repo.
```
### LangChain
It is integrated into LangChain. See LangChain [docs](https://github.com/LangChainAI/langchain#using-ctransformers-backed-models).
### GPU
To run some of the model layers on GPU, set the `gpu_layers` parameter:
```python
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)
```
Run in Google Colab
#### CUDA
Install CUDA libraries using:
```bash
pip install ctransformers[cuda]
```
#### ROCm
To enable ROCm support, install the `ctransformers` package using:
```bash
CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers
```
#### Metal
To enable Metal support, install the `ctransformers` package using:
```bash
CT_METAL=1 pip install ctransformers --no-binary ctransformers
```
### GPTQ
Note: This is an experimental feature and only LLaMA models are supported using [ExLlama](https://github.com/TheLastBen/exllama).
Install additional dependencies using:
```bash
pip install ctransformers[gptq]
```
Load a GPTQ model using:
```python
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
```
Run in Google Colab
If the model name or path doesn't contain the word `gptq`, specify `model_type="gptq"`.
It can also be used with LangChain. Low-level APIs are not fully supported.
## Documentation
Find the documentation on [Read the Docs](https://ctransformers.readthedocs.io/).
#### Config
| Parameter | Type | Description | Default |
| --------- | ------ | ------------------------------------------------------------ | ------- |
| `top_k` | `int` | The top-k value to use for sampling | `40` |
| `top_p` | `float` | The top-p value to use for sampling | `0.95` |
| `temperature` | `float` | The temperature to use for sampling | `0.8` |
| `repetition_penalty` | `float` | The repetition penalty to use for sampling | `1.1` |
| `last_n_tokens` | `int` | The number of last tokens to use for repetition penalty | `64` |
| `seed` | `int` | The seed value to use for sampling tokens | `-1` |
| `max_new_tokens` | `int` | The maximum number of new tokens to generate | `256` |
| `stop` | `List` | A list of sequences to stop generation when encountered | `None` |
| `stream` | `bool` | Whether to stream the generated text | `False` |
| `reset` | `bool` | Whether to reset the model state before generating text | `True` |
| `batch_size` | `int` | The batch size to use for evaluating tokens in a single prompt | `8` |
| `threads` | `int` | The number of threads to use for evaluating tokens | `-1` |
| `context_length` | `int` | The maximum context length to use | `-1` |
| `gpu_layers` | `int` | The number of layers to run on GPU | `0` |
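Assuming these config values are accepted as keyword arguments (as the `gpu_layers` and `stream` examples above suggest), usage might look like:

```python
from ctransformers import AutoModelForCausalLM

# Example model from above; config values here are illustrative.
llm = AutoModelForCausalLM.from_pretrained(
    "marella/gpt-2-ggml",
    context_length=2048,
    threads=8,
)
print(llm("AI is going to", max_new_tokens=128, temperature=0.7, top_p=0.95))
```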
---
Made with ❤️ by [marella](https://github.com/marella)
#### Suggested labels
#### null
### #626: classifiers/README.md at main · blockentropy/classifiers
### Details (Similarity score: 0.85)
- [ ] [classifiers/README.md at main · blockentropy/classifiers](https://github.com/blockentropy/classifiers/blob/main/README.md?plain=1)
# classifiers/README.md
## Fast Classifiers for Prompt Routing
Routing and controlling the information flow is a core component in optimizing machine learning tasks. While some architectures focus on internal routing of data within a model, we focus on the external routing of data between models. This enables the combination of open source, proprietary, API based, and software based approaches to work together behind a smart router. We investigate three different ways of externally routing the prompt - cosine similarity via embeddings, zero-shot classification, and small classifiers.
## Implementation of Fast Classifiers
The `code-class.ipynb` Jupyter notebook walks through the process of creating a fast prompt classifier for smart routing. For the fast classifiers, we utilize the model [DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert), a smaller language representation model designed for efficient on-the-edge operation and training under computational constraints. DistilBERT is not only less costly to pre-train but also well-suited for on-device computations, as demonstrated through experiments and comparative studies.
We quantize the model using [Optimum](https://huggingface.co/docs/optimum/index), enabling the model to run extremely fast on a CPU router. Each classifier takes 5-8ms to run. An ensemble of 8 prompt classifiers takes about 50ms in total. Thus, each endpoint can route about 20 requests per second.
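A minimal sketch of running such a classifier for routing with the `transformers` pipeline; the checkpoint below is only a stand-in for the fine-tuned, quantized model produced by `code-class.ipynb`:

```python
from transformers import pipeline

# Stand-in checkpoint; replace with the quantized code/not-code classifier from the notebook.
router = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

prompt = "write a bubble sort in python"
pred = router(prompt)[0]  # e.g. {"label": "...", "score": 0.97}
print(pred["label"], round(pred["score"], 3))
```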
In the example `code-class`, we are deciding between prompts of code and not code prompts. The two datasets used are the 52K [instruction-following data](https://arxiv.org/abs/2304.03277) generated by GPT-4 with prompts in Alpaca. And the 20K instruction-following data used for fine-tuning the [Code Alpaca](https://github.com/sahil280114/codealpaca) model.
Train test split of 80/20 yields an accuracy of 95.49% and f1 score of 0.9227.
![Train Test](./traintest.png)
## Comparison vs other Routing methods
The most popular alternative to routing is via embedding similarity. For example, if one were to try to route a programming question, one might set up the set of target classes as ["coding", "not coding"]. Each one of these strings is then transformed into an embedding and compared against a prompt query like, "write a bubble sort in python". Given the computed pair-wise cosine similarity between the query and class, we can then label the prompt as a coding question and route the prompt to a coding-specific model. These do not scale well with larger numbers of embeddings. Nor are they able to capture non-semantic type classes (like is the response likely to be more or less than 200 tokens). However, they are adaptable and comparably fast and thus provide a good alternative to the trained fast classifiers.
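A sketch of that embedding-similarity baseline, using `sentence-transformers` purely as an example embedding backend (model choice is an assumption):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works; this is an example

classes = ["coding", "not coding"]
class_emb = model.encode(classes, convert_to_tensor=True)

query = "write a bubble sort in python"
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, class_emb)[0]   # pairwise cosine similarity against each class
print(classes[int(scores.argmax())], [round(float(s), 3) for s in scores])
```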
![Train Test](./graphs.png)
Quantifying different methods of routing in terms of execution time. As the prompt size increases, the query time also increases as shown in (a). There is also a close to linear increase in the time as the number of classes increase as shown in (b). However, the small classifiers do not increase in time as the class examples increase in the number of tokens (c). This is due to the upfront cost of training the binary classifier, reducing cost at inference.
## Reproducibility
The `timing_tests.js` and `complexity.js` files can be used for reproducibility. Note that only the code classifier is currently available in this repo. One will need to install the appropriate models from the [Transformers.js](https://huggingface.co/docs/transformers.js/en/index) repo.
[View on GitHub](https://github.com/blockentropy/classifiers/blob/main/README.md?plain=1)
#### Suggested labels
#### {'label-name': 'Prompt-Routing', 'label-description': 'Focuses on external routing of data between models to optimize machine learning tasks.', 'confidence': 50.24}