guinmoon / LLMFarm

Run llama and other large language models offline on iOS and macOS using the GGML library.
https://llmfarm.site
MIT License

Request: Optimize GGUF Models for Apple Neural Engine #18

Closed antmikinka closed 5 months ago

antmikinka commented 7 months ago

To help with computation time on iOS devices, the models should run on the Apple Neural Engine. Combined with ANE computation and model chunking, implementing dynamic embeddings from EELBERT: Tiny Models through Dynamic Embeddings would really optimize models for these devices. (It can make the model up to 15x smaller while staying within 4% of the non-optimized model's GLUE benchmark.)
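To sketch the core idea behind EELBERT-style dynamic embeddings: instead of storing a large lookup table, each token's embedding is computed on the fly from hashed character n-grams. The function below is only an illustrative toy (the hashing scheme, dimensions, and pooling are my assumptions, not the paper's exact method):

```python
import hashlib
import numpy as np

def dynamic_embedding(token: str, dim: int = 64, n: int = 3, seed: int = 0) -> np.ndarray:
    """Compute a token embedding on the fly from hashed character n-grams,
    rather than looking it up in a stored embedding table. Illustrative
    sketch of the dynamic-embedding idea, not EELBERT's exact algorithm."""
    padded = f"#{token}#"            # boundary markers so short tokens still yield n-grams
    vec = np.zeros(dim)
    count = 0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        # Hash the n-gram deterministically, then expand it into a
        # pseudo-random vector seeded by that hash.
        h = hashlib.sha256(f"{seed}:{gram}".encode()).digest()
        rng = np.random.default_rng(int.from_bytes(h[:8], "little"))
        vec += rng.standard_normal(dim)
        count += 1
    return vec / max(count, 1)       # mean-pool over the token's n-grams
```

The storage saving comes from replacing an O(vocab × dim) table with a hash function whose cost is a few digests per token.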

Another developer has a great ANE project; you two should try to collaborate: More Neural Engine Transformers.

I am not sure if this is possible in Swift, but Core ML offers multiple types of weight compression. Combining this with the above technologies could really get large 34B+ parameter models onto mobile devices.
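One form of weight compression Core ML supports is palettization: clustering weights into a small codebook and storing per-weight indices. The toy k-means sketch below shows the principle only; it is not the Core ML / coremltools API, and the cluster count and init scheme are my assumptions:

```python
import numpy as np

def palettize(weights: np.ndarray, n_clusters: int = 16, iters: int = 20):
    """Toy 4-bit palettization: cluster weights into a 16-entry codebook and
    store a uint8 index per weight. Illustrative sketch of the principle,
    not Core ML's actual weight-compression implementation."""
    flat = weights.ravel().astype(np.float64)
    # Initialize centroids at evenly spaced quantiles of the weight distribution.
    centroids = np.quantile(flat, np.linspace(0.0, 1.0, n_clusters))
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            members = flat[idx == k]
            if members.size:
                centroids[k] = members.mean()   # standard k-means update
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, idx.reshape(weights.shape).astype(np.uint8)

def depalettize(centroids: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights from codebook + indices."""
    return centroids[idx]
```

With 16 clusters each weight index fits in 4 bits, so a float16 weight matrix shrinks roughly 4x (plus a tiny codebook), at the cost of quantization error.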

antmikinka commented 7 months ago

These are the latest research papers from Apple. I would say not many people are looking into them. I am not trying to spam or be annoying; I would love to see AI become more available to people with less computational resources, such as myself.

I would love to help with LLMFarm and implement these.

Below are additional technologies that could be implemented in LLMFarm to take a GGUF model and convert it into an optimized model:

ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models "Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications. However, their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices. Despite recent trends favoring alternative activation functions such as GELU or SiLU, known for increased computation, this study strongly advocates for reinstating ReLU activation in LLMs. We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer. This reduction is particularly valuable during the memory-bound inference step, where efficiency is paramount. Exploring sparsity patterns in ReLU-based LLMs, we unveil the reutilization of activated neurons for generating new tokens and leveraging these insights we propose practical strategies to substantially reduce LLM inference computation up to three times, using ReLU activations with minimal performance trade-offs."
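The computational saving the ReLU paper describes can be sketched concretely: after a ReLU, many hidden activations are exactly zero, so the second matrix multiply of a feed-forward block only needs the rows of the output weights corresponding to active neurons. A minimal NumPy illustration of that idea (shapes and names are my own, not the paper's):

```python
import numpy as np

def relu_ffn_dense(x, W1, b1, W2):
    """Standard two-layer FFN with ReLU, computing every hidden unit."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2

def relu_ffn_sparse(x, W1, b1, W2):
    """Same FFN, but the second matmul skips hidden units that ReLU zeroed
    out: only the rows of W2 for active neurons are read and multiplied.
    This is the activation-sparsity saving in memory-bound inference."""
    h = np.maximum(x @ W1 + b1, 0.0)
    active = np.nonzero(h)[0]          # indices of nonzero hidden units
    return h[active] @ W2[active]      # zero rows contribute nothing, so skip them
```

With GELU or SiLU almost no activations are exactly zero, so this skip is impossible; that is the paper's argument for reinstating ReLU.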

PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model "Autoregressive models for text sometimes generate repetitive and low-quality output because errors accumulate during the steps of generation. This issue is often attributed to exposure bias - the difference between how a model is trained and how it is used during inference. Denoising diffusion models provide an alternative approach in which a model can revisit and revise its output. However, they can be computationally expensive, and prior efforts on text have led to models that produce less fluent output compared to autoregressive models, especially for longer text and paragraphs. In this paper, we propose PLANNER, a model that combines latent semantic diffusion with autoregressive generation, to generate fluent text while exercising global control over paragraphs. The model achieves this by combining an autoregressive "decoding" module with a "planning" module that uses latent diffusion to generate semantic paragraph embeddings in a coarse-to-fine manner. The proposed method is evaluated on various conditional generation tasks, and results on semantic generation, text completion, and summarization show its effectiveness in generating high-quality long-form text in an efficient manner."

HyperDiffusion: Generating Implicit Neural Fields with Weight-Space Diffusion "Implicit neural fields, typically encoded by a multilayer perceptron (MLP) that maps from coordinates (e.g., xyz) to signals (e.g., signed distances), have shown remarkable promise as a high-fidelity and compact representation. However, the lack of a regular and explicit grid structure also makes it challenging to apply generative modeling directly on implicit neural fields in order to synthesize new data. To this end, we propose HyperDiffusion, a novel approach for unconditional generative modeling of implicit neural fields. HyperDiffusion operates directly on MLP weights and generates new neural implicit fields encoded by synthesized MLP parameters. Specifically, a collection of MLPs is first optimized to faithfully represent individual data samples. Subsequently, a diffusion process is trained in this MLP weight space to model the underlying distribution of neural implicit fields. HyperDiffusion enables diffusion modeling over a implicit, compact, and yet high-fidelity representation of complex signals across 3D shapes and 4D mesh animations within one single unified framework."

ShawnFumo commented 7 months ago

I could be wrong, but I think this project is mostly a wrapper around the ggml projects like llama.cpp. I think if those papers were implemented over there, it'd be much more likely to make it into apps like this.