
The N1 Codec

An implementation of 24+ lossless compression algorithms, from dictionary-based codecs to specialized audio codecs, tailored for Neuralink's N1 Implant and achieving compression ratios of up to 3.11x.

The Neuralink Compression Challenge is a challenge announced by Bliss Chapman aimed at drastically reducing the data bandwidth requirements for the N1 implant without losing a single sample. This implant, situated in the motor cortex of a non-human primate, generates around 200 Mbps of data from 1024 electrodes, each sampling at 20 kHz with 10-bit resolution.

To transmit this data wirelessly in real-time, a compression ratio greater than 200x is necessary, all while operating under stringent conditions of less than 1 ms latency and power consumption below 10 mW.
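For concreteness, those figures work out to:

$$ 1024 \ \text{electrodes} \times 20\,000 \ \tfrac{\text{samples}}{\text{s}} \times 10 \ \text{bits} \approx 204.8 \ \text{Mbps}, \qquad \frac{204.8 \ \text{Mbps}}{200} \approx 1 \ \text{Mbps} $$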

The goal of this project was to get a systematic sense of how effectively well-established lossless compression techniques can compress this unique type of data.

[!IMPORTANT]

These algorithms were categorized into 5 groups based on their core methodologies and performance characteristics:

  1. Dictionary-Based Codecs (zstd, lz4, lz4hc, gzip, zlib, bz2 etc.)
  2. Predictive Codecs (delta etc.)
  3. Entropy Codecs (huffman, arithmetic etc.)
  4. Specialized Audio Codecs (wavpack etc.)
  5. Blosc Hybrid Codecs

Despite the common application of these algorithms in various data types, little work has been done to investigate how these compression techniques perform specifically on electrophysiology data.

Holistic Overview of Algorithms Deployed:

24+ lossless compression algorithms were tested & benchmarked against each other.

[Figure: Compression ratios scatter plot]

Table of Contents

Build Instructions

Repository Structure

How to Run

Setup Instructions

To install all dependencies, simply run:

make install

[!NOTE]

For WavPack:

  1. Install the WavPack library:
     a) brew install wavpack (Mac)
     b) sudo apt-get install wavpack (Debian-based systems)
  2. Go to the repo's home directory & cd /n1-codec/misc/wavpack-numcodecs
  3. Install the directory: pip install .

Execution Commands

To Evaluate a Single Algorithm's Compression:

  1. Ensure that mode = <insert algo mode here> is set in both encode.py and decode.py
  2. Run make encode to make the encode executable
  3. Run make decode to make the decode executable
  4. Run the eval.sh bash script by running make eval

To Evaluate a Series of Algorithms' Compression:

  1. Ensure that mode = sys.argv[3] in both encode.py and decode.py (see the sketch after this list)
  2. Run make encode to make the encode executable
  3. Run make decode to make the decode executable
  4. Run the grid_results.sh bash script by running make grid
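For reference, here is a minimal sketch of how the mode switch might be wired inside encode.py; the actual argument order, codec registry, and file handling in this repo may differ.

```python
# encode.py (illustrative sketch only; names and argument order are assumptions)
import sys
import bz2
import zlib

# Codec registry keyed by mode string; the real repo supports 24+ modes.
CODECS = {
    "bz2": lambda raw: bz2.compress(raw, compresslevel=9),
    "zlib": lambda raw: zlib.compress(raw, level=9),
}

# Single-algorithm runs hard-code the mode; grid runs read it from sys.argv[3].
mode = sys.argv[3] if len(sys.argv) > 3 else "bz2"

if __name__ == "__main__":
    with open(sys.argv[1], "rb") as f:      # raw input recording
        raw = f.read()
    with open(sys.argv[2], "wb") as f:      # compressed output
        f.write(CODECS[mode](raw))
```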

Data Analysis

The fundamental characteristics of this dataset are as follows:

This dataset presents substantial challenges due to its size and complexity, necessitating sophisticated compression strategies to reduce bandwidth effectively while preserving the integrity and quality of the neural signals.

Desired Compression Ratio of 200x ....

The Neuralink Compression Challenge set a formidable goal of achieving a 200x lossless compression ratio. This target, while theoretically appealing, presents substantial practical challenges given the inherent noise and complexity of electrophysiological data.

In an article a friend and I wrote last year reviewing all of Neuralink's patents and papers, we took a deep dive into the techniques Neuralink has deployed to extract the most important features from their data.

Here's a long but insightful excerpt on the (lossy) compression tactics that Neuralink has deployed:

Compression strategies predominantly involve applying thresholds to detect spikes in a specific range, summary statistics like channel-wise averages, and/or event-based triggers off-chip. Alternatively, information-theoretic lossless compression techniques like PNG, TIFF, or ZIP may be used. In some examples, the reduction in bandwidth from the compression engine can exceed 1,000 times fewer data.

These thresholds may be set on the voltage of the signal or the frequency of the signal. Low-frequency and high-frequency signals may not be valuable to the recorder and can be filtered out by the compression engine. Non-spike signals are discarded, essentially reducing the size of the data packets, and compressing the signal. For voltage-based thresholds, a technique called non-linear energy operator (NEO) may be used to automatically find a threshold that accurately detects spikes.

Briefly reviewing NEO, it essentially filters the signals for the periods at which there are fast frequency and amplitude changes of spikes, which can be seen as short peaks in the NEO filtered output.

$$ \psi[x(n)] = x(n)^2 - x(n-1) \cdot x(n+1) $$

The NEO of a signal x(n), denoted ψ[x(n)], can be computed as shown above. It simply compares the signal at time step n against the signal at time steps n-1 and n+1.

$$ \text{Thr} = C \times \frac{1}{N} \sum_{n=1}^{N} \psi[x(n)] $$

Furthermore, a threshold for NEO detection can be calculated as the mean of the NEO filtered output multiplied by a factor C. In this equation, N is the number of samples of the signal. C is found empirically and should be tested on several neural datasets beforehand to achieve the best results.
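A minimal NumPy sketch of the NEO filter and the threshold above; the signal and the choice of C here are purely illustrative.

```python
import numpy as np

def neo(x: np.ndarray) -> np.ndarray:
    """psi[x(n)] = x(n)^2 - x(n-1) * x(n+1), computed for the interior samples."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def neo_threshold(x: np.ndarray, C: float = 8.0) -> float:
    """Thr = C * mean(psi[x(n)]); C is found empirically per dataset."""
    return C * float(np.mean(neo(x)))

# Illustrative usage on synthetic data (not the challenge recordings).
rng = np.random.default_rng(0)
signal = rng.normal(size=20_000)                 # stand-in for one 20 kHz channel
crossings = np.flatnonzero(neo(signal) > neo_threshold(signal))
print(f"{crossings.size} samples exceed the NEO threshold")
```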

Both the compression engine and controller play a crucial role in throttling the amount of data being generated by each chip. Throttling allows for power and performance efficiency improvements for the N1 system.

Alternatively, during the Neuralink launch event, DJ Seo introduced a novel on-chip spike detection algorithm that involved directly characterizing the shape of a spike. This method is able to compress neural data by more than 200x and only takes 900 nanoseconds to compute, which is faster than the time it takes for the brain to realize it happened. This technique even allows for identifying different neurons from the same electrode based on shape.

Dataset Visualization

1/ the data is inconsistent & noisy

The global amplitude statistics for the WAV files are as follows:

The data exhibits substantial variability, highlighted by a large standard deviation indicating wide fluctuations in signal amplitude, and a leptokurtic distribution with a high kurtosis value suggesting data points are densely clustered around the mean with frequent extreme values. Despite a skewness value near zero indicating symmetry, the mode significantly diverges from the mean and median, underscoring the presence of notable outliers. This combination of statistics suggests a highly variable dataset with a complex, outlier-influenced structure.
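For reference, statistics of this kind can be pulled from a recording with a short script along these lines; the file path is a placeholder and the exact figures depend on the challenge data.

```python
import numpy as np
from scipy.io import wavfile
from scipy.stats import kurtosis, skew

_, samples = wavfile.read("path/to/recording.wav")   # placeholder path
x = samples.astype(np.float64)

print("mean    :", x.mean())
print("median  :", np.median(x))
print("std     :", x.std())                          # large -> wide amplitude swings
print("kurtosis:", kurtosis(x))                      # high -> leptokurtic, heavy tails
print("skewness:", skew(x))                          # near zero -> roughly symmetric
```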

2/ mid-range spectral entropy of the data -- yet also extremely variable across files


The spectral entropy of the electrophysiology data averages at 4.88, indicating a moderately complex and unpredictable spectral distribution. The standard deviation of 1.16 points to significant variability across files, which complicates achieving consistent compression ratios.
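Spectral entropy can be estimated along the following lines (Welch PSD, Shannon entropy in bits); the exact estimator behind the figures above may differ.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

def spectral_entropy(x: np.ndarray, fs: float) -> float:
    """Shannon entropy (bits) of the normalized power spectral density."""
    _, psd = welch(x, fs=fs, nperseg=1024)
    p = psd / psd.sum()          # normalize the PSD into a probability distribution
    p = p[p > 0]                 # drop empty bins to avoid log2(0)
    return float(-np.sum(p * np.log2(p)))

fs, samples = wavfile.read("path/to/recording.wav")  # placeholder path
print(spectral_entropy(samples.astype(np.float64), fs))
```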

3/ very noisy, random spectrogram

The Suite of Lossless Algorithms

[!NOTE]

As a reminder, these algorithms were categorized into 5 groups based on their core methodologies and performance characteristics:

  1. Dictionary-Based Codecs (zstd, lz4, lz4hc, gzip, zlib, bz2 etc.)
  2. Predictive Codecs (delta etc.)
  3. Entropy Codecs (huffman, arithmetic etc.)
  4. Specialized Audio Codecs (wavpack etc.)
  5. Blosc Hybrid Codecs

Benchmarking Metrics

The following variables were measured in order to effectively benchmark the algorithms against one another:

Results

[Figure: Compression vs. decompression comparison]

insights/thoughts:

BZ2 Algorithm:

Compressing Data with BZ2

The BZ2 compressor utilizes the Burrows-Wheeler Transform (BWT) followed by the Move-to-Front (MTF) Transform and Huffman coding. Here's how each step contributes to the compression process:

Burrows-Wheeler Transform (BWT):

$$ \text{BWT}(S) = \text{last column of sorted cyclic permutations of } S $$
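A naive illustration of the transform on a short byte string; production BWT implementations use suffix arrays and an end-of-string sentinel, which this sketch omits.

```python
def bwt_naive(s: bytes) -> bytes:
    """Last column of the sorted cyclic rotations of s (educational only, O(n^2 log n))."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(rot[-1] for rot in rotations)

print(bwt_naive(b"banana"))  # b'nnbaaa' -- equal characters end up clustered together
```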

Move-to-Front (MTF) Transform:
Huffman Coding:

Decompressing Data with BZ2

Decompression reverses the compression steps:

Compression Level Impact

The compression level L in BZ2 can be adjusted, typically ranging from 1 to 9. A higher compression level increases the compression ratio but also the computational expense:

$$ \text{Compression Time} \propto L \times \text{Data Complexity} $$
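With Python's built-in bz2 module, for instance, the level is the compresslevel argument; sweeping it on one of the recordings shows the ratio/time trade-off (path is a placeholder).

```python
import bz2
import time

with open("path/to/recording.wav", "rb") as f:       # placeholder path
    raw = f.read()

for level in (1, 5, 9):
    t0 = time.perf_counter()
    compressed = bz2.compress(raw, compresslevel=level)
    dt = time.perf_counter() - t0
    ratio = len(raw) / len(compressed)
    print(f"level {level}: {ratio:.2f}x in {dt * 1000:.1f} ms")
```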

Delta-Huffman Algorithm:

Delta Huffman Coding combines the principles of delta encoding and Huffman coding to effectively compress data, especially beneficial for time-series or signal data where consecutive samples are often similar.
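The delta stage on its own is straightforward; a sketch of the idea is below (the repo's actual bit-packing and canonical Huffman details are omitted).

```python
import numpy as np

def delta_encode(samples: np.ndarray) -> np.ndarray:
    """Keep the first sample, then successive differences (small when neighbors are similar)."""
    return np.concatenate(([samples[0]], np.diff(samples)))

def delta_decode(deltas: np.ndarray) -> np.ndarray:
    """Invert delta encoding with a cumulative sum."""
    return np.cumsum(deltas)

x = np.array([1000, 1002, 1001, 1005, 1004], dtype=np.int32)
d = delta_encode(x)                      # [1000, 2, -1, 4, -1] -- mostly small values
assert np.array_equal(delta_decode(d), x)
```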

Building the Huffman Tree

Generating Huffman Codes

Compressing Data with Canonical Huffman Codes

$$ \text{Encoded Data Length} = \sum (\text{Code Length of Symbol} \times \text{Frequency of Symbol}) $$
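Equivalently, given a table of code lengths and symbol frequencies (both illustrative here), the encoded length is a simple sum of products.

```python
# Illustrative code lengths (bits) and frequencies for four delta symbols.
code_lengths = {0: 1, 1: 2, -1: 3, 2: 3}          # a valid prefix code (Kraft sum = 1)
frequencies  = {0: 5000, 1: 2500, -1: 1500, 2: 1000}

encoded_bits = sum(code_lengths[s] * frequencies[s] for s in frequencies)
print(encoded_bits)  # 17500 bits for this toy distribution
```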

Decompressing Data

Serialization of Huffman Codes

Next Steps

Open for Pull Requests (PRs) 👋

Feel free to contribute to the N1 Codec project! If you have ideas or improvements, just submit a PR. Here are some areas where your contributions could really help: