miranov25 / RootInteractive


Data volume reduction on the client - data types and indices #355

Open miranov25 opened 1 month ago

miranov25 commented 1 month ago

To facilitate data transfer between the server and client, user-defined lossy and lossless compression is employed.

The techniques discussed below can reduce the data volume on the client. Their impact on the speed of subsequent data manipulation should be evaluated:

It is unclear how quickly specific functionalities can be implemented in JavaScript, how much they might slow the code down, and how much complexity they would introduce. A subset of these features should be implemented promptly, while the rest could likely be addressed using WebAssembly.

miranov25 commented 1 month ago

N-bit Integer Representation

An array of 1, 2, 3, ... N-bit integers can be represented using a bit array.

Consider an array of M numbers, each with N bits. To represent them, you can use a Uint8Array of size ceil((N * M) / 8).

How efficient is random access in such an array in JavaScript?
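As a rough answer to the random-access question: in plain JavaScript, reading one N-bit value costs a few shifts and masks per bit, so access is O(N) in the bit width but allocation-free. A minimal sketch (the names packNBit/getNBit are illustrative, not part of RootInteractive):

```javascript
// Sketch: pack M values of N bits each into a Uint8Array of
// ceil(N * M / 8) bytes, with bit-level random access.
function packNBit(values, nBits) {
    const packed = new Uint8Array(Math.ceil(values.length * nBits / 8));
    for (let i = 0; i < values.length; i++) {
        const bitPos = i * nBits;
        for (let b = 0; b < nBits; b++) {
            if ((values[i] >> b) & 1) {
                const pos = bitPos + b;
                packed[pos >> 3] |= 1 << (pos & 7);
            }
        }
    }
    return packed;
}

function getNBit(packed, index, nBits) {
    let value = 0;
    const bitPos = index * nBits;
    for (let b = 0; b < nBits; b++) {
        const pos = bitPos + b;
        value |= ((packed[pos >> 3] >> (pos & 7)) & 1) << b;
    }
    return value;
}

// Example: four 3-bit values packed into 2 bytes.
const packed = packNBit([5, 2, 7, 1], 3);
console.log(getNBit(packed, 2, 3)); // 7
```

For widths of 8, 16, or 32 bits the dedicated typed arrays (Uint8Array, Uint16Array, Uint32Array) are simpler and faster; the bit-level packing above only pays off for odd widths.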

pl0xz0rz commented 4 weeks ago

What is not mentioned here but should help improve memory consumption: bitmasks take up too much space. While this is not much of an issue if there are more columns than histograms, reducing it by a factor of ~2 should be simple.

miranov25 commented 3 weeks ago

Bitmasks for Widget Selection

This section explains the bitmask representation used for widget selection, which is ideal for inclusion in READMEdeveloper.md or similar documentation.

Definition

We utilize a bitmask representation where each widget and point is represented by a bit.

Memory Allocation

Ideally, we should allocate an array of 32-bit words of size N_points * ceil(N_widgets / 32), i.e. one row of packed widget bits per point.
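A minimal sketch of this row-major layout, assuming hypothetical helper names (not the actual RootInteractive code):

```javascript
// Sketch of the row-major bitmask layout: one row of
// ceil(nWidgets / 32) 32-bit words per point.
function makeRowBitmask(nPoints, nWidgets) {
    const wordsPerRow = Math.ceil(nWidgets / 32);
    return new Uint32Array(nPoints * wordsPerRow);
}

function setBit(mask, wordsPerRow, point, widget, selected) {
    const word = point * wordsPerRow + (widget >> 5);
    const bit = 1 << (widget & 31);
    if (selected) mask[word] |= bit; else mask[word] &= ~bit;
}

function getBit(mask, wordsPerRow, point, widget) {
    const word = point * wordsPerRow + (widget >> 5);
    return (mask[word] >>> (widget & 31)) & 1;
}

// Usage: 100 points, 40 widgets -> 2 words per point.
const wordsPerRow = Math.ceil(40 / 32);
const mask = makeRowBitmask(100, 40);
setBit(mask, wordsPerRow, 3, 35, true);
console.log(getBit(mask, wordsPerRow, 3, 35)); // 1
```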

Selection Logic

Implementation Differences (described above and new Marian pull request)

In the existing implementation by Marian, a different logic is used: each widget has its own bitmask column of ceil(N_points / 32) 32-bit words.

Additional Memory Requirements for Special Functions

The advantage of Marian's column-per-widget implementation lies in its update method: when a single widget changes, only that widget's bitmask column is modified. In the row-bitmask representation, the per-point selection decision is fast and does not require a special combined column; however, since such a column is small, the memory savings from omitting it are minimal (N_points x 32 bits).
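A sketch of the column-per-widget layout described above; the helper names are illustrative, and the predicate-based update is an assumption about how a single widget change would be applied:

```javascript
// Sketch of the column-per-widget layout: one Uint32Array of
// ceil(nPoints / 32) words per widget; the combined selection is the
// bitwise AND of all columns.
function makeWidgetColumns(nWidgets, nPoints) {
    const words = Math.ceil(nPoints / 32);
    return Array.from({ length: nWidgets }, () => new Uint32Array(words));
}

// When a single widget changes, only its own column is rewritten.
function updateColumn(column, predicate, nPoints) {
    column.fill(0);
    for (let i = 0; i < nPoints; i++) {
        if (predicate(i)) column[i >> 5] |= 1 << (i & 31);
    }
}

function combineSelection(columns) {
    const out = new Uint32Array(columns[0].length).fill(0xFFFFFFFF);
    for (const col of columns) {
        for (let w = 0; w < out.length; w++) out[w] &= col[w];
    }
    return out;
}
```

In this layout the combined selection must be recomputed (or incrementally re-ANDed) after an update, which is the trade-off against the row-major form.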

miranov25 commented 3 weeks ago

Custom logic implementation - for later

Currently the selection is only a logical AND (&&) of the selections from different widgets.

In the old PAW system, the configuration was defined as a custom user-defined string, as shown above. For us this means that widgets need IDs (as strings), and these IDs could then be used in the custom logic form.
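One possible sketch of such custom logic: compile a user-defined expression over widget IDs into a per-point predicate with the Function constructor. The widget IDs and the expression here are hypothetical, and a real implementation should validate the string before evaluating it:

```javascript
// Sketch: compile a PAW-style user logic string over widget IDs into a
// predicate. Each widget ID becomes a parameter holding that widget's
// boolean selection for the current point.
function compileSelectionLogic(widgetIds, expression) {
    // NOTE: new Function evaluates arbitrary code; the expression must
    // be validated/whitelisted in a real implementation.
    return new Function(...widgetIds, "return (" + expression + ");");
}

// Usage: three widgets with string IDs and a custom combination.
const isSelected = compileSelectionLogic(["A", "B", "C"], "A && (B || C)");
console.log(isSelected(true, false, true));  // true
console.log(isSelected(true, false, false)); // false
```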

miranov25 commented 3 weeks ago

refer to #96

miranov25 commented 3 weeks ago

Optimization of funCustom - needed to reduce the need for expanded functions

Title: Automatically Generate Function Header for JavaScript Function Based on Used Variables

Description:

We have a list of all variables present in a table and a JavaScript function (as a string without the header) that uses a subset of these variables. Our goal is to automatically generate the function header for this JavaScript function, including only the variables that are actually used in the function.

This is needed as we would like to keep columns compressed and use (expand to) float representation only if needed.

Use Case:

  1. Input:

    • A list of all variables (columns).
    • A JavaScript function as a string (functionString).
  2. Output:

    • A function header that includes only the variables from the list that are used in the function string.

Proposed Solution:

We will implement an algorithm to achieve this by iterating through the list of variables and checking if each variable is present in the function string. If a variable is found in the function string, it will be added to the function header. Note that the check should match whole identifiers; a naive substring test would wrongly treat v1 as used when only v10 appears in the function body.

Pseudocode:

function generateFunctionHeader(columns, functionString):
    Initialize an empty list `columnArgs`

    for each column in columns:
        if column is present in functionString:
            add column to `columnArgs`

    Create the function header string `header`
    header = "function XXX(" + join(columnArgs, ", ") + ") {"

    return header

JavaScript Code:

function generateFunctionHeader(columns, functionString) {
    let columnArgs = [];

    columns.forEach(column => {
        // Escape regex metacharacters and match the column name as a
        // whole identifier (a plain includes() would match "v1" in "v10").
        let escaped = column.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
        if (new RegExp("\\b" + escaped + "\\b").test(functionString)) {
            columnArgs.push(column);
        }
    });

    let header = "function XXX(" + columnArgs.join(", ") + ") {";

    return header;
}

// Example usage:
let columns = ["v0", "v1", "v2", "v3", "v4"];
let functionString = "let sum = v0 + v1; let product = v1 * v3; return sum + product;";

let functionHeader = generateFunctionHeader(columns, functionString);
console.log(functionHeader);  // Output: "function XXX(v0, v1, v3) {"

miranov25 commented 2 weeks ago

Goal:

Node.js Speed Benchmark:

To evaluate the use of compressed float representations for data manipulation, we should compare CPU usage across different types of operations:

Representations:

Operations:

Benchmark:
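The benchmark outline above is not filled in; as a starting point, a minimal Node.js sketch that compares a sum over a plain Float64Array with a sum that decodes a 16-bit linearly quantized representation on the fly (the linear 16-bit quantization scheme is an assumption, not the RootInteractive codec):

```javascript
// Minimal Node.js benchmark sketch: Float64Array versus on-the-fly
// decoding from a linearly quantized Uint16Array.
const N = 1_000_000;
const floats = new Float64Array(N);
for (let i = 0; i < N; i++) floats[i] = Math.sin(i);

// Quantize the [-1, 1] range to 16-bit unsigned integers.
const scale = 65535 / 2, offset = -1;
const quantized = new Uint16Array(N);
for (let i = 0; i < N; i++) quantized[i] = Math.round((floats[i] - offset) * scale);

function sumFloat(a) {
    let s = 0;
    for (let i = 0; i < a.length; i++) s += a[i];
    return s;
}

function sumQuantized(a) {
    let s = 0;
    for (let i = 0; i < a.length; i++) s += a[i] / scale + offset;
    return s;
}

for (const [name, fn, arg] of [["float64", sumFloat, floats],
                               ["uint16 decode", sumQuantized, quantized]]) {
    const t0 = performance.now();
    const s = fn(arg);
    console.log(`${name}: ${(performance.now() - t0).toFixed(2)} ms, sum=${s.toFixed(2)}`);
}
```

The same skeleton extends to other representations (bit-packed integers, delta coding) and other operations (filtering, histogramming).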

miranov25 commented 2 weeks ago

Node.js vs Browser CPU Benchmark Consideration

Marian's Observation

"When I look at performance, there is quite a big difference between the browser and Node.js; in some code pairs, one is faster in the browser and the other in Node.js. Therefore, I believe it is more realistic to perform benchmarks in the browser and restrict Node.js to just unit tests."

Marian, could you please provide some examples? Are the differences a matter of percentage points or orders of magnitude?

Proposal for Continuing CPU Benchmarks in Node.js

I suggest we maintain CPU benchmarks for the numerical parts of the code in Node.js as well, since it is easier to automate and compare. For regression testing, this will be very important. We should soon move more towards WebAssembly, where I expect the results will be more predictable.
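A minimal sketch of the kind of Node.js micro-benchmark helper that is easy to automate for regression testing: run a kernel several times, take the median wall time, and let a CI job compare it against a stored baseline (names and the example kernel are illustrative):

```javascript
// Sketch: repeatable micro-benchmark helper for Node.js regression tests.
// Runs a function several times and reports the median wall time, which
// is more stable than a single run.
function medianTimeMs(fn, repeats = 5) {
    const times = [];
    for (let r = 0; r < repeats; r++) {
        const t0 = performance.now();
        fn();
        times.push(performance.now() - t0);
    }
    times.sort((a, b) => a - b);
    return times[Math.floor(times.length / 2)];
}

// Usage: time a simple numeric kernel; a CI job could compare this
// number against a stored baseline to catch regressions.
const t = medianTimeMs(() => {
    let s = 0;
    for (let i = 0; i < 1e6; i++) s += Math.sqrt(i);
    return s;
});
console.log(`median: ${t.toFixed(2)} ms`);
```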

Suggested Solution - As per GPT:

Based on the observation that performance varies significantly between the browser and Node.js environments for different code snippets, the suggestion to focus benchmarking in the browser and limit Node.js to unit testing appears reasonable. Here's a more detailed approach to implementing this strategy:

  1. Isolate Performance-Sensitive Code: Identify parts of the codebase that are performance-sensitive and may behave differently in a browser compared to Node.js. This includes DOM manipulations, graphical operations, or operations that rely heavily on browser APIs.

  2. Conduct Benchmarks in the Browser: Use tools like Google Chrome’s Lighthouse or Mozilla Firefox’s Developer Tools to run performance benchmarks directly in the browser. This method provides a more accurate measurement of how the code performs in its intended environment.

  3. Set Up Automated Browser Tests: Implement automated browser testing frameworks such as Puppeteer or Selenium. These tools help automate benchmarks and integrate them into your continuous integration (CI) pipeline, regularly monitoring performance as part of your development process.

  4. Utilize Node.js for Unit Testing: Since Node.js may not accurately replicate the browser environment for all types of code, it should primarily be used for unit testing backend functionalities or non-UI logic that does not heavily depend on browser capabilities. Frameworks like Mocha, Jest, or Jasmine are suitable for these tasks.

  5. Performance Regression Testing: Regularly compare new performance measurements against previous benchmarks to identify any regressions or improvements. This can be integrated into your CI/CD pipeline to ensure continuous performance evaluation.

  6. Documentation and Reporting: Maintain detailed records of performance tests and their results. Utilize this data to generate reports for further analysis and to guide optimization efforts.

  7. Feedback Loop: Use insights from both browser benchmarks and Node.js tests to inform development decisions, focusing on optimizing areas where performance bottlenecks are identified.