miranov25 / RootInteractive


Data volume reduction on the client - data types and indices #355

Open miranov25 opened 1 month ago

miranov25 commented 1 month ago

To facilitate data transfer between the server and client, user-defined lossy and lossless compression is employed.

The techniques discussed below can reduce the data volume on the client. Their impact on the speed of subsequent data manipulation should be evaluated:

It is unclear how quickly specific functionalities can be implemented in JavaScript, how much they might slow the code down, and how much complexity they would introduce. A subset of these features should be implemented promptly, while the rest could likely be addressed using WebAssembly.

miranov25 commented 1 month ago

N-bit Integer Representation

An array of 1, 2, 3, ... N-bit integers can be represented using a bit array.

Consider an array of M numbers, each with N bits. To represent them, you can use a Uint8Array of size ceil((N * M) / 8).

How efficient is random access in such an array in JavaScript?
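As a rough answer to the random-access question: in plain JavaScript, reading one N-bit value costs a few shifts and masks per bit, so access is O(N) in the bit width but allocation-free. A minimal sketch (the names packNBit/getNBit are illustrative, not part of RootInteractive):

```javascript
// Sketch: pack M values of N bits each into a Uint8Array of
// ceil(N * M / 8) bytes, with bit-level random access.
function packNBit(values, nBits) {
    const packed = new Uint8Array(Math.ceil(values.length * nBits / 8));
    for (let i = 0; i < values.length; i++) {
        const bitPos = i * nBits;
        for (let b = 0; b < nBits; b++) {
            if ((values[i] >> b) & 1) {
                const pos = bitPos + b;
                packed[pos >> 3] |= 1 << (pos & 7);
            }
        }
    }
    return packed;
}

function getNBit(packed, index, nBits) {
    let value = 0;
    const bitPos = index * nBits;
    for (let b = 0; b < nBits; b++) {
        const pos = bitPos + b;
        value |= ((packed[pos >> 3] >> (pos & 7)) & 1) << b;
    }
    return value;
}

// Example: four 3-bit values packed into 2 bytes.
const packed = packNBit([5, 2, 7, 1], 3);
console.log(getNBit(packed, 2, 3)); // 7
```

For widths of 8, 16, or 32 bits the dedicated typed arrays (Uint8Array, Uint16Array, Uint32Array) are simpler and faster; the bit-level packing above only pays off for odd widths.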

pl0xz0rz commented 4 weeks ago

What is not mentioned here but should help improve memory consumption: bitmasks take up too much space. While this is not much of an issue if there are more columns than histograms, reducing it by a factor of ~2 should be simple.

miranov25 commented 3 weeks ago

Bitmasks for Widget Selection

This section explains the bitmask representation used for widget selection, which is ideal for inclusion in READMEdeveloper.md or similar documentation.

Definition

We utilize a bitmask representation where each widget and point is represented by a bit.

Memory Allocation

Ideally, we should allocate an array of 32-bit words of size N_points * ceil(N_widgets / 32), i.e. one row of packed widget bits per point.
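A minimal sketch of this row-major layout, assuming hypothetical helper names (not the actual RootInteractive code):

```javascript
// Sketch of the row-major bitmask layout: one row of
// ceil(nWidgets / 32) 32-bit words per point.
function makeRowBitmask(nPoints, nWidgets) {
    const wordsPerRow = Math.ceil(nWidgets / 32);
    return new Uint32Array(nPoints * wordsPerRow);
}

function setBit(mask, wordsPerRow, point, widget, selected) {
    const word = point * wordsPerRow + (widget >> 5);
    const bit = 1 << (widget & 31);
    if (selected) mask[word] |= bit; else mask[word] &= ~bit;
}

function getBit(mask, wordsPerRow, point, widget) {
    const word = point * wordsPerRow + (widget >> 5);
    return (mask[word] >>> (widget & 31)) & 1;
}

// Usage: 100 points, 40 widgets -> 2 words per point.
const wordsPerRow = Math.ceil(40 / 32);
const mask = makeRowBitmask(100, 40);
setBit(mask, wordsPerRow, 3, 35, true);
console.log(getBit(mask, wordsPerRow, 3, 35)); // 1
```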

Selection Logic

Implementation Differences (described above and new Marian pull request)

In the existing implementation by Marian, a different logic is used: each widget has its own bitmask column of ceil(N_points / 32) 32-bit words.

Additional Memory Requirements for Special Functions

The advantage of Marian's column-per-widget implementation lies in its update method: when a single widget changes, only that widget's bitmask column is modified. In the row-bitmask representation, the per-point selection decision is fast and does not require a special combined column; however, since such a column is small, the memory savings from omitting it are minimal (N_points x 32 bits).
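A sketch of the column-per-widget layout described above; the helper names are illustrative, and the predicate-based update is an assumption about how a single widget change would be applied:

```javascript
// Sketch of the column-per-widget layout: one Uint32Array of
// ceil(nPoints / 32) words per widget; the combined selection is the
// bitwise AND of all columns.
function makeWidgetColumns(nWidgets, nPoints) {
    const words = Math.ceil(nPoints / 32);
    return Array.from({ length: nWidgets }, () => new Uint32Array(words));
}

// When a single widget changes, only its own column is rewritten.
function updateColumn(column, predicate, nPoints) {
    column.fill(0);
    for (let i = 0; i < nPoints; i++) {
        if (predicate(i)) column[i >> 5] |= 1 << (i & 31);
    }
}

function combineSelection(columns) {
    const out = new Uint32Array(columns[0].length).fill(0xFFFFFFFF);
    for (const col of columns) {
        for (let w = 0; w < out.length; w++) out[w] &= col[w];
    }
    return out;
}
```

In this layout the combined selection must be recomputed (or incrementally re-ANDed) after an update, which is the trade-off against the row-major form.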

miranov25 commented 3 weeks ago

Custom logic implementation - for later

Currently the selection is only a logical AND (&&) of the selections from different widgets.

In the old PAW system, the configuration was defined as a custom user-defined string, as shown above. For us this means that widgets need IDs (as strings), and these IDs could then be used in the custom logic form.
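One possible sketch of such custom logic: compile a user-defined expression over widget IDs into a per-point predicate with the Function constructor. The widget IDs and the expression here are hypothetical, and a real implementation should validate the string before evaluating it:

```javascript
// Sketch: compile a PAW-style user logic string over widget IDs into a
// predicate. Each widget ID becomes a parameter holding that widget's
// boolean selection for the current point.
function compileSelectionLogic(widgetIds, expression) {
    // NOTE: new Function evaluates arbitrary code; the expression must
    // be validated/whitelisted in a real implementation.
    return new Function(...widgetIds, "return (" + expression + ");");
}

// Usage: three widgets with string IDs and a custom combination.
const isSelected = compileSelectionLogic(["A", "B", "C"], "A && (B || C)");
console.log(isSelected(true, false, true));  // true
console.log(isSelected(true, false, false)); // false
```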

miranov25 commented 3 weeks ago

refer to #96

miranov25 commented 3 weeks ago

Optimization of funCustom - needed to reduce the need for expanded functions

Title: Automatically Generate Function Header for JavaScript Function Based on Used Variables

Description:

We have a list of all variables present in a table and a JavaScript function (as a string without the header) that uses a subset of these variables. Our goal is to automatically generate the function header for this JavaScript function, including only the variables that are actually used in the function.

This is needed as we would like to keep columns compressed and use (expand to) float representation only if needed.

Use Case:

  1. Input:

    • A list of all variables (columns).
    • A JavaScript function as a string (functionString).
  2. Output:

    • A function header that includes only the variables from the list that are used in the function string.

Proposed Solution:

We will implement an algorithm to achieve this by iterating through the list of variables and checking if each variable is present in the function string. If a variable is found in the function string, it will be added to the function header. Note that the check should match whole identifiers; a naive substring test would wrongly treat v1 as used when only v10 appears in the function body.

Pseudocode:

function generateFunctionHeader(columns, functionString):
    Initialize an empty list `columnArgs`

    for each column in columns:
        if column is present in functionString:
            add column to `columnArgs`

    Create the function header string `header`
    header = "function XXX(" + join(columnArgs, ", ") + ") {"

    return header

JavaScript Code:

function generateFunctionHeader(columns, functionString) {
    let columnArgs = [];

    columns.forEach(column => {
        // Escape regex metacharacters and match the column name as a
        // whole identifier (a plain includes() would match "v1" in "v10").
        let escaped = column.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
        if (new RegExp("\\b" + escaped + "\\b").test(functionString)) {
            columnArgs.push(column);
        }
    });

    let header = "function XXX(" + columnArgs.join(", ") + ") {";

    return header;
}

// Example usage:
let columns = ["v0", "v1", "v2", "v3", "v4"];
let functionString = "let sum = v0 + v1; let product = v1 * v3; return sum + product;";

let functionHeader = generateFunctionHeader(columns, functionString);
console.log(functionHeader);  // Output: "function XXX(v0, v1, v3) {"

miranov25 commented 2 weeks ago

Goal:

Node.js Speed Benchmark:

To evaluate the use of compressed float representations for data manipulation, we should compare CPU usage across different types of operations:

Representations:

Operations:

Benchmark:
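The benchmark outline above is not filled in; as a starting point, a minimal Node.js sketch that compares a sum over a plain Float64Array with a sum that decodes a 16-bit linearly quantized representation on the fly (the linear 16-bit quantization scheme is an assumption, not the RootInteractive codec):

```javascript
// Minimal Node.js benchmark sketch: Float64Array versus on-the-fly
// decoding from a linearly quantized Uint16Array.
const N = 1_000_000;
const floats = new Float64Array(N);
for (let i = 0; i < N; i++) floats[i] = Math.sin(i);

// Quantize the [-1, 1] range to 16-bit unsigned integers.
const scale = 65535 / 2, offset = -1;
const quantized = new Uint16Array(N);
for (let i = 0; i < N; i++) quantized[i] = Math.round((floats[i] - offset) * scale);

function sumFloat(a) {
    let s = 0;
    for (let i = 0; i < a.length; i++) s += a[i];
    return s;
}

function sumQuantized(a) {
    let s = 0;
    for (let i = 0; i < a.length; i++) s += a[i] / scale + offset;
    return s;
}

for (const [name, fn, arg] of [["float64", sumFloat, floats],
                               ["uint16 decode", sumQuantized, quantized]]) {
    const t0 = performance.now();
    const s = fn(arg);
    console.log(`${name}: ${(performance.now() - t0).toFixed(2)} ms, sum=${s.toFixed(2)}`);
}
```

The same skeleton extends to other representations (bit-packed integers, delta coding) and other operations (filtering, histogramming).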

miranov25 commented 2 weeks ago

Node.js vs Browser CPU Benchmark Consideration

Marian's Observation

"When I look at performance, there is quite a big difference between the browser and Node.js; in some code pairs, one is faster in the browser and the other in Node.js. Therefore, I believe it is more realistic to perform benchmarks in the browser and restrict Node.js to just unit tests."

Marian, could you please provide some examples? Are the differences a matter of percentage points or orders of magnitude?

Proposal for Continuing CPU Benchmarks in Node.js

I suggest we maintain CPU benchmarks for the numerical parts of the code in Node.js as well, since it is easier to automate and compare. For regression testing, this will be very important. We should soon move more towards WebAssembly, where I expect the results will be more predictable.
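A minimal sketch of the kind of Node.js micro-benchmark helper that is easy to automate for regression testing: run a kernel several times, take the median wall time, and let a CI job compare it against a stored baseline (names and the example kernel are illustrative):

```javascript
// Sketch: repeatable micro-benchmark helper for Node.js regression tests.
// Runs a function several times and reports the median wall time, which
// is more stable than a single run.
function medianTimeMs(fn, repeats = 5) {
    const times = [];
    for (let r = 0; r < repeats; r++) {
        const t0 = performance.now();
        fn();
        times.push(performance.now() - t0);
    }
    times.sort((a, b) => a - b);
    return times[Math.floor(times.length / 2)];
}

// Usage: time a simple numeric kernel; a CI job could compare this
// number against a stored baseline to catch regressions.
const t = medianTimeMs(() => {
    let s = 0;
    for (let i = 0; i < 1e6; i++) s += Math.sqrt(i);
    return s;
});
console.log(`median: ${t.toFixed(2)} ms`);
```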

Suggested Solution - As per GPT:

Based on the observation that performance varies significantly between the browser and Node.js environments for different code snippets, the suggestion to focus benchmarking in the browser and limit Node.js to unit testing appears reasonable. Here's a more detailed approach to implementing this strategy:

  1. Isolate Performance-Sensitive Code: Identify parts of the codebase that are performance-sensitive and may behave differently in a browser compared to Node.js. This includes DOM manipulations, graphical operations, or operations that rely heavily on browser APIs.

  2. Conduct Benchmarks in the Browser: Use tools like Google Chrome’s Lighthouse or Mozilla Firefox’s Developer Tools to run performance benchmarks directly in the browser. This method provides a more accurate measurement of how the code performs in its intended environment.

  3. Set Up Automated Browser Tests: Implement automated browser testing frameworks such as Puppeteer or Selenium. These tools help automate benchmarks and integrate them into your continuous integration (CI) pipeline, regularly monitoring performance as part of your development process.

  4. Utilize Node.js for Unit Testing: Since Node.js may not accurately replicate the browser environment for all types of code, it should primarily be used for unit testing backend functionalities or non-UI logic that does not heavily depend on browser capabilities. Frameworks like Mocha, Jest, or Jasmine are suitable for these tasks.

  5. Performance Regression Testing: Regularly compare new performance measurements against previous benchmarks to identify any regressions or improvements. This can be integrated into your CI/CD pipeline to ensure continuous performance evaluation.

  6. Documentation and Reporting: Maintain detailed records of performance tests and their results. Utilize this data to generate reports for further analysis and to guide optimization efforts.

  7. Feedback Loop: Use insights from both browser benchmarks and Node.js tests to inform development decisions, focusing on optimizing areas where performance bottlenecks are identified.