pola-rs / nodejs-polars

nodejs front-end of polars
https://pola-rs.github.io/nodejs-polars/
MIT License
440 stars 45 forks source link

Performance Bottleneck in nodejs-polars When Creating Multiple Expr Objects #265

Open chenstarx opened 2 months ago

chenstarx commented 2 months ago

Have you tried latest version of polars?

What version of polars are you using?

0.0.15

What operating system are you using polars on?

MacOS 14.4.1 (M3 Pro)

What node version are you using

v20.12.1

Describe your bug.

I have encountered a significant performance issue when using the nodejs-polars library. Specifically, the time required to create multiple Expr objects is considerably higher compared to the Python version of polars.

What are the steps to reproduce the behavior?

To illustrate the issue, I conducted a performance test by generating one million Expr objects in both nodejs-polars and Python polars. The following code snippets demonstrate the test setup:

Python Code

import polars as pl
import time

start_time = time.time()

for _ in range(1000000):
    _ = pl.col("A") > pl.lit(1)

end_time = time.time()
print(f"Time taken in Python: {end_time - start_time} seconds")

Node.js Code

const pl = require('nodejs-polars');
console.time('Expr Creation');

for (let i = 0; i < 1000000; i++) {
  const expr = pl.col("A").gt(pl.lit(1));
}

console.timeEnd('Expr Creation');

What is the actual behavior?

Python polars: Approximately 7 seconds to create 1,000,000 Expr objects. Node.js polars: Approximately 1,000 seconds to create the same number of Expr objects.

Impact

This performance discrepancy presents a significant bottleneck when performing operations that require frequent creation of Expr objects in nodejs-polars. It substantially limits the library's usability for large-scale data processing tasks in a Node.js environment.

What is the expected behavior?

The performance of creating Expr objects in nodejs-polars should be closer to, or ideally match, the performance in the Python version of polars.

Possible Reason

The issue might be caused by _Expr that will create an new Expr object when executed. Each execution will take about 0.5ms in my laptop, consuming considerable time if executed million times. Moreover, In my test dataset, the actual computing time for millions rows of data is very short, most of the time was wasted in creating the Expr objects.

There might be two ways to solve this problem:

  1. Rewriting _Expr with class, and return this in each expression method, avoiding time-consuming operations of creating a complex new object in javascript.
  2. Turning the Expr into mutable object, updating its attributes rather than creating an new object in each expression.

Thanks for reading the issue, I hope my suggestion would be helpful for the nodejs-polars library.

maizhichao commented 2 months ago

I encountered the same issue.

Bidek56 commented 2 months ago

It's partly a node issue, have you tried bun or deno? I get much better performance using those 2 vs. node

maizhichao commented 2 months ago

It's partly a node issue, have you tried bun or deno? I get much better performance using those 2 vs. node

My project highly depends on Node.js, it's hard to migrate to Bun or Deno...

Bidek56 commented 2 months ago

Anything other operation using node takes much longer than using Bun For example I just tried: [...Array(1_000_000)].forEach((_, i) => pl.Series("abs", [1, 2, 3])); It takes many times longer in node than in Bun Why do you think re-writing just _Expr will solve your performance issue?

Bidek56 commented 2 months ago

For me Bun is 20% faster than Python when running: [...Array(1_000_000)].forEach((_, i) => pl.Series("abs", [1, 2, 3]));

universalmind303 commented 2 months ago

@maizhichao unfortunately, this is a known issue with nodejs. Their FFI implementation is incredibly slow compared to python or other js engines (deno, bun). The main bottleneck is sending values over n-api, which refactoring Expr will not fix.