nodejs / worker

Figuring out native (Web?)Worker support for Node

Step 1: Figure out use cases? #1

Closed: addaleax closed this issue 6 years ago

addaleax commented 7 years ago

I think one of the first things we’ll want to do is to figure out what the actual use cases for (Web)Workers in Node are, so that we have a better idea of the requirements the API and implementation have to fulfill.

I guess Workers could be applied to anything that needs a fast way to share information across some kind of parallel execution, but a lot of that can already be addressed using multiple processes and standard IPC methods, so what’s going to be most interesting is to hear what one cannot do fast right now without them.

addaleax commented 7 years ago

I can also go first: the one time I really wanted workers in Node was when building the coverage tooling for Node core itself. Parsing and serializing the JSON coverage information is a noticeable bottleneck, so being able to distribute that across multiple threads would be nice to have.
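
A minimal sketch of what that could look like (using the worker_threads API that eventually landed in Node; the inline worker and sample data are illustrative):

```js
const { Worker } = require('worker_threads');

// Spawn a worker whose only job is to parse JSON off the main thread.
function parseInWorker(text) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(`
      const { parentPort } = require('worker_threads');
      parentPort.once('message', (json) => {
        // The result travels back via structured clone.
        parentPort.postMessage(JSON.parse(json));
      });
    `, { eval: true });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.postMessage(text);
  });
}

parseInWorker('{"lines": 1200, "covered": 950}')
  .then((cov) => console.log(cov.covered / cov.lines));
```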

Fishrock123 commented 7 years ago

The first thing that comes to my mind: off-I/O-thread template rendering.

matthewp commented 7 years ago

Code portability.

chrisdickinson commented 7 years ago

Seconding @Fishrock123's suggestion of off-I/O-thread template rendering — especially for server side rendering of React. Along the lines of JSON parsing, I suspect it would allow for much more efficient architectures for things like Babel & module bundlers, since this would allow authors to hand off the main-thread-blocking JS parse to a pool of threads (& possibly do so with a cheap object transfer?)

vladholubiev commented 7 years ago

Parsing/generating .csv?

For example, http://papaparse.com/ does this efficiently on the client side using web workers (I guess).

vkurchatkin commented 7 years ago

@chrisdickinson

possibly do so with a cheap object transfer

is that a real thing or something hypothetical?

addaleax commented 7 years ago

@vkurchatkin We do have an actual serialization/deserialization API now in V8; that should be a pretty good start. It’s basically what Chrome uses for its WebWorkers, so it should qualify for most people’s understanding of “cheap”.
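
For reference, that API is exposed in Node through the built-in v8 module; a small sketch:

```js
const v8 = require('v8');

// Structured-clone-style serialization: unlike JSON, this round-trips
// Maps, Sets, typed arrays, and circular references.
const value = { files: new Map([['lib/fs.js', 0.92]]), raw: new Uint8Array([1, 2, 3]) };

const buffer = v8.serialize(value);   // -> Buffer
const copy = v8.deserialize(buffer);  // -> deep copy of `value`

console.log(copy.files.get('lib/fs.js')); // 0.92
```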

(I’d like to keep this thread on-topic for use cases; don’t be shy to open new issues here for anything. People who go watch the repo are opting into the noise. ;))

vkurchatkin commented 7 years ago

@addaleax Fair enough. My problem with use cases is that unless a use case is specified well enough, you could argue that anything can be solved with the current multi-process model, including all the examples above. For me, in-process workers are about efficiency and simplicity, not use cases.

refack commented 7 years ago

IMHO a strong use case, as @matthewp mentioned, is code portability (a.k.a. isomorphism). Sorry to bring up cluster again, but it was homegrown, while WebWorker has wider acceptance as a standard.

refack commented 7 years ago

Another use case that comes to mind is low-priority tasks (even I/O-bound ones), things like cache hydration, offline batch processing, etc. Current implementations depend on the OS to prioritize the main process vs. the workers, whereas WebWorker has the explicit goal of keeping the main "thread" responsive.

inikulin commented 7 years ago

For parallelization of tokenization and tree construction in parse5. This can be handy for all parsers, I guess.

kgryte commented 7 years ago

Numeric computation. The ability to more cheaply distribute numeric computational tasks to multiple workers, as commonly found in machine learning and analysis of larger datasets (akin to MATLAB's parfor).

With the current model, one needs to perform wholesale copying of data between processes. With shared array buffers, workers can operate on the same buffer (sketched below), allowing better memory efficiency and performance.

In short, better environment support for parallel map-reduce style operations would be highly beneficial as Node.js applications become more computationally intensive.
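
A sketch of that model (using the worker_threads API that eventually landed in Node; the array size and thread count are illustrative): workers sum disjoint slices of a single Float64Array backed by a SharedArrayBuffer, with no copying between threads.

```js
const { Worker } = require('worker_threads');

const N = 1e7;
const THREADS = 4;

// One shared allocation; every worker sees the same memory.
const shared = new SharedArrayBuffer(N * Float64Array.BYTES_PER_ELEMENT);
new Float64Array(shared).fill(1.5);

const workerCode = `
  const { parentPort, workerData } = require('worker_threads');
  const { buf, start, end } = workerData;
  const view = new Float64Array(buf);
  let sum = 0;
  for (let i = start; i < end; i++) sum += view[i];
  parentPort.postMessage(sum);
`;

const chunk = Math.ceil(N / THREADS);
let pending = THREADS;
let total = 0;
for (let t = 0; t < THREADS; t++) {
  const worker = new Worker(workerCode, {
    eval: true,
    // SharedArrayBuffers are shared (not copied) through workerData.
    workerData: { buf: shared, start: t * chunk, end: Math.min(N, (t + 1) * chunk) },
  });
  worker.once('message', (sum) => {
    total += sum;
    if (--pending === 0) console.log(total); // 15000000
  });
}
```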

alexeagle commented 7 years ago

Compilers like TypeScript and Angular could parallelize parts of their pipelines.

refack commented 7 years ago

[question] For the parsing and heavy-computation use cases: what do you see as the specific benefit of Worker? Multi-core utilization, and/or the ability to keep the main thread responsive?

matthewp commented 7 years ago

@refack Keeping the main thread responsive; e.g., for a server, not blocking incoming requests.

inikulin commented 7 years ago

@refack Different stages of parsing (e.g. input preprocessing, tokenization, syntactic analysis) can be performed in parallel, thus reducing cumulative parsing time.

refack commented 7 years ago

Thank you @matthewp & @inikulin. That's what I wanted to know. 💯

inikulin commented 7 years ago

Just another example of a worker use case off the top of my head: parallelization of gulp build tasks. Currently, computationally heavy tasks (e.g. linting, compilation) can't be performed in parallel due to their blocking nature. Workers should significantly reduce build times.

refack commented 7 years ago

A reference for those who don't "watch" this repo: the spinoff discussion on High level architecture.

domenic commented 7 years ago

In jsdom, we would like this for two reasons:

We could also benefit from it for keeping the main thread responsive and parallelizing multiple files by doing background HTML and CSS parsing, as @inikulin alludes to.

boneskull commented 7 years ago

@domenic couldn't parallelization of file parsing be accomplished via cluster or IPC?

domenic commented 7 years ago

No, because the serialization overhead of sending it over IPC outweighs the benefit.

vkurchatkin commented 7 years ago

@domenic the problem is that serialization is required anyway

domenic commented 7 years ago

Not when using SharedArrayBuffer (or transferring normal ArrayBuffers). And also not when using strings (which are immutable and thus don't need to be serialized).
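
A sketch of the transfer case (worker_threads API used for illustration): the buffer's memory moves to the worker instead of being serialized, leaving the sender's copy detached.

```js
const { Worker } = require('worker_threads');

const worker = new Worker(`
  const { parentPort } = require('worker_threads');
  parentPort.once('message', (buf) => {
    parentPort.postMessage(buf.byteLength); // worker now owns the memory
  });
`, { eval: true });

const payload = new ArrayBuffer(64 * 1024 * 1024); // 64 MB

// The second argument is the transfer list: ownership moves, nothing is copied.
worker.postMessage(payload, [payload]);

console.log(payload.byteLength); // 0 -- detached on the sending side
worker.once('message', (len) => console.log(len)); // 67108864
```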

Fishrock123 commented 7 years ago

Ah yes, right, I had almost forgotten.

SharedArrayBuffer certainly makes Workers a lot more desirable.

DronRathore commented 7 years ago

At Housing.com our Node processes listen to RabbitMQ exchanges to flush and update cached keys that we keep in memory, e.g. the list of whitelisted domains, the list of cities, the list of experiments, etc. It would be great if that task could be offloaded from our main app, and the cache updated through shared buffers or other means.

This is one of those use cases where you want background tasks to really run in the background, and not on your process's main thread.

ljharb commented 7 years ago

At Airbnb, we use https://npmjs.com/hypernova as a long-running Node rendering service for our Rails app (we only use it for React, but it can render anything). Web Workers seem like they would make for a much more efficient sandbox than vm currently offers for rendering each request/batch of jobs in an isolated fashion.

bnoordhuis commented 7 years ago

And also not when using strings (which are immutable and thus don't need to be serialized).

@domenic Immutable but not fixed. Strings are moved around by the garbage collector. They need to be copied out before they can be used in another VM.

Your point about ArrayBuffers and SharedArrayBuffers is correct though.

NickNaso commented 7 years ago

In my work I need some sort of concurrency, because I have to analyze large amounts of data and big files (JSON, CSV, etc.). At the beginning I tried to use node-webworker-threads, but it doesn't allow access to the Node.js API from the Worker. I think that this is a big limitation.

addaleax commented 7 years ago

@NickNaso I’m curious, what parts of the Node.js API would you use in the workers? Could you explain a bit more why that is important to you?

sebdeckers commented 7 years ago

Cached HTTP responses can be shared by workers in a web server, for example the expensive-to-compute Brotli/Zopfli encoding of static assets.

Alternatives:

Other languages like PHP have long offered shared memory and semaphores for similar purposes.

NickNaso commented 7 years ago

@addaleax I want to use the file system API, because in my work I use it a lot. It's important because I have many files to analyze (read, parse, extract information) and I would like to do these operations concurrently.

inikulin commented 7 years ago

@NickNaso but I/O is non-blocking and already happens in parallel. You can pipe data from the main thread, which performs the I/O, to workers that perform the heavy computations.
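
A sketch of that split (worker_threads API used for illustration; the file name is hypothetical): the main thread streams a file and posts chunks to a worker that does the CPU-heavy work.

```js
const fs = require('fs');
const { Worker } = require('worker_threads');

// CPU-heavy stage lives in the worker; the main thread only does I/O.
const parser = new Worker(`
  const { parentPort } = require('worker_threads');
  let bytes = 0;
  parentPort.on('message', (chunk) => {
    if (chunk === null) {            // end-of-input sentinel
      parentPort.postMessage(bytes);
      parentPort.close();            // lets the worker exit
      return;
    }
    bytes += chunk.length;           // stand-in for real parsing work
  });
`, { eval: true });

parser.once('message', (n) => console.log('processed', n, 'bytes'));

// Note: chunks are copied by structured clone; transfer them to avoid the copy.
const stream = fs.createReadStream('big-file.csv');
stream.on('data', (chunk) => parser.postMessage(chunk));
stream.on('end', () => parser.postMessage(null));
```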

pemrouz commented 7 years ago

@addaleax, could you clarify what exactly you mean by "no Node API", btw? Do you mean no require?

addaleax commented 7 years ago

@pemrouz That would be up for debate. :) The possible interpretations range from no require at all to just the bare minimum (run JS code & communicate with the main thread), I guess.

NickNaso commented 7 years ago

@inikulin Yes, you are right, all the I/O is non-blocking, so in a future scenario I could read files from the main thread and pass the stream to the worker. That's fine, but I have some questions: in general, can I use require inside a Worker and import other modules? And what if the module is a native addon?

addaleax commented 7 years ago

In general, can I use require inside a Worker and import other modules?

The reason I’m asking for use cases is partly that I want to figure out whether that will be a goal or not.

And what if the module is a native addon?

That depends on the implementation; that’s something that likely wouldn’t be possible in thread-based Workers (or at least not out of the box; addons would have to opt into it), but would likely be possible in process-based workers.

But then again, if you’re using native addons, you already have access to multi-threading, at least in some form.

inikulin commented 7 years ago

The reason I’m asking for use cases is partly that I want to figure out whether that will be a goal or not.

Yes, it's definitely a desired feature. I guess any non-trivial computational task can be quite big in terms of source code size, so it would be quite nice to have the ability to split it into modules. Moreover, there can be reusable modules shared by the main thread and the workers.

NickNaso commented 7 years ago

@addaleax OK, now I have a clearer idea, so I can say that in my work I would use Worker to execute the parsing of my data. Sorry for my previous posts.

jokeyrhyme commented 7 years ago

@ljharb

Web Workers seem like they would make for a much more efficient sandbox than vm

Seems like the realms proposal might be a more specific solution for isolation: https://github.com/tc39/proposal-frozen-realms

ljharb commented 7 years ago

@jokeyrhyme indeed, that would help with isolation, but not with parallelization.

addaleax commented 7 years ago

Web Workers seem like they would make for a much more efficient sandbox than vm

Seems like the realms proposal might be a more specific solution for isolation: https://github.com/tc39/proposal-frozen-realms

As far as I can tell, there are things you can’t do with Realms that you could do with Workers, like limiting memory usage (which you’ll usually want when running untrusted code is your goal).
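
For illustration, a sketch of such a memory cap using the resourceLimits option that worker_threads eventually gained (the script path and limit values are illustrative):

```js
const { Worker } = require('worker_threads');

const sandbox = new Worker('/path/to/untrusted.js', { // hypothetical script
  resourceLimits: {
    maxOldGenerationSizeMb: 64, // cap on the main V8 heap
    maxYoungGenerationSizeMb: 8,
  },
});

sandbox.on('error', (err) => {
  // A worker that exceeds its budget dies with ERR_WORKER_OUT_OF_MEMORY,
  // without taking the main thread down with it.
  console.error('sandbox failed:', err.code || err.message);
});
```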

ljharb commented 7 years ago

Effectively what I want is both control over a Worker and over a Realm, but having a Worker gives me a Realm for free (and I assume that once the Realms API lands, I'll be able to use it in conjunction with creating a Worker, whether on the web or in node).

cpojer commented 7 years ago

There are a bunch of use cases that people have identified here, and if we zoom out, it all comes down to one high-level feature: map-reduce with efficient data-structure sharing. Almost all of the problems can be reduced to that: bundlers need worker processes that offload parsing/compilation to other threads/processes. Test runners would like to parallelize test runs. Client-side frameworks would like the ability to do server rendering efficiently and in parallel. Almost all of these are CPU-bound and currently slowed down by slow IPC. I'm happy to dogfood any implementation proposals in projects that we work on at Facebook.
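
A minimal worker-pool sketch of that map-reduce shape (worker_threads API used for illustration; the worker script is assumed to answer each posted item with one result message; production pools add queueing, error recovery, and warm restarts):

```js
const os = require('os');
const { Worker } = require('worker_threads');

// Map `items` over a pool of threads running `workerFile`,
// one in-flight job per thread, results collected in input order.
function mapInPool(items, workerFile) {
  return new Promise((resolve, reject) => {
    if (items.length === 0) return resolve([]);
    const size = Math.min(items.length, os.cpus().length);
    const results = new Array(items.length);
    let next = 0;
    let done = 0;
    const dispatch = (worker) => {
      if (next >= items.length) return void worker.terminate();
      const index = next++;
      worker.once('message', (result) => {
        results[index] = result;
        if (++done === items.length) resolve(results);
        dispatch(worker); // take the next job (or terminate)
      });
      worker.postMessage(items[index]);
    };
    for (let i = 0; i < size; i++) {
      const worker = new Worker(workerFile);
      worker.on('error', reject);
      dispatch(worker);
    }
  });
}

// e.g. mapInPool(sourceFiles, 'parse-job.js').then(reduceResults);
```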

zxc122333 commented 7 years ago

Workers may bring us some new programming models, like the actor model of Erlang and Akka.

Some libraries are exploring this, even in single-threaded environments:

shakyShane/actor-js, untu/comedy, alexeyraspopov/actor-system

NawarA commented 7 years ago

Today, we process billions of requests. We use cluster with n-1 workers, so at least one CPU remains available at the OS level to process incoming requests. That being said, as Node.js users, we have an interest in keeping the request-processing event loops unblocked.

In a normal software model, we have the hot path (the critical path to get to a response as fast as possible) and what can be considered background work. The ideal setup is to have worker event loops process the hot path and pass work to background workers whose sole purpose is to process non-critical yet important functions, such as logging and database reads and writes that can be done in the background. This work can be done after the hot path is complete, and the event loop processing requests doesn't have to busy itself with it, since it's an actor worried about a different objective.

Forgive the ASCII graphic... the model I propose looks something like this:

1 master -to- n cluster workers 
     |             |...|
n background process workers

The outcome is that the cluster master does its job, round-robining requests without being blocked and healing/spawning the cluster workers that process internet requests. Every cluster worker pushes its non-critical background logic to a background worker. This frees cluster workers to focus solely on throughput.

I recommend a postMessage design where you can pass a PID (possibly a class name?) and an object (a message): essentially the postMessage API with anycast capability.

I think this kind of capability, and this specific use case, would make Node.js software designs generally more efficient.
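
A sketch of that hot-path/background split (worker_threads API used for illustration; the anycast addressing proposed above is not shown, just a single long-lived background worker):

```js
const http = require('http');
const { Worker } = require('worker_threads');

// Long-lived background worker for non-critical jobs (logging, cache
// refresh, deferred writes). The hot path never waits on it.
const background = new Worker(`
  const { parentPort } = require('worker_threads');
  parentPort.on('message', (job) => {
    if (job.type === 'log') console.log(JSON.stringify(job.entry));
    // ...other job types: cache updates, deferred DB writes, etc.
  });
`, { eval: true });

http.createServer((req, res) => {
  res.end('ok'); // hot path: respond first
  background.postMessage({ type: 'log', entry: { url: req.url, at: Date.now() } });
}).listen(3000);
```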

tjconcept commented 7 years ago

I guess I have a use case, and if not, I hope someone can point me in another direction ;)

I have some static data that is loaded and indexed (a specialized tree structure). The service spends almost all of its CPU time traversing this (static) tree. The downside is that I waste almost 1GB of memory per process I scale to.

The best scenario I can imagine, being a guy with zero threads experience, is if a variable could simply be shared across all instances spawned by the cluster module (maybe in a frozen state), but I guess that's not feasible.
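
A sketch of how that could work with shared memory (worker_threads API used for illustration): the tree is encoded once into a SharedArrayBuffer, and every thread gets a view onto the same memory instead of its own ~1GB copy. The flat node encoding is hypothetical; object graphs can't be shared directly, only raw memory.

```js
const { Worker } = require('worker_threads');

// Build the index once, as flat records in shared memory:
// here, each node is [value, leftChildIndex, rightChildIndex].
const shared = new SharedArrayBuffer(3 * 4 * 1024); // room for 1024 nodes
const tree = new Int32Array(shared);
tree.set([42, 1, 2]); // hypothetical root node

for (let i = 0; i < 4; i++) {
  new Worker(`
    const { workerData } = require('worker_threads');
    const tree = new Int32Array(workerData); // same memory, no copy
    console.log('root value:', tree[0]);
  `, { eval: true, workerData: shared });
}
```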

pemrouz commented 7 years ago

I think @cpojer sums this up well: most of the use cases are basically map-reduce.

However, in my case (fero), I essentially have a microservice consuming one log of events, processing it and producing another to fan out to clients. I wanted to split this so the business logic is run in one thread, and another thread picks up the output and writes to the wire.

This seems like the perfect use case for shared memory, since the output is already a Buffer, so offloading the socket.write should make things faster. However, all the userland implementations I tried actually made performance worse than doing it in a single thread, so it'd be interesting to see whether native SharedArrayBuffers will actually provide a performance boost.

@addaleax, in response to your question, I think some form of require would be required. I need to be able to do some I/O, and even for most of the map-reduce cases where workers just process and respond, I think that without being able to access other files or npm modules they would be too limited to be useful.

refack commented 7 years ago

/cc @mogill

devsnek commented 7 years ago

I had an idea to use these for sharded network communication with external services, mainly over WebSocket. I'm not sure what the benefits of offloading JSON/ETF parsing to a worker would be, mainly since I'm not sure of the actual performance of a good structured-clone implementation, but I think use cases like this should be taken into account.