pystorm / streamparse

Run Python in Apache Storm topologies. Pythonic API, CLI tooling, and a topology DSL.
http://streamparse.readthedocs.io/
Apache License 2.0

More tasks per executor for network-intensive tasks #314

Open lboudard opened 7 years ago

lboudard commented 7 years ago

Hi,

I've seen in the 'Storm Applied' book that bolts that have to wait on IO/network, e.g. querying an API or upserting into a remote DB, should be configured with more tasks per executor (`topology.tasks`). http://storm.apache.org/releases/1.0.2/Understanding-the-parallelism-of-a-Storm-topology.html
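
For reference, here's a minimal sketch of how that executors-vs-tasks split could be expressed in the streamparse topology DSL (the component names and modules are hypothetical, and I'm assuming `.spec()` accepts a per-component `config` dict, where `"topology.tasks"` plays the role of Storm's `setNumTasks()`):

```python
from streamparse import Topology

from bolts.db import DbBolt          # hypothetical module
from spouts.words import WordSpout   # hypothetical module


class DbTopology(Topology):
    word_spout = WordSpout.spec()
    # par is the parallelism hint, i.e. the number of executors;
    # "topology.tasks" requests 4 tasks spread over the 2 executors.
    db_bolt = DbBolt.spec(
        inputs=[word_spout],
        par=2,
        config={"topology.tasks": 4},
    )
```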

However, I was wondering what the general best practice is in the case of a ShellBolt that manages a Python process for such tasks (my guess is that increasing tasks per executor will not help at all?).

Currently I'm testing a topology that processes batched fetches/upserts to a remote DB within batching bolts. Performance is more or less OK, though those bolts tend to have rather high process latency (by the way, execute latency doesn't seem correct in the case of a batching bolt), and I was wondering what the optimal parallelism setup would be.
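
For context, the batching bolts look roughly like the sketch below (`UpsertBolt` and `bulk_upsert` are hypothetical names). Note that process latency is measured from receive to ack, so with a batching bolt it includes the time a tuple sits in the buffer waiting for the batch to flush, while execute latency only covers the cheap buffer append; that would explain the numbers looking off:

```python
from streamparse import BatchingBolt


class UpsertBolt(BatchingBolt):
    """Buffers incoming tuples and writes them to the remote DB in bulk."""

    ticks_between_batches = 5  # flush roughly every 5 tick tuples

    def process_batch(self, key, tups):
        # Tuples are only acked after this returns, so time spent
        # buffered counts toward process latency.
        rows = [tup.values for tup in tups]
        bulk_upsert(rows)  # hypothetical bulk-write helper for the remote DB
```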

Have you experienced such use cases, and do you have any advice on this?

Thanks!

fedelemantuano commented 7 years ago

Hi, is it possible to set executors and tasks with streamparse?

The manual defines executors as threads in the JVM and tasks as instances of bolts/spouts. But I think that in streamparse every bolt/spout is a process. Is that right?

@lboudard maybe there is a typo in the link (http://storm.apache.org/releases/1.0.2/Understanding-the-parallelism-of-a-Storm-topology.html)

Thanks

dan-blanchard commented 7 years ago

@lboudard, when you're dealing with multi-lang components (like in streamparse), each executor (JVM thread) maps to a single Python process. Increasing the number of tasks per executor wouldn't buy you anything (and would likely make things slower), because Python can't actually process two tuples simultaneously. If I ever get the time to finish pystorm/pystorm#24, we would have asynchronous components that would offer higher throughput in these scenarios.
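
In other words, for a Python component the lever is the number of executors (`par` in the streamparse DSL), since each executor gets its own Python subprocess. A minimal sketch, with hypothetical component names:

```python
from streamparse import Topology

from bolts.db import DbBolt          # hypothetical module
from spouts.words import WordSpout   # hypothetical module


class IoTopology(Topology):
    word_spout = WordSpout.spec()
    # par=8 -> 8 executors -> 8 independent Python subprocesses doing
    # blocking IO in parallel; extra tasks per executor would all be
    # multiplexed onto a single Python process and gain nothing.
    db_bolt = DbBolt.spec(inputs=[word_spout], par=8)
```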