MikeChenfu closed this issue 5 years ago.
`_emit` is a very simple implementation: all it does is call `.update` on all the currently subscribed downstreams, collate the results, and return them.

`emit` is a much more advanced implementation, which takes into account whether the pipeline is being run asynchronously and handles the loop synchronization, coroutine generation, and so on. If you expect to use or support the async aspects of the library, you should use `.emit`.
How different is the performance? Can you post a profile of the pipeline?
To be sure, the "performance" will very likely depend upon what your pipeline is doing and how you have set it up. In some circumstances (no async/ioloops), `emit` just calls `_emit`. If you are running async, however, each subsequent task in `emit` will generally be scheduled on the next ioloop tick, rather than running in blocking mode as in `_emit`.
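A minimal toy sketch of that split (this is an illustrative mock, not the actual streamz source; `Node` and `Sink` are invented names):

```python
class Node:
    """Toy model of the _emit/emit relationship described above."""
    def __init__(self):
        self.downstreams = []

    def _emit(self, x):
        # Simple synchronous core: call .update on every subscribed
        # downstream and collate the results.
        return [child.update(x) for child in self.downstreams]

    def emit(self, x):
        # Public entry point.  In real streamz this additionally checks
        # for a running IOLoop and schedules coroutines; with no event
        # loop involved it reduces to the same work as _emit.
        return self._emit(x)


class Sink:
    def __init__(self):
        self.seen = []

    def update(self, x):
        self.seen.append(x)
        return x


node = Node()
sink = Sink()
node.downstreams.append(sink)
node.emit(1)
node.emit(2)
print(sink.seen)  # [1, 2]
```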
Thanks @CJ-Wright and @martindurant. In my understanding, `emit` should theoretically perform better than `_emit`, but my result does not show that. I process 50 files on two dask workers in my code.

Here is the result when I use `emit`: it takes about 90 seconds in total.

Here is the result when I use `_emit`: it takes about 40 seconds, and you can see that reading files overlaps with the kernel operations.
`emit()` calls `_emit()`, so there is no way it could be more efficient: https://github.com/python-streamz/streamz/blob/master/streamz/core.py#L306
It would be interesting to find out what is happening during those white stripes: I assume this is back-pressure in action, the system is waiting for futures to become finished and gathering results before firing off more work.
@martindurant I am also curious about these white stripes.
Besides, I just tried `.emit(fn, asynchronous=True)` and got good performance, similar to `_emit`'s result. Is it possible that `.emit` was treating the process as synchronous? I process 50 csv files in my code.
Here is my code to read the files (with the imports it needs):

```python
import glob

from tornado.ioloop import IOLoop

async def f():
    for fn in glob.glob('data/*.csv'):
        source.emit(fn, asynchronous=True)

IOLoop.current().add_callback(f)
```
Would it be possible to post the pipeline you are using? I suspect that by calling `_emit` you're effectively ignoring backpressure. My guess is that you want a `buffer` somewhere in your pipeline.
@CJ-Wright Here is the graph showing the workflow.
Yep, @mrocklin is correct: you need a `buffer` before your `gather` nodes so multiple things can be processed at once. Otherwise the pipeline will wait for each computation to finish before the next item is processed.
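To illustrate why buffering helps, here is a toy sketch using a plain thread pool rather than streamz/dask (`slow_task` is an invented stand-in for the user's per-file work): without a buffer each item is fully processed before the next one starts, while a buffer keeps several tasks in flight so their latencies overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_task(i):
    time.sleep(0.05)   # stand-in for I/O or compute latency
    return i * i

# Unbuffered: wait for each result before starting the next (serial).
start = time.perf_counter()
serial = [slow_task(i) for i in range(8)]
serial_time = time.perf_counter() - start

# "Buffered": keep up to 4 tasks in flight at once.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    buffered = list(pool.map(slow_task, range(8)))
buffered_time = time.perf_counter() - start

print(serial == buffered)            # True: identical results
print(buffered_time < serial_time)   # True: overlapping work finishes sooner
```

The buffer size here plays the same role as `max_workers`: it bounds how much work is outstanding at once, which is exactly the backpressure that calling `_emit` directly bypasses.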
@CJ-Wright @mrocklin Yeah, I get a better result now. Are there any tips for setting the size of the `buffer`? I tried different buffer sizes and got different performance.
The size of the buffer depends on how much compute and RAM you have available. I usually go with larger numbers and let dask figure it out.
I think this is mostly resolved. Discussion of buffer size should most likely be an independent issue (and a PR into the docs if we are able to come up with a heuristic approach).
@MikeChenfu please feel free to re-open if you need more help with this issue.
I am new to Streamz. I just tried some examples and found that `._emit()` performs better than `.emit()`. I would really appreciate it if someone could give me some details about the difference between them.