microsoftarchive / data-pipeline

Exploring the problem of high-scale data ingestion on Azure
MIT License

Consider the implementation of Dataflow in Dispatcher #19

Open bennage opened 9 years ago

bennage commented 9 years ago

The Dispatcher.EventProcessorHost/EventProcessor uses TPL Dataflow to bound concurrency.

Currently, a new ActionBlock is instantiated for each event received. Since nothing about the block is event-specific, would instantiating it once at the event processor level have a measurable performance impact?

We need to measure the performance impact and make the change if there is a benefit.
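
For reference, the pattern in question looks roughly like the sketch below. It is not the repository's code: it assumes the block is created once per call that delivers events, `string` stands in for the real event type, and `PerCallBlockSketch`, `ProcessAsync`, and `handler` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class PerCallBlockSketch
{
    // Hypothetical stand-in for the processor method that receives events.
    public static async Task ProcessAsync(
        IEnumerable<string> events,
        Func<string, Task> handler,
        int maxConcurrency)
    {
        // A fresh block per call: nothing here depends on the events themselves,
        // which is what raises the question of hoisting it to the processor level.
        var block = new ActionBlock<string>(
            handler,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxConcurrency });

        foreach (var evt in events)
        {
            // The block is unbounded by default, so Post accepts every item here.
            block.Post(evt);
        }

        // Completing the block is what makes Completion awaitable for this call.
        block.Complete();
        await block.Completion;
    }
}
```

Any measurement would compare this against a variant that reuses one long-lived block, keeping in mind that a long-lived block cannot be completed per call.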

mabsimms commented 9 years ago

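Instantiating it at the event processor level wouldn't allow awaiting completion of all of the messages: a block's Completion task only finishes after Complete() has been called, and a completed block accepts no further input. We would need to rework this into a continuous pipeline along the lines of:

BufferBlock (bounded depth) -> ActionBlock (bounded concurrency)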

where the action block updates the "latest event" processed, which can then be periodically checkpointed from the ProcessEvents call (as the checkpoint methods can ONLY be called within the context of event processor host methods).
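
A minimal sketch of such a pipeline follows, under stated assumptions: events are reduced to a `long` sequence number, and `ContinuousPipelineSketch`, `EnqueueAsync`, `LatestProcessed`, and `ShutdownAsync` are hypothetical names, not part of the repository or of the event processor host API.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public sealed class ContinuousPipelineSketch
{
    private readonly BufferBlock<long> _buffer;
    private readonly ActionBlock<long> _worker;
    private long _latestProcessed = -1;

    public ContinuousPipelineSketch(int bufferDepth, int maxConcurrency, Func<long, Task> handler)
    {
        // Bounded depth: SendAsync waits once the buffer is full (back-pressure).
        _buffer = new BufferBlock<long>(
            new DataflowBlockOptions { BoundedCapacity = bufferDepth });

        // Bounded concurrency. The worker is also given a bounded capacity so
        // pending items stay in the buffer instead of an unbounded input queue.
        _worker = new ActionBlock<long>(
            async seq =>
            {
                await handler(seq);
                RecordProcessed(seq);
            },
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = maxConcurrency,
                BoundedCapacity = maxConcurrency
            });

        _buffer.LinkTo(_worker, new DataflowLinkOptions { PropagateCompletion = true });
    }

    // Read this from inside the host's processing method before checkpointing.
    public long LatestProcessed => Interlocked.Read(ref _latestProcessed);

    public Task<bool> EnqueueAsync(long seq) => _buffer.SendAsync(seq);

    public Task ShutdownAsync()
    {
        _buffer.Complete();
        return _worker.Completion;
    }

    // Keeps the highest sequence number seen so far, without locks.
    private void RecordProcessed(long seq)
    {
        long current;
        while (seq > (current = Interlocked.Read(ref _latestProcessed)))
        {
            if (Interlocked.CompareExchange(ref _latestProcessed, seq, current) == current)
            {
                break;
            }
        }
    }
}
```

One caveat with this sketch: with more than one concurrent worker, the highest processed sequence number can run ahead of lower-numbered events that are still in flight, so a real implementation would probably need to track the highest contiguous sequence number before checkpointing.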

mabsimms commented 9 years ago

Otherwise, replace the ActionBlock with a Parallel.For, but my preference would be the continuous processing pipeline in TPL Dataflow.
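
For comparison, the Parallel.For alternative might look like the sketch below. It assumes a synchronous per-event handler and uses `string` as a stand-in event type; `ParallelForSketch` and `ProcessBatch` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelForSketch
{
    public static void ProcessBatch(
        IReadOnlyList<string> events,
        Action<string> handler,
        int maxConcurrency)
    {
        // Concurrency is bounded through ParallelOptions rather than a dataflow block.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxConcurrency };

        // Blocks the calling thread until every event in the batch has been handled.
        Parallel.For(0, events.Count, options, i => handler(events[i]));
    }
}
```

Parallel.For does not await asynchronous delegates, so this form only fits handlers that do their work synchronously.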