Currently a document map module feeds one document at a time to worker processes and then processes the results as they come back. I suspect that this:
- works well when processing takes a substantial amount of time
- introduces unnecessary overhead when processing is very fast.
In the latter case (e.g. a simple text document filter), we probably end up spending quite a bit of time just passing documents back and forth, with the worker processes waiting for postprocessing to complete before they can get on with the next documents.
A possible solution is to send batches of documents in this case.
8a8143ca33073a0c6165507b43677eba61266f9f makes it possible to batch documents in the input feeder.
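To make the idea concrete, here is a minimal sketch of grouping a document stream into batches. It is not the code from the commit above: the function name `batched` and its arguments are hypothetical, chosen only for illustration.

```python
from itertools import islice


def batched(docs, batch_size):
    """Yield lists of up to batch_size documents from any iterable.

    Hypothetical helper for illustration; not the input feeder's actual API.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(docs)
    while True:
        # Pull the next batch_size items; the last batch may be shorter
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in batched(["doc%d" % i for i in range(5)], 2):
        print(batch)
```

Each worker would then receive a whole list per round trip, so the fixed cost of passing data between processes is paid once per batch rather than once per document.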
A key question is how big the batches should be and how this should be set. Note that the input feeder can cope with this parameter being changed in the middle of processing. So, we could make it adaptive: measure how long it's taking to process each doc on average and adjust the batch size to match.
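One way the adaptive approach might look, as a rough sketch only: keep a moving average of per-document processing time and pick the batch size so that a batch takes roughly a target amount of time. The class name `AdaptiveBatchSizer`, the target of 0.1 seconds per batch, the size limits and the smoothing factor are all assumptions for illustration, not anything measured or already implemented.

```python
class AdaptiveBatchSizer:
    """Suggest a batch size from observed per-document processing time.

    Hypothetical sketch: all names and default values are assumptions.
    """

    def __init__(self, target_batch_seconds=0.1, min_size=1, max_size=1000,
                 smoothing=0.2):
        self.target_batch_seconds = target_batch_seconds
        self.min_size = min_size
        self.max_size = max_size
        self.smoothing = smoothing
        self.avg_doc_seconds = None  # No measurements yet

    def record(self, num_docs, elapsed_seconds):
        """Update the moving average with the timing of one finished batch."""
        if num_docs < 1:
            return
        per_doc = elapsed_seconds / num_docs
        if self.avg_doc_seconds is None:
            self.avg_doc_seconds = per_doc
        else:
            # Exponential moving average, so recent batches count for more
            self.avg_doc_seconds = (self.smoothing * per_doc
                                    + (1 - self.smoothing) * self.avg_doc_seconds)

    @property
    def batch_size(self):
        """Current suggested batch size, clamped to [min_size, max_size]."""
        if self.avg_doc_seconds is None:
            # Nothing measured yet: fall back to one document at a time
            return self.min_size
        if self.avg_doc_seconds <= 0:
            # Processing too fast to measure: use the largest allowed batch
            return self.max_size
        size = int(round(self.target_batch_seconds / self.avg_doc_seconds))
        return max(self.min_size, min(self.max_size, size))
```

The feeder could read `batch_size` before assembling each batch and call `record()` whenever a batch's results come back. Slow modules would then stay at one document per batch, while very fast ones would drift towards larger batches.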
[ ] Establish whether batching can actually speed up processing and under what circumstances it's useful.
[ ] Either provide a parameter to set batch size, or come up with a way to set it automatically on the basis of per-doc processing time.