markgw / pimlico

The Pimlico Processing Toolkit
http://pimlico.readthedocs.org/
GNU Lesser General Public License v3.0

Batch processing of documents in map modules #25

Open markgw opened 4 years ago

markgw commented 4 years ago

Currently a document map module feeds one document at a time to worker processes and then processes the results as they come back. I suspect that this works well when each document takes a long time to process, but is inefficient when processing each document is very fast.

In the latter case (e.g. a simple text document filter), we probably end up spending quite a bit of time just passing documents back and forth, with the worker processes waiting for the postprocessing to complete before they can get on with the next documents.

A possible solution is to send batches of documents in this case. Commit 8a8143ca33073a0c6165507b43677eba61266f9f makes it possible to batch documents in the input feeder.
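To make the idea concrete, here is a minimal sketch of batching at the feeder side. It is not the code from the commit above: `iter_batches` is a hypothetical helper name, and it assumes the documents arrive as a plain Python iterable.

```python
from itertools import islice


def iter_batches(documents, batch_size):
    """Yield lists of up to `batch_size` documents from any iterable.

    Hypothetical helper for illustration only, not part of Pimlico's API.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(documents)
    while True:
        # Take the next chunk; the last one may be shorter than batch_size
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded list would then be sent to a worker as a single item, so the cost of passing data between processes is paid once per batch rather than once per document.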

A key question is how big the batches should be and how this should be set. Note that the input feeder can cope with this parameter being changed in the middle of processing. So, we could make it adaptive: measure how long it's taking to process each doc on average and adjust the batch size to match.
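One way the adaptive version might look is sketched below. Everything here is an assumption for illustration: the class name `AdaptiveBatchSizer`, the target of roughly half a second of work per batch, and the use of an exponential moving average of per-document time are my choices, not something already in Pimlico.

```python
class AdaptiveBatchSizer:
    """Hypothetical helper: choose a batch size so that one batch takes
    roughly `target_seconds` to process, based on observed timings."""

    def __init__(self, target_seconds=0.5, min_size=1, max_size=1000, smoothing=0.2):
        self.target_seconds = target_seconds
        self.min_size = min_size
        self.max_size = max_size
        self.smoothing = smoothing    # weight given to the newest measurement
        self.avg_doc_seconds = None   # moving average of time per document
        self.batch_size = min_size    # start small until we have timings

    def record(self, num_docs, elapsed_seconds):
        """Update the average with one finished batch; return the new batch size."""
        if num_docs <= 0:
            return self.batch_size
        per_doc = elapsed_seconds / num_docs
        if self.avg_doc_seconds is None:
            self.avg_doc_seconds = per_doc
        else:
            self.avg_doc_seconds = (
                self.smoothing * per_doc
                + (1 - self.smoothing) * self.avg_doc_seconds
            )
        if self.avg_doc_seconds <= 0:
            # Too fast to measure: use the largest batch allowed
            self.batch_size = self.max_size
        else:
            ideal = int(round(self.target_seconds / self.avg_doc_seconds))
            self.batch_size = max(self.min_size, min(self.max_size, ideal))
        return self.batch_size
```

The feeder would read `batch_size` before building each batch, and the code collecting results would call `record()` with the size of each completed batch and how long it took. Slow modules would then settle at a batch size of 1, which matches the current behaviour.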