terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

Parallel execution for experiments #413

Closed lukaszett closed 8 months ago

lukaszett commented 8 months ago

Is your feature request related to a problem? Please describe. As the grid search for tuning transformers already allows parallel execution of multiple retrieval processes, it would make sense to add this to experiments as well. I know `.parallel()` already exists for transformers; however, I'd like a way to execute multiple pipelines at the same time.

Describe the solution you'd like I haven't looked any deeper into this, but I'd like an interface similar to that of GridSearch: parallelisation is triggered by specifying jobs and backend arguments when starting the search / experiment.
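For illustration, a minimal sketch of that interface on the Vaswani test collection. The `pt.GridSearch` call uses the existing `jobs`/`backend` arguments; the `jobs` and `backend` arguments on `pt.Experiment` do not exist and only show the requested behaviour:

```python
import pyterrier as pt
if not pt.started():
    pt.init()

dataset = pt.get_dataset("vaswani")
topics, qrels = dataset.get_topics(), dataset.get_qrels()
bm25 = pt.BatchRetrieve(dataset.get_index(), wmodel="BM25", controls={"c": 0.75})
tfidf = pt.BatchRetrieve(dataset.get_index(), wmodel="TF_IDF")

# Existing behaviour: GridSearch fans parameter settings out over
# multiple worker processes via the jobs/backend arguments.
best_bm25 = pt.GridSearch(
    bm25,
    {bm25: {"c": [0.1, 0.5, 0.75, 1.0]}},
    topics, qrels,
    jobs=4, backend="joblib",
)

# Requested behaviour (hypothetical -- pt.Experiment has no such arguments
# today): evaluate the listed pipelines concurrently in the same way.
pt.Experiment(
    [best_bm25, tfidf],
    topics, qrels,
    eval_metrics=["map", "ndcg"],
    jobs=2, backend="joblib",  # hypothetical arguments
)
```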

Describe alternatives you've considered None

Additional context This can be especially useful for experiments on pipelines that require external inputs. For example, I'm using the results of ElasticSearch and want to experiment with different rerankers implemented in PyTerrier. Waiting for the results of the first retrieval stage takes up a large amount of time that could be reduced with parallelism.

cmacdonald commented 8 months ago

Thanks for the information about your use case @lukaszett.

The joblib and Ray backends require serialization support, and that's a bit scary and difficult to get right, particularly when designing a good API for arbitrary custom transformers.

I'm also not sure how GPU-based rerankers perform using multiple threads/processes (the models end up being loaded multiple times).

I wonder if your Elastic transformer could be configured to use multithreading?
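Something along these lines might work: a first-stage transformer that issues one Elasticsearch search per query from a thread pool. This is only a sketch; the class name, host, index and field names are placeholders, and the keyword form of `es.search` assumes the elasticsearch-py 8.x client:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import pyterrier as pt
from elasticsearch import Elasticsearch  # assumes elasticsearch-py 8.x


class ESFirstStage(pt.Transformer):
    """First-stage retrieval from Elasticsearch, one thread per query."""

    def __init__(self, hosts="http://localhost:9200", index="docs",
                 field="text", k=100, threads=8):
        self.es = Elasticsearch(hosts)  # the client is safe to share across threads
        self.index, self.field, self.k, self.threads = index, field, k, threads

    def _search_one(self, qid, query):
        res = self.es.search(index=self.index, size=self.k,
                             query={"match": {self.field: query}})
        return [{"qid": qid, "query": query, "docno": hit["_id"],
                 "score": hit["_score"], "rank": rank}
                for rank, hit in enumerate(res["hits"]["hits"])]

    def transform(self, topics):
        # issue the per-query searches concurrently from a thread pool
        with ThreadPoolExecutor(max_workers=self.threads) as pool:
            per_query = pool.map(lambda t: self._search_one(t.qid, t.query),
                                 topics.itertuples(index=False))
        return pd.DataFrame([row for rows in per_query for row in rows])
```

The reranking pipeline would then stay as before, e.g. `ESFirstStage(index="my_index") >> my_reranker`, but the first-stage wait is bounded by the slowest query rather than the sum of all of them.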

lukaszett commented 8 months ago

That concern is understandable; I wouldn't feel comfortable making this change myself either. ;-)

I'll just parallelise my first stage! Thanks!
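For reference, a minimal sketch of parallelising a first stage with the existing API, assuming the first-stage transformer is picklable (as BatchRetrieve is):

```python
import pyterrier as pt
if not pt.started():
    pt.init()

dataset = pt.get_dataset("vaswani")
bm25 = pt.BatchRetrieve(dataset.get_index(), wmodel="BM25")

# .parallel(N) partitions the incoming topics over N worker processes
# (joblib backend by default); the wrapped transformer must be picklable.
fast_bm25 = bm25.parallel(4)
res = fast_bm25.transform(dataset.get_topics())
```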