terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

Parallel GridSearch on Pipeline Fails #427

Closed carsonwoods closed 4 months ago

carsonwoods commented 5 months ago

Describe the bug I am a relatively new user of PyTerrier. To familiarize myself with it, I am tuning a pipeline using GridSearch. Using the default joblib backend, I tried to parallelize my GridSearch with 8 jobs (jobs=8). The GridSearch runs fine sequentially, but fails as soon as parallelization is enabled. I've included the full error message in the reproduction steps below.

To Reproduce Steps to reproduce the behavior:

  1. My dataset was provided as part of an academic course, but we were told it was comparable to the TREC Robust 2004 dataset. The index was created using the provided pt.TRECCollectionIndexer function.
  2. The retrieval being used was a BM25 BatchRetrieve with RM3 query expansion.
  3. The final RM3 pipeline was instantiated using the following code: rm3_pipe = tuned_bm25 >> rm3 >> tuned_bm25 where tuned_bm25 was a BM25 BatchRetrieve and rm3 was a pt.rewrite.RM3 object.
  4. The grid search was run using the following code:
    rm3_tuned_pipe = pt.GridSearch(
        pipeline=rm3_pipe,
        params=rm3_params,
        topics=training_topics,
        qrels=training_qrels,
        metric="map",
        jobs=8,
    )
  5. This resulted in the following error:
    
    joblib.externals.loky.process_executor._RemoteTraceback:
    """
    Traceback (most recent call last):
    File "/Users/carsonwoods/anaconda3/lib/python3.11/site-packages/joblib/externals/loky/backend/queues.py", line 125, in _feed
    obj_ = dumps(obj, reducers=reducers)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/carsonwoods/anaconda3/lib/python3.11/site-packages/joblib/externals/loky/backend/reduction.py", line 211, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
    File "/Users/carsonwoods/anaconda3/lib/python3.11/site-packages/joblib/externals/loky/backend/reduction.py", line 204, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
    File "/Users/carsonwoods/anaconda3/lib/python3.11/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 632, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
    File "<stringsource>", line 2, in jnius.JavaClass.__reduce_cython__
    TypeError: no default __reduce__ due to non-trivial __cinit__
    """

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/carsonwoods/Library/CloudStorage/Dropbox/School/Spring-2024/CS572-Information-Retrieval/cs572-information-retreival/hw1/rm3_ranker.py", line 152, in <module>
rm3_tuned_pipe = pt.GridSearch(
                 ^^^^^^^^^^^^^^
File "/Users/carsonwoods/anaconda3/lib/python3.11/site-packages/pyterrier/pipelines.py", line 752, in GridSearch
grid_outcomes = GridScan(
                ^^^^^^^^^
File "/Users/carsonwoods/anaconda3/lib/python3.11/site-packages/pyterrier/pipelines.py", line 906, in GridScan
eval_list = parallel_lambda(_evaluate_several_settings, batched_inputs, jobs, backend=backend)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/carsonwoods/anaconda3/lib/python3.11/site-packages/pyterrier/parallel.py", line 52, in parallel_lambda
return _parallel_lambda_joblib(function, inputs, jobs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/carsonwoods/anaconda3/lib/python3.11/site-packages/pyterrier/parallel.py", line 63, in _parallel_lambda_joblib
return parallel_mp(
       ^^^^^^^^^^^^
File "/Users/carsonwoods/anaconda3/lib/python3.11/site-packages/joblib/parallel.py", line 1098, in __call__
self.retrieve()
File "/Users/carsonwoods/anaconda3/lib/python3.11/site-packages/joblib/parallel.py", line 975, in retrieve
self._output.extend(job.get(timeout=self.timeout))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/carsonwoods/anaconda3/lib/python3.11/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
return future.result(timeout=timeout)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/carsonwoods/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
       ^^^^^^^^^^^^^^^^^^^
File "/Users/carsonwoods/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.
/Users/carsonwoods/anaconda3/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '


**Expected behavior**
I expected the GridSearch to run across 8 jobs and return the best version of the pipeline. This was the behavior when I tried tuning a simple BM25 BatchRetrieve using GridSearch.

**Documentation and Issues**
 - [x] I have checked the [PyTerrier documentation](https://pyterrier.readthedocs.io/en/latest/) for relevant content
 - [x] I have checked for previous relevant [PyTerrier issues](https://github.com/terrier-org/pyterrier/issues)

**Additional context**
```bash
$ uname -a
Darwin Carsons-2021-MacBook-Pro.local 23.3.0 Darwin Kernel Version 23.3.0: Wed Dec 20 21:30:44 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6000 arm64
$ python --version
Python 3.11.5
$ pip show python-terrier
Name: python-terrier
Version: 0.10.0
...
$ java --version
java 18 2022-03-22
Java(TM) SE Runtime Environment (build 18+36-2087)
Java HotSpot(TM) 64-Bit Server VM (build 18+36-2087, mixed mode, sharing)
```

cmacdonald commented 5 months ago

Hey @carsonwoods, thanks for the report.

I think the issue is that parallel GridSearch requires the constituent parts of the pipeline to be picklable.

For instance, BatchRetrieve has __reduce__ etc: https://github.com/terrier-org/pyterrier/blob/master/pyterrier/batchretrieve.py#L279-L300

but QueryExpansion does not.
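
For readers unfamiliar with the mechanism, here is a minimal sketch of why a `__reduce__` method makes the difference. It uses only the standard library; `NativeHandle`, `NaiveRewriter`, `PicklableRewriter` and `can_pickle` are hypothetical names invented for illustration and are not PyTerrier classes. A `threading.Lock` stands in for the JVM-backed object that failed to pickle in the traceback above, and the real PyTerrier implementation may well differ.

```python
import pickle
import threading


class NativeHandle:
    """Hypothetical stand-in for a JVM-backed object (not a PyTerrier class)."""

    def __init__(self, index_path):
        self.index_path = index_path
        # A lock cannot be pickled, much like the jnius.JavaClass in the traceback.
        self._lock = threading.Lock()


class NaiveRewriter:
    """Holds an unpicklable handle and relies on default pickling."""

    def __init__(self, index_path, fb_docs=3):
        self.index_path = index_path
        self.fb_docs = fb_docs
        self.handle = NativeHandle(index_path)


class PicklableRewriter(NaiveRewriter):
    """Same object, but pickled as 'class + constructor arguments'."""

    def __reduce__(self):
        # A worker rebuilds the object (and a fresh handle) by calling the constructor.
        return (self.__class__, (self.index_path, self.fb_docs))


def can_pickle(obj):
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    # Default pickling walks into the handle and fails on the lock.
    print(can_pickle(NaiveRewriter("./index")))
    # With __reduce__, only the constructor arguments cross the process boundary.
    clone = pickle.loads(pickle.dumps(PicklableRewriter("./index", fb_docs=10)))
    print(clone.fb_docs, type(clone.handle).__name__)
```

The design choice in this sketch is to serialise the recipe for building the object rather than its live state, so each worker process opens its own handle instead of receiving a copy of the parent's.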

For testing, we would need to add some RM3-like pipelines to https://github.com/terrier-org/pyterrier/blob/master/tests/test_pickle.py

If you are super-keen for the functionality, would you be able to try to form a PR for this?

carsonwoods commented 5 months ago

Hi @cmacdonald, thanks for the quick reply! That makes sense as to why I was seeing that issue.

I'd love to work on a PR for this, but I'm currently a bit swamped with work for my academic program. If no one else works on this, I will try to take this on when I have time, but I'm unsure how quickly that will happen.

cmacdonald commented 4 months ago

Hi @carsonwoods

#430 is a PR in which I think I have addressed the pickling. If you already have a parallelised environment for GridSearch, could you try this branch?

carsonwoods commented 4 months ago

@cmacdonald Thanks for working on this! I tested my environment against your PR and everything works perfectly. Thanks again for getting this capability added so quickly. Feel free to close this issue once the PR is merged!

cmacdonald commented 4 months ago

Ok, I will take it as working and merge. Glad you found it useful.