sherpa-ai / sherpa

Hyperparameter optimization that enables researchers to experiment, visualize, and scale quickly.
http://parameter-sherpa.readthedocs.io/
GNU General Public License v3.0

Parallel Sherpa MongoDB access issues #105

Open djgagne opened 4 years ago

djgagne commented 4 years ago

I tried running the parallel simple.py and mnistmlp examples, but each time the jobs/trial_*.out files contain the following error about connecting to the database.

2020-09-03 10:40:27.957058: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
warning in stationary: failed to import cython module: falling back to numpy
warning in coregionalize: failed to import cython module: falling back to numpy
warning in choleskies: failed to import cython module: falling back to numpy
Traceback (most recent call last):
  File "trial.py", line 79, in <module>
    trial = client.get_trial()
  File "/glade/u/home/dgagne/miniconda3/envs/goes/lib/python3.7/site-packages/sherpa/database.py", line 222, in get_trial
    t = next(g)
  File "/glade/u/home/dgagne/miniconda3/envs/goes/lib/python3.7/site-packages/sherpa/database.py", line 221, in <genexpr>
    g = (entry for entry in self.db.trials.find({'trial_id': trial_id}))
  File "/glade/u/home/dgagne/miniconda3/envs/goes/lib/python3.7/site-packages/pymongo/cursor.py", line 1207, in next
    if len(self.__data) or self._refresh():
  File "/glade/u/home/dgagne/miniconda3/envs/goes/lib/python3.7/site-packages/pymongo/cursor.py", line 1100, in _refresh
    self.__session = self.__collection.database.client._ensure_session()
  File "/glade/u/home/dgagne/miniconda3/envs/goes/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1816, in _ensure_session
    return self.__start_session(True, causal_consistency=False)
  File "/glade/u/home/dgagne/miniconda3/envs/goes/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1766, in __start_session
    server_session = self._get_server_session()
  File "/glade/u/home/dgagne/miniconda3/envs/goes/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1802, in _get_server_session
    return self._topology.get_server_session()
  File "/glade/u/home/dgagne/miniconda3/envs/goes/lib/python3.7/site-packages/pymongo/topology.py", line 488, in get_server_session
    None)
  File "/glade/u/home/dgagne/miniconda3/envs/goes/lib/python3.7/site-packages/pymongo/topology.py", line 217, in _select_servers_loop
    (self._error_message(selector), timeout, self.description))
pymongo.errors.ServerSelectionTimeoutError: casper26:27001: [Errno 111] Connection refused, Timeout: 30s, Topology Description: <TopologyDescription id: 5f511c8580fc9c3448b850b1, topology_type: Single, servers: [<ServerDescription ('casper26', 27001) server_type: Unknown, rtt: None, error=AutoReconnect('casper26:27001: [Errno 111] Connection refused')>]>

Any ideas on what may be going wrong? I installed MongoDB through conda. The main program completes with no errors, but it prints no summary results at the end.
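For anyone debugging the same thing: "[Errno 111] Connection refused" means nothing accepted a TCP connection at that host and port from the machine running the trial. A minimal sketch for checking this by hand, using only the standard library, is below. The helper name `can_connect` is made up for illustration and is not part of Sherpa or pymongo.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper for illustration; not part of Sherpa or pymongo.
    """
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False
```

Calling `can_connect("casper26", 27001)` from the node where the trial runs, while the main program is still running, should show whether the database is reachable from there at all.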

ggantos commented 4 years ago

Hello, I would like to second that I am having the same issue. Any help would be appreciated. Thanks!

2020-09-04 09:09:25.204755: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "train_conv2d_zdist_sherpa_parallel.py", line 33, in <module>
    trial = client.get_trial()
  File "/glade/u/home/ggantos/miniconda3/envs/sherpa/lib/python3.6/site-packages/sherpa/database.py", line 222, in get_trial
    t = next(g)
  File "/glade/u/home/ggantos/miniconda3/envs/sherpa/lib/python3.6/site-packages/sherpa/database.py", line 221, in <genexpr>
    g = (entry for entry in self.db.trials.find({'trial_id': trial_id}))
  File "/glade/u/home/ggantos/miniconda3/envs/sherpa/lib/python3.6/site-packages/pymongo/cursor.py", line 1207, in next
    if len(self.__data) or self._refresh():
  File "/glade/u/home/ggantos/miniconda3/envs/sherpa/lib/python3.6/site-packages/pymongo/cursor.py", line 1100, in _refresh
    self.__session = self.__collection.database.client._ensure_session()
  File "/glade/u/home/ggantos/miniconda3/envs/sherpa/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1816, in _ensure_session
    return self.__start_session(True, causal_consistency=False)
  File "/glade/u/home/ggantos/miniconda3/envs/sherpa/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1766, in __start_session
    server_session = self._get_server_session()
  File "/glade/u/home/ggantos/miniconda3/envs/sherpa/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1802, in _get_server_session
    return self._topology.get_server_session()
  File "/glade/u/home/ggantos/miniconda3/envs/sherpa/lib/python3.6/site-packages/pymongo/topology.py", line 488, in get_server_session
    None)
  File "/glade/u/home/ggantos/miniconda3/envs/sherpa/lib/python3.6/site-packages/pymongo/topology.py", line 217, in _select_servers_loop
    (self._error_message(selector), timeout, self.description))
pymongo.errors.ServerSelectionTimeoutError: casper24:27001: [Errno 111] Connection refused, Timeout: 30s, Topology Description: <TopologyDescription id: 5f5258b31496961c60064812, topology_type: Single, servers: [<ServerDescription ('casper24', 27001) server_type: Unknown, rtt: None, error=AutoReconnect('casper24:27001: [Errno 111] Connection refused',)>]>

bluevex commented 4 years ago

I got this error when something else was already using the port. Usually it's the MongoDB instance left over from a previous Sherpa run. I had to write a script that deletes the 'sherpa' database from the previous run and kills any leftover instances of mongo.
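A rough sketch of what such a cleanup script might look like follows. It is not the original script. The function names are made up, the default port 27001 is taken from the error messages above, and it assumes the leftover `mongod` was started with a `--port` argument on the machine where the script runs.

```python
import os
import signal
import subprocess


def mongod_pids(ps_lines, port):
    """Return PIDs of mongod processes started with `--port <port>`.

    ps_lines: lines of `ps -eo pid,args` output.
    Assumes the port was passed on the command line as `--port N`.
    """
    pids = []
    for line in ps_lines:
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # header or malformed line
        args = fields[1:]
        if args[0].endswith("mongod") and "--port" in args:
            i = args.index("--port")
            if i + 1 < len(args) and args[i + 1] == str(port):
                pids.append(int(fields[0]))
    return pids


def drop_sherpa_db(host, port):
    """Drop the 'sherpa' database left over from a previous run."""
    # Imported lazily so the PID helper works without pymongo installed.
    from pymongo import MongoClient
    client = MongoClient(host, port, serverSelectionTimeoutMS=5000)
    client.drop_database("sherpa")
    client.close()


def cleanup(host="localhost", port=27001):
    """Drop the old database, then stop matching mongod processes."""
    try:
        drop_sherpa_db(host, port)
    except Exception as exc:  # e.g. no server is listening any more
        print(f"could not drop database: {exc}")
    out = subprocess.run(["ps", "-eo", "pid,args"], stdout=subprocess.PIPE,
                         universal_newlines=True, check=True).stdout
    for pid in mongod_pids(out.splitlines(), port):
        os.kill(pid, signal.SIGTERM)  # ask mongod to shut down cleanly
```

Running `cleanup()` before starting a new study would clear the old state. SIGTERM is used rather than SIGKILL so that mongod gets a chance to shut down cleanly.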