gmontamat / gentun

Hyperparameter tuning for machine learning models using a distributed genetic algorithm
Apache License 2.0

Connection failure on second generation #18

Closed ahmadmobeen closed 4 weeks ago

ahmadmobeen commented 4 years ago

@ahmadmobeen I was able to replicate your problem, thanks for reporting it. It's due to pika updating its API. Please use pika==1.1.0 (which should be compatible with your rabbitmq version) and the dev branch code of client.py and server.py. I will merge with master sometime later.

edit: I'm also working on the dockerfile for the gentun server to simplify running the distributed version of the genetic algorithm.

I tried this method and it worked for generation #1.

@gmontamat Thank you for your reply. I was able to run the code by making some small changes in server.py and client.py, as shown here. I tried this and it also worked for generation #1.

I got the following error in both cases:


Using TensorFlow backend.
Initializing a random population. Size: 20
Starting genetic algorithm...

Evaluating generation #1...
 [*] Got fitness for individual 0
 [*] Got fitness for individual 2
 [*] Got fitness for individual 1
 [*] Got fitness for individual 4
 [*] Got fitness for individual 3
 [*] Got fitness for individual 6
 [*] Got fitness for individual 8
 [*] Got fitness for individual 10
 [*] Got fitness for individual 5
 [*] Got fitness for individual 9
 [*] Got fitness for individual 12
 [*] Got fitness for individual 11
 [*] Got fitness for individual 7
 [*] Got fitness for individual 13
 [*] Got fitness for individual 16
 [*] Got fitness for individual 15
 [*] Got fitness for individual 17
 [*] Got fitness for individual 14
 [*] Got fitness for individual 19
 [*] Got fitness for individual 18
Fittest individual is:
{'S_1': '011', 'S_2': '1000010100'}
Fitness value is: 0.9978

Evaluating generation #2...
Traceback (most recent call last):
  File "/media/vip/Program/mobeen/gentun/tests/mnist_server.py", line 23, in <module>
    ga.run(50)
  File "/media/vip/Program/mobeen/gentun/gentun/algorithms.py", line 29, in run
    self.evolve_population()
  File "/media/vip/Program/mobeen/gentun/gentun/algorithms.py", line 72, in evolve_population
    fittest = self.population.get_fittest()
  File "/media/vip/Program/mobeen/gentun/gentun/server.py", line 105, in get_fittest
    self.evaluate_in_parallel()
  File "/media/vip/Program/mobeen/gentun/gentun/server.py", line 113, in evaluate_in_parallel
    RpcClient(None, None, **self.credentials).purge()
  File "/media/vip/Program/mobeen/gentun/gentun/server.py", line 28, in __init__
    self.connection = pika.BlockingConnection(self.parameters)
  File "/home/vip/anaconda3/envs/keras/lib/python3.7/site-packages/pika-1.1.0-py3.7.egg/pika/adapters/blocking_connection.py", line 359, in __init__
    self._impl = self._create_connection(parameters, _impl_class)
  File "/home/vip/anaconda3/envs/keras/lib/python3.7/site-packages/pika-1.1.0-py3.7.egg/pika/adapters/blocking_connection.py", line 450, in _create_connection
    raise self._reap_last_connection_workflow_error(error)
pika.exceptions.ProbableAuthenticationError: ConnectionClosedByBroker: (403) 'ACCESS_REFUSED - Login was refused using authentication mechanism PLAIN. For details see the broker logfile.'

Process finished with exit code 1

Originally posted by @ahmadmobeen in https://github.com/gmontamat/gentun/issues/17#issuecomment-596339798

gmontamat commented 4 years ago

Hi again! I was not able to replicate this issue. I'm using rabbitmq 3.6.10. It looks like a permissions issue. Have you followed the steps in: https://github.com/gmontamat/gentun#basic-rabbitmq-installation-and-setup ? It seems that the server user doesn't have enough permissions to clean up the job queue.

gmontamat commented 4 years ago

@ahmadmobeen to be more specific, try running:

$ sudo rabbitmqctl set_permissions -p / test ".*" ".*" ".*"

This updates the permissions for the test user you're using with the server code.

shehzi-khan commented 4 years ago

Hi again! I was not able to replicate this issue. I'm using rabbitmq 3.6.10. It looks like a permissions issue. Have you followed the steps in: https://github.com/gmontamat/gentun#basic-rabbitmq-installation-and-setup ? It seems that the server user doesn't have enough permissions to clean up the job queue.

Actually, this is not a permissions problem. We set up a new user named test on RabbitMQ and replaced the credentials at the following line: https://github.com/gmontamat/gentun/blob/d852041421ff174ff9437f7a93aeedb121813f0a/tests/mnist_server.py#L20 When the server instance of DistributedPopulation is created, the default user guest is replaced with test, and the first generation runs just fine. However, in subsequent generations the default credentials are not replaced by those of the test user. We checked this by debugging the following two places.

In class RpcClient(object).__init__() https://github.com/gmontamat/gentun/blob/d852041421ff174ff9437f7a93aeedb121813f0a/gentun/server.py#L26

In class DistributedPopulation(Population).__init__() https://github.com/gmontamat/gentun/blob/d852041421ff174ff9437f7a93aeedb121813f0a/gentun/server.py#L95

Both are initialized with default user guest in the 2nd generation.

gmontamat commented 4 years ago

@shehzi-khan thank you very much for reporting this bug. I overlooked this problem (will update tests to cover non-guest rabbitmq user).

gmontamat commented 4 weeks ago

The issue here boiled down to the way a new population was created in the evolve() method:

new_population = self.get_population_type()(
    self.population.get_species(), self.x_train, self.y_train, individual_list=[],
    maximize=self.population.get_fitness_criteria()
)

A method that returns the population type was used to instantiate the new population, so the extra parameters of DistributedPopulation (and of any other subclass) were not passed along. This is now fixed in the refactor: a population.duplicate() method is used when creating a new generation. RabbitMQ is also overkill for the simple message queue gentun needs, so redis is used instead.
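
For readers who want to see the failure mode in isolation, here is a minimal sketch that needs no broker. The classes are hypothetical stand-ins, not gentun's actual code, and the body of duplicate() is an assumption, since the thread only names the method. It shows how rebuilding a population from its type drops subclass-only settings such as credentials, while copying the existing instance keeps them.

```python
import copy


class Population:
    """Hypothetical stand-in for a base population class."""

    def __init__(self, species, individual_list=None, maximize=True):
        self.species = species
        self.individuals = list(individual_list or [])
        self.maximize = maximize

    def duplicate(self):
        """Return an empty population configured like this one (assumed behaviour)."""
        clone = copy.copy(self)  # shallow copy keeps every attribute, subclass ones included
        clone.individuals = []   # the new generation starts without individuals
        return clone


class DistributedPopulation(Population):
    """Hypothetical subclass that carries extra connection settings."""

    def __init__(self, species, individual_list=None, maximize=True,
                 user="guest", password="guest"):
        super().__init__(species, individual_list, maximize)
        self.credentials = {"user": user, "password": password}


population = DistributedPopulation("cnn", ["a", "b"], user="test", password="test")

# Rebuilding from the type forwards only the base-class arguments,
# so the subclass silently falls back to its default credentials.
rebuilt = type(population)(
    population.species, individual_list=[], maximize=population.maximize
)

# Copying the instance carries the subclass settings over.
duplicated = population.duplicate()

print(rebuilt.credentials["user"])     # guest
print(duplicated.credentials["user"])  # test
```

The copy-based approach means the algorithm code does not need to know which keyword arguments each population subclass accepts.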