pgiri / dispy

Distributed and Parallel Computing Framework with / for Python
https://dispy.org

It should be possible to prevent automatic discovery of nodes... #157

Open UnitedMarsupials-zz opened 6 years ago

UnitedMarsupials-zz commented 6 years ago

In our environment, the jobs rely heavily on many different aspects of the local configuration, including versions of the (non-Python) software, which dispy cannot -- and is not asked to -- manage.

Unfortunately, because Dev, QA, and Production machines are sometimes on the same network and can hear each other's broadcasts, we've had "cross-pollination" -- a server from Dev, for example, being "discovered" and automatically added to a QA cluster.

At best, this causes unwelcome errors:

2018-11-06 14:48:46 dispy - Ignoring pulse message from 10.78.16.162
2018-11-06 14:58:47 dispy - Ignoring pulse message from 10.78.16.162
2018-11-06 15:08:47 dispy - Ignoring pulse message from 10.78.16.162
2018-11-06 15:18:47 dispy - Ignoring pulse message from 10.78.16.162
2018-11-06 15:28:47 dispy - Ignoring pulse message from 10.78.16.162
2018-11-06 15:38:47 dispy - Ignoring pulse message from 10.78.16.162
2018-11-06 15:48:47 dispy - Ignoring pulse message from 10.78.16.162
2018-11-06 15:48:47 dispy - Discovered 10.78.16.162:51348 (r00cb6n0c) with 16 cpus
2018-11-06 15:48:49 dispy - Running job 51310424 on 10.78.16.162
2018-11-06 15:48:49 dispy - Running job 100000160 / 51310424 on 10.78.16.162 (busy: 1 / 1)
2018-11-06 15:49:01 dispy - Received reply for job 100000160 / 51310424 from 10.78.16.162
2018-11-06 15:49:01 dispy - Job 100000160 on 10.78.16.162: Traceback (most recent call last):
2018-11-06 15:49:01 dispy - Closing node 10.78.16.162 for processTask / 1541533183926
2018-11-06 15:49:01 dispy - Running job 51311960 on 10.78.16.162
2018-11-06 15:49:01 dispy - Failed to run 51311960 on 10.78.16.162: bytearray(b'NAK (invalid computation 1541533183926)')
2018-11-06 15:49:01 dispy - Failed to run job 51311960 on 10.78.16.162 for computation processTask

At worst, it could introduce subtle inaccuracies because of the configuration differences.

Though automatic discovery is a great feature in general, it should be possible to disable it and rely only on the list of nodes explicitly passed to JobCluster.

pgiri commented 5 years ago

You can use a secret to partition nodes, so that clients with a secret can only use nodes with a matching secret. For example, start nodes in QA with --secret=qa and QA clients with secret='qa'; QA nodes then can't be used by other clients.
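A minimal sketch of this suggestion, assuming dispy's --secret node option and JobCluster's secret keyword (the names compute and run_qa_jobs, and the specific job submitted, are hypothetical):

```python
def compute(n):
    """Placeholder job; real jobs would depend on local (non-Python) configuration."""
    return n * n

def run_qa_jobs():
    """Submit work to QA nodes only (not called here; requires live dispy nodes).

    Each QA node would first be started with a matching secret:
        dispynode.py --secret=qa
    """
    import dispy
    # A client created with secret='qa' can only discover and use nodes
    # that were started with --secret=qa; Dev/Production broadcasts with a
    # different (or no) secret are ignored.
    cluster = dispy.JobCluster(compute, secret='qa')
    job = cluster.submit(4)
    result = job()      # blocks until a matching QA node runs the job
    cluster.close()
    return result
```

This partitions by shared secret rather than by node list, so it holds up only as long as every environment's nodes are started with the right secret.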

Alternately, you can sub-class NodeAllocate and override its allocate method to customize node allocation. This can be used to allocate only the desired nodes (e.g., by returning 0 for any node not wanted).
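A sketch of that approach, restricting a cluster to an explicit allow-list by returning 0 CPUs for any other node. The allocate() parameter list here is assumed from dispy 4.x (cluster, ip_addr, name, cpus, ...); check it against your installed version, and note the addresses below are hypothetical:

```python
QA_NODES = {'10.78.16.10', '10.78.16.11'}  # hypothetical QA node addresses

def cpus_for(ip_addr, cpus, allowed=QA_NODES):
    """CPUs to allocate on a discovered node; 0 rejects the node entirely."""
    return cpus if ip_addr in allowed else 0

try:
    import dispy

    class AllowListAllocate(dispy.NodeAllocate):
        """NodeAllocate that ignores any node outside the allow-list."""
        def allocate(self, cluster, ip_addr, name, cpus, *args, **kwargs):
            return cpus_for(ip_addr, cpus)
except ImportError:
    pass  # dispy not installed here; cpus_for above still shows the filter
```

A cluster would then be created with nodes built from this class (e.g., nodes=[AllowListAllocate(host) for host in QA_NODES]), so a stray Dev node that answers a broadcast is allocated zero CPUs and never runs jobs.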

UnitedMarsupials-zz commented 5 years ago

Yes, this would work for us -- unless someone brings up a node with a misconfigured "secret". But I would have thought that disabling the client's promiscuity -- simply not listening to announcements -- would be a trivial flag to add...

pgiri commented 5 years ago

If it is trivial, submit a patch.