pgiri / dispy

Distributed and Parallel Computing Framework with / for Python
https://dispy.org

Computations crashing out - 'NoneType' object is not iterable (v4.12.3, python3.6.8) #217

Closed lewisoshaughnessy closed 3 years ago

lewisoshaughnessy commented 3 years ago

Hi guys,

I'm trying to set up a very simple proof-of-concept program to demonstrate dispy. I have my Windows 10 machine (192.168.254.202) and my network-bridged Ubuntu VM (192.168.254.157). The VM is running dispyscheduler.py and dispynode.py using

python3 dispynode.py -d --clean --ip_addr="192.168.254.157"

and

python3 dispyscheduler.py -d --ip_addr="192.168.254.157"

And the Windows machine is running a basic compute.py script which invokes the SharedJobCluster and basically tells the server to wait/sleep in increasingly longer steps:

def compute(sleep_time):
    import time
    import socket
    time.sleep(sleep_time)
    host_name = socket.gethostname()
    return host_name, sleep_time

if __name__ == '__main__':
    import dispy
    import random
    import socket
    import time
    cluster = dispy.SharedJobCluster(compute, scheduler_node="192.168.254.157", nodes=["192.168.254.157"], ext_ip_addr="192.168.254.202", ip_addr="192.168.254.202")
    jobs = []
    for i in range(16):
        time.sleep(5)  # included this as recommended by another forum, but didn't fix the problem
        job = cluster.submit(random.randint(5, 20))
        job.id = i
        jobs.append(job)
    cluster.wait()
    for job in jobs:
        host, sleep_time = job()
        print('%s executed job %s at %s with %s' % (host, job.id, job.start_time, sleep_time))
    cluster.print_status()
    cluster.close()

I first run the scheduler, then the node (both on VM). The node is discovered by the scheduler, then I run my compute.py script. As soon as this happens, the scheduler discovers the incoming computation, and submits it to the single dispynode. When this happens the dispynode console throws an exception:

2020-12-11 01:41:52 dispynode - version: 4.12.3 (Python 3.6.8), PID: 2823
2020-12-11 01:41:52 dispynode - Files will be saved under "/tmp/dispy/node"
2020-12-11 01:41:52 pycos - version 4.10.0 with epoll I/O notifier
2020-12-11 01:41:52 dispynode - "ubuntu" serving 1 cpus
2020-12-11 01:41:52 dispynode - TCP server at 192.168.254.157:61591

Enter "quit" or "exit" to terminate dispynode,
  "stop" to stop service, "start" to restart service,
  "release" to check and close computation,
  "cpus" to change CPUs used, anything else to get status: 

2020-12-11 01:42:30 dispynode - New computation "9f5a303a29dc28dfe03d4c8b304924d702840b50" from 192.168.254.157
2020-12-11 01:42:30 pycos - uncaught exception in tcp_req/140461292510552:
Traceback (most recent call last):
  File "dispynode.py", line 1721, in tcp_req
    client, resp = yield setup_computation(msg, task=task)
TypeError: 'NoneType' object is not iterable
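
For readers unfamiliar with this message: it is what Python reports when an assignment that unpacks into two names receives None instead of a pair. A minimal sketch in plain Python, with no dispy or pycos involved (the function name `returns_nothing` is hypothetical and only stands in for a call that was expected to produce a pair):

```python
def returns_nothing():
    # Hypothetical stand-in for a call expected to produce a
    # (client, resp) pair that produces None instead.
    return None


def unpack_result():
    try:
        client, resp = returns_nothing()  # unpacking None raises TypeError
    except TypeError as exc:
        return type(exc).__name__, str(exc)
    return "ok", ""


error_name, error_text = unpack_result()
print(error_name, error_text)
```

The exact wording of the message differs between Python releases, but the exception type is always TypeError and the text mentions NoneType.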

The scheduler continues to submit all 16 computations as normal:

2020-12-11 01:41:39 dispyscheduler - version: 4.12.3 (Python 3.6.8), PID: 2816
2020-12-11 01:41:39 pycos - version 4.10.0 with epoll I/O notifier
Enter "quit" or "exit" to terminate scheduler, anything else to get status: 2020-12-11 01:41:39 dispyscheduler - TCP server at 192.168.254.157:61590
2020-12-11 01:41:39 dispyscheduler - Scheduler at 192.168.254.157:61592
2020-12-11 01:41:54 dispyscheduler - Discovered 192.168.254.157:61591 (ubuntu) with 1 cpus
2020-12-11 01:42:30 dispyscheduler - New computation 140589834931504: compute, /tmp/dispy/scheduler/192.168.254.202/compute_orx13j0d
2020-12-11 01:42:35 dispyscheduler - Submitted job 140589834801464 / 1607679755.3374403
2020-12-11 01:42:40 dispyscheduler - Submitted job 140589833484472 / 1607679760.35386
2020-12-11 01:42:45 dispyscheduler - Submitted job 140589824788304 / 1607679765.369267
2020-12-11 01:42:50 dispyscheduler - Submitted job 140589824788424 / 1607679770.3848724
2020-12-11 01:42:55 dispyscheduler - Submitted job 140589824788544 / 1607679775.4004657
2020-12-11 01:43:00 dispyscheduler - Submitted job 140589824788664 / 1607679780.4161506
2020-12-11 01:43:05 dispyscheduler - Submitted job 140589824788784 / 1607679785.431747
2020-12-11 01:43:10 dispyscheduler - Submitted job 140589824788904 / 1607679790.4473479
2020-12-11 01:43:15 dispyscheduler - Submitted job 140589824789024 / 1607679795.4600875
2020-12-11 01:43:20 dispyscheduler - Submitted job 140589824789144 / 1607679800.472686
2020-12-11 01:43:25 dispyscheduler - Submitted job 140589824789264 / 1607679805.4852734
2020-12-11 01:43:30 dispyscheduler - Submitted job 140589824789384 / 1607679810.4890156
2020-12-11 01:43:35 dispyscheduler - Submitted job 140589824789504 / 1607679815.502499
2020-12-11 01:43:40 dispyscheduler - Submitted job 140589824789624 / 1607679820.5180833
2020-12-11 01:43:45 dispyscheduler - Submitted job 140589824789744 / 1607679825.5335934
2020-12-11 01:43:50 dispyscheduler - Submitted job 140589824789864 / 1607679830.5510707

I can't find any reference to this exact error, although I saw that the client, resp = yield setup_computation(msg, task=task) line has caused issues before. I have a feeling it's either a firewall issue or I haven't properly configured the node and/or scheduler, or a bug ;)

Here is a list of things I have done to try and fix:

If there's any more information I can give to help sort this I'm more than happy to oblige! I'd appreciate any information at all on this. I'm a big fan of the project and really want to get it working! Cheers all.

pgiri commented 3 years ago

Is it possible to test with a later version of Python, either 3.7 or 3.8? I have looked through the setup_computation function and I don't see how it can return None as the trace above shows. If you can't try a later Python, can you change 'dispynode.py' line 1496 (the end of setup_computation) so it reads:

dispynode_logger.info('setup finished')
raise StopIteration((None, 'ACK'))

That is, the current line, raise StopIteration(None, 'ACK'), is replaced by the two lines above. Note that the replacement uses an explicit tuple. I think this may be the problem with 3.6.8, although I thought I had tested it at the time (but I'm not sure).
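
The difference an explicit tuple makes can be seen with the standard library alone. This sketch only inspects the two exception objects; how pycos itself reads them is not shown here and is not assumed.

```python
# Two positional arguments: .args holds both, .value holds only the first.
two_args = StopIteration(None, 'ACK')

# One argument that is itself a tuple: .value is the whole pair.
one_tuple = StopIteration((None, 'ACK'))

print(two_args.args, two_args.value)
print(one_tuple.args, one_tuple.value)
```

With the two-argument form, the `value` attribute is None rather than a pair, which is one way a None could reach a two-name unpacking. Whether that is what happened in this report is not established.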

pgiri commented 3 years ago

I tested with Python 3.6.8 and it works as expected. Can you run the following program and show the output:

import pycos

def g(task=None):
    yield task.sleep(1)
    raise StopIteration(1, 2)

def f(task=None):
    res = yield g(task=task)
    pycos.logger.info('res: %s / %s', type(res), str(res))

pycos.Task(f)

lewisoshaughnessy commented 3 years ago

Hi pgiri,

Thank you for your response! I will test this when I get in to the office Monday.

Really appreciate the help. Will let you know how it goes :)

Lewis.

lewisoshaughnessy commented 3 years ago

Hi pgiri,

Following your advice, I attempted to use a later version of Python but ran into pickle version errors. Reverting to 3.6.8, I decided to run the scheduler, node and compute.py on the same machine. After an ambiguous error from the scheduler ("Ignored message"), I set the compute.py port (using client_port in the SharedJobCluster) to 0 (random port) and everything started working for the first time ever!
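
As an aside on what port 0 means: at the socket level, binding to port 0 asks the operating system to pick any free port. The sketch below shows that with the standard library only; it illustrates the OS behaviour and makes no claim about how dispy handles client_port internally. The helper name `bind_to_any_free_port` is hypothetical.

```python
import socket


def bind_to_any_free_port(host="127.0.0.1"):
    # Port 0 tells the OS to choose an unused port for this socket.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    return sock


sock = bind_to_any_free_port()
assigned_port = sock.getsockname()[1]  # the port the OS actually chose
print("OS assigned port", assigned_port)
sock.close()
```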

I can surmise that running across Windows and Linux causes some conflicts with the versions I'm running, but when everything runs on the Windows machine it works flawlessly. So I can confirm that 4.12.3 and Python 3.6.8 work on Windows (with the pickle version patch).

Thank you for your time and effort, this is a great project.

Lewis.

pgiri commented 3 years ago

In case others encounter these issues: the pickle error is due to a change of pickle protocol in newer Python versions (protocol 5 was added in Python 3.8). If dispynode, the client (and the scheduler, in the case of SharedJobCluster) are all run with the same Python version (e.g., 3.7), it should work. There is a way to interoperate between different Python versions by setting PickleProtocolVersion in pycos's configuration (see the pycos documentation).
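
The underlying idea, pinning a protocol that every participating interpreter can read, can be sketched with the standard pickle module alone. This is not pycos's configuration mechanism; the constant name `SHARED_PROTOCOL` and the sample payload are assumptions made for illustration.

```python
import pickle

# Sample payload made up for this sketch.
payload = {"host": "ubuntu", "sleep_time": 7}

# The highest protocol depends on the interpreter running this code.
print("highest protocol here:", pickle.HIGHEST_PROTOCOL)

# Protocol 3 can be read by every Python 3 release, so pinning it
# avoids "unsupported pickle protocol" errors between versions.
SHARED_PROTOCOL = 3
data = pickle.dumps(payload, protocol=SHARED_PROTOCOL)

# For protocols 2 and later, the second byte records the protocol used.
header_protocol = data[1]
restored = pickle.loads(data)
print(header_protocol, restored)
```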

dispy should work fine in mixed environments (e.g., dispynode under Linux and the client under Windows) when exchanging native Python objects. If the data exchanged involves user objects, it may not work. However, pycos should work in a mixed environment with user objects. If there is an issue with dispy in a mixed environment, it should be easy to fix. I will look into it after the next version of pycos is released.