ipython / ipyparallel

IPython Parallel: Interactive Parallel Computing in Python
https://ipyparallel.readthedocs.io/

ERROR | DB Error updating record '9472f58e-6f89-46a8-be69-10e09da7e2e1' #203

Open parashardhapola opened 7 years ago

parashardhapola commented 7 years ago

Hi,

I get the following error when I try to run a simple test job:

Traceback (most recent call last):
File "/home/parashar/anaconda3/lib/python3.5/site-packages/ipyparallel/controller/hub.py", line 687, in save_queue_result
self.db.update_record(msg_id, result)
File "/home/parashar/anaconda3/lib/python3.5/site-packages/ipyparallel/controller/dictdb.py", line 232, in update_record
raise KeyError("Record %r has been culled for size" % msg_id)
KeyError: "Record '9472f58e-6f89-46a8-be69-10e09da7e2e1' has been culled for size"

I ran the following code:

from ipyparallel import Client

def get_square(num):
    return num**2

_RC = Client()
_DVIEW = _RC[:]  # direct view over all engines
ar = _DVIEW.map_async(get_square, range(100000000))
ar.wait_interactive()

I have started 30 engines on each of two different hosts by running ipcluster engines -n 30, and I ran ipcontroller --ip="*" on the host running the Jupyter notebook. The wait_interactive output hangs at 59/60.

Please check if this error can be reproduced.

minrk commented 7 years ago

Thanks, I'll investigate.

littlegreenbean33 commented 7 years ago

Similar issue here.

I am on version 5.2 with an Anaconda installation.

per
    return fn(*args, **kwargs)
  File "/home/julian/anaconda3/lib/python3.5/site-packages/ipyparallel/controller/scheduler.py", line 325, in <lambda>
    lambda : self.handle_stranded_tasks(uid),
  File "/home/julian/anaconda3/lib/python3.5/site-packages/ipyparallel/controller/scheduler.py", line 335, in handle_stranded_tasks
    for msg_id in lost.keys():
RuntimeError: dictionary changed size during iteration
2017-02-25 17:08:58.400 [IPControllerApp] task::task 'e7647038-edff-4814-8939-84afced09336' finished on 7
2017-02-25 17:08:58.401 [IPControllerApp] ERROR | DB Error saving task request 'e7647038-edff-4814-8939-84afced09336'
Traceback (most recent call last):
  File "/home/julian/anaconda3/lib/python3.5/site-packages/ipyparallel/controller/hub.py", line 794, in save_task_result
    self.db.update_record(msg_id, result)
  File "/home/julian/anaconda3/lib/python3.5/site-packages/ipyparallel/controller/dictdb.py", line 232, in update_record
    raise KeyError("Record %r has been culled for size" % msg_id)
KeyError: "Record 'e7647038-edff-4814-8939-84afced09336' has been culled for size"
2017-02-25 17:08:58.402 [IPControllerApp] task::task 'fce1ddc0-c360-43eb-902b-0477bd259dba' finished on 8
2017-02-25 17:08:58.402 [IPControllerApp] ERROR | DB Error saving task request 'fce1ddc0-c360-43eb-902b-0477bd259dba'
Traceback (most recent call last):
  File "/home/julian/anaconda3/lib/python3.5/site-packages/ipyparallel/controller/hub.py", line 794, in save_task_result
    self.db.update_record(msg_id, result)
  File "/home/julian/anaconda3/lib/python3.5/site-packages/ipyparallel/controller/dictdb.py", line 232, in update_record

The controller runs on Linux, while the clients run on a variety of Linux/Windows machines.

jayzed82 commented 7 years ago

Hi. Any updates on this issue? I'm having the same problem sometimes.

littlegreenbean33 commented 7 years ago

@jayzed82

My issue was my fault: I was sending more than 1024 tasks in parallel. You need to change the limit manually if you want to go beyond it.

Have you checked whether you are trying to fill the queue with more than 1024 tasks?

jayzed82 commented 7 years ago

Thank you @littlegreenbean33. That is my problem: I have a queue longer than 1024 tasks. I didn't know there was a limit. How do you increase it?

littlegreenbean33 commented 7 years ago

Look for 1024, or for the text of the error message, in the source code. You will find informative comments there as well. There is a balance to strike with regard to memory usage, and 1024 probably sounded like a good number.
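
For example, assuming the 1024 constant in question is DictDB.record_limit in ipyparallel/controller/dictdb.py, something along these lines in the controller's config file should raise it without editing the source:

# Sketch only: trait names assumed from ipyparallel/controller/dictdb.py.
# Put this in ~/.ipython/profile_default/ipcontroller_config.py (or the
# profile your controller actually uses).
c = get_config()

# DictDB culls old records once more than record_limit results are cached
# (default 1024) or once size_limit bytes of result buffers are held
# (default 1 GB).
c.DictDB.record_limit = 10240
c.DictDB.size_limit = 2 * 1024**3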

kostrykin commented 7 years ago

@littlegreenbean33 I'm not quite sure what you mean. Can you point us to it more specifically?

And what actually happens when we encounter these ERROR | DB Error saving task request messages? Are the computation results going to be faulty, and hence useless? If so, why isn't a warning or something more visible shown on the client side, i.e., in IPython/Jupyter? Or can that "error" just be ignored because the hub somehow handles it magically?

littlegreenbean33 commented 7 years ago

If your task queue grows above 1024 tasks, bad things happen. Don't ignore the error. It means tasks won't be performed.

kostrykin commented 7 years ago

So does this actually mean that IPyParallel cannot have more than 1024 tasks queued? Then why is there no error, or at least a warning? If you run ipcluster in --daemon mode, you won't even notice it! And how can we lift that limit?

I've just run a quick test to see what happens if I submit more than 1024 tasks. In this test I only have a single engine, so the task queue should be about 2047 tasks long before the first task finishes:

import ipyparallel as ipp
import numpy as np

ipp_client = ipp.Client()
ipp_client[:].use_dill().get()

def f(ms):
    def _f(x):
        if ms > 0:
            import time
            time.sleep(ms * 1e-3)
        return x * 2
    return _f

data   = range(2048)
result = ipp_client[:].map(f(100), data).get()
print(np.allclose(result, list(map(f(0), data))))  # list() so numpy gets a sequence on Python 3

This works like a charm. How does that square with your statement that the task queue cannot grow beyond 1024 tasks? @littlegreenbean33

minrk commented 7 years ago

It means tasks won't be performed.

It does not mean that. During normal use, this error does not affect execution or results. The only thing affected is the result cache in the Hub, which can be used for delayed retrieval by message id. If you are not using delayed retrieval (client.get_result(msg_ids) instead of asyncresult.get()), there should be no user-visible effect.
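
To make the distinction concrete, here is a minimal sketch of the two retrieval paths, assuming a running cluster (e.g. started with ipcluster start -n 2):

import ipyparallel as ipp

def double(x):
    return x * 2

rc = ipp.Client()
view = rc.load_balanced_view()

ar = view.apply_async(double, 21)
msg_ids = ar.msg_ids  # keep only the ids instead of holding the AsyncResult

# Normal path: the client that submitted the task already holds the result,
# so culled Hub records don't matter here.
print(ar.get())

# Delayed retrieval: ask the Hub's database for the result by id. This is the
# path that fails once the record has been culled from the DictDB cache.
print(rc.get_result(msg_ids).get())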

The default cache of results in the Hub is an in-memory DictDB, with a few limits. You can increase those limits, or tell the controller to use sqlite or mongodb to store these things out of memory. If you aren't using delayed retrieval at all, you can use NoDB to disable result caching entirely.
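
Concretely, the backend can be switched in ipcontroller_config.py (record_limit/size_limit can also be raised there, as sketched earlier). A sketch, with the dotted class paths assumed from the ipyparallel source layout:

# Sketch only: put in ipcontroller_config.py.
c = get_config()

# Store results out of the Hub's memory:
c.HubFactory.db_class = 'ipyparallel.controller.sqlitedb.SQLiteDB'
# or: c.HubFactory.db_class = 'ipyparallel.controller.mongodb.MongoDB'

# Or, if delayed retrieval is never used, drop the result cache entirely:
# c.HubFactory.db_class = 'ipyparallel.controller.dictdb.NoDB'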

kostrykin commented 7 years ago

Thanks a lot for that clarification, @minrk.