lijiang2014 / thht

Tian He Throughput Computing

Test Celery v4.0 #8

Closed lijiang2014 closed 7 years ago

lijiang2014 commented 7 years ago

Test Celery v4.0, in particular its memory-control options.

lijiang2014 commented 7 years ago

--max-tasks-per-child — this setting (which should recycle a pool process after it has executed the given number of tasks) does not seem to take effect here? Its usefulness is also unclear.

lijiang2014 commented 7 years ago

--max-memory-per-child logs "Max memory per child setting worker unable to determine worker memory usage". Probably because the real work runs in a Popen subprocess, the worker cannot account for the memory properly, so in practice this setting does not take effect.

lijiang2014 commented 7 years ago

--autoscale=10,3 (always keep 3 processes, but grow to 10 if necessary). This one does take effect and seems quite useful. When set to --autoscale=x,1, at least 2 celery processes remain when idle; one of them should be the supervisor process.

This is a nice feature that allows a more flexible load strategy: set --autoscale=max,min. For large tasks, min can be set to 1, and max can be computed from the free memory and the per-task memory.

A more advanced option would be to extend Celery's supervisor process and strengthen its dynamic scheduling policy (mainly: stop adding workers once the load on a single node is already high).
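The "max from free memory and task memory" idea above could be sketched as follows. This is a minimal sketch, not code from the repo: the per-task memory estimate `PER_TASK_MEM_MB` and the hard cap are assumed values, and free memory is read from `/proc/meminfo`.

```python
#!/usr/bin/env python3
"""Sketch: derive the --autoscale upper bound from free memory."""

PER_TASK_MEM_MB = 2048  # hypothetical per-task memory footprint


def free_mem_mb(meminfo_path="/proc/meminfo"):
    """Return MemAvailable (falling back to MemFree) in MB."""
    fields = {}
    with open(meminfo_path) as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])  # /proc/meminfo values are in kB
    return fields.get("MemAvailable", fields.get("MemFree", 0)) // 1024


def autoscale_max(free_mb, task_mb=PER_TASK_MEM_MB, hard_cap=32):
    """max = free memory // per-task memory, at least 1, capped."""
    return max(1, min(hard_cap, free_mb // task_mb))


if __name__ == "__main__":
    # The result could then be passed on the command line as
    #   celery worker --autoscale=<mx>,1
    print("--autoscale=%d,1" % autoscale_max(free_mem_mb()))
```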

lijiang2014 commented 7 years ago

kill worker :

pkill -9 -f 'celery worker'

app.control.broadcast('shutdown')  # shut down all workers
app.control.broadcast('shutdown', destination=['worker1@example.com'])  # a specific worker

This can be called at the end of the main program.

lijiang2014 commented 7 years ago

Resource scale-out can be requested by calling yhbatch from inside a yhbatch job. If too many tasks are queued and there are no idle workers, and conditions allow, submit a yhbatch script that automatically requests new node resources to run workers. This new batch of workers consumes a lower-priority queue, which guarantees the original resources are saturated before the new ones are used.

But how to shrink the resources afterwards? And how to set the priorities?
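The scale-out trigger described above could look roughly like this. Everything here is an assumed sketch: the script name `start_worker.sh`, the thresholds, and the queue name are hypothetical; `yhbatch` is the site's batch-submit command as mentioned in the comment.

```python
#!/usr/bin/env python3
"""Sketch: decide when to submit an extra worker node via yhbatch."""
import subprocess


def should_scale_out(queued_tasks, idle_workers,
                     max_extra_nodes, used_extra_nodes):
    """Scale out only when tasks are waiting, no worker is idle,
    and we are still under the extra-node budget."""
    return (queued_tasks > 0
            and idle_workers == 0
            and used_extra_nodes < max_extra_nodes)


def submit_worker_node(script="start_worker.sh"):
    """Submit a batch script that starts a worker bound to a
    lower-priority queue (e.g. `celery worker -Q low_priority`)."""
    subprocess.check_call(["yhbatch", script])
```

Usage would be something like: `if should_scale_out(12, 0, 4, 1): submit_worker_node()`. The shrink side (draining and shutting down the extra workers) is the open question noted above.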

lijiang2014 commented 7 years ago

Celery's worker module leaves plenty of room for enhancement!

lijiang2014 commented 7 years ago

Tested the memory-overload case: the worker raises an error:

[2016-12-08 16:26:05,655: ERROR/PoolWorker-10] Task ht_celery.tasks.run_command[6be8723a-65c1-4228-9d35-160433cdd70e] raised unexpected: Exception('Exit at code:44',)
Traceback (most recent call last):
  File "/HOME/nscc-gz_jiangli/.virtualenvs/py34env-rh6.5/lib/python3.4/site-packages/celery/app/trace.py", line 368, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/HOME/nscc-gz_jiangli/.virtualenvs/py34env-rh6.5/lib/python3.4/site-packages/celery/app/trace.py", line 623, in __protected_call__
    return self.run(*args, **kwargs)
  File "/WORK/app/thht/thht/ht_celery/tasks.py", line 53, in run_command
    raise self.retry(exc = Exception("Exit at code:"  + str(retcode)  ) )
  File "/HOME/nscc-gz_jiangli/.virtualenvs/py34env-rh6.5/lib/python3.4/site-packages/celery/app/task.py", line 661, in retry
    raise_with_context(exc)
  File "/HOME/nscc-gz_jiangli/.virtualenvs/py34env-rh6.5/lib/python3.4/site-packages/celery/utils/serialization.py", line 267, in raise_with_context
    _raise_with_context(exc, exc_info[1])
  File "<string>", line 1, in _raise_with_context
Exception: Exit at code:44

That is, exit code 44. After that there is a flood of:

[2016-12-08 16:26:28,383: ERROR/PoolWorker-19] Pool process <celery.concurrency.asynpool.Worker object at 0x2b162bea0630> error: BrokenPipeError(32, 'Broken pipe')
Traceback (most recent call last):
  File "/HOME/nscc-gz_jiangli/.virtualenvs/py34env-rh6.5/lib/python3.4/site-packages/billiard/pool.py", line 363, in workloop
    put((READY, (job, i, result, inqW_fd)))
  File "/HOME/nscc-gz_jiangli/.virtualenvs/py34env-rh6.5/lib/python3.4/site-packages/billiard/queues.py", line 366, in put
    self.send_payload(ForkingPickler.dumps(obj))
  File "/HOME/nscc-gz_jiangli/.virtualenvs/py34env-rh6.5/lib/python3.4/site-packages/billiard/queues.py", line 358, in send_payload
    self._writer.send_bytes(value)
  File "/HOME/nscc-gz_jiangli/.virtualenvs/py34env-rh6.5/lib/python3.4/site-packages/billiard/connection.py", line 229, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/HOME/nscc-gz_jiangli/.virtualenvs/py34env-rh6.5/lib/python3.4/site-packages/billiard/connection.py", line 455, in _send_bytes
    self._send(header + buf)
  File "/HOME/nscc-gz_jiangli/.virtualenvs/py34env-rh6.5/lib/python3.4/site-packages/billiard/connection.py", line 408, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

That is, a BrokenPipeError occurs and the worker fails.

So exceeding memory makes the program fail, which triggers a retry. When there are too many workers, the problem keeps recurring!
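The failure path in the traceback can be reproduced in miniature. This is a sketch, not the repo's `tasks.py`: the Celery `self.retry` machinery is omitted, leaving only the run-and-raise-on-nonzero-exit behavior that produces the "Exit at code:44" message.

```python
#!/usr/bin/env python3
"""Sketch: run a shell command, raising on a nonzero exit code,
mimicking the 'Exit at code:44' error seen in the traceback."""
import subprocess


def run_command(cmd):
    """Run `cmd` through the shell; on failure raise an Exception
    carrying the exit code (an OOM-killed child typically exits
    with a nonzero code like this)."""
    retcode = subprocess.call(cmd, shell=True)
    if retcode != 0:
        raise Exception("Exit at code:" + str(retcode))
    return retcode
```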

lijiang2014 commented 7 years ago

Tested the CPU-oversubscription case (more worker processes than cores):

No errors occur, but actual performance falls below what the concurrency setting would suggest.

lijiang2014 commented 7 years ago

cat /proc/loadavg

[root@opendigest root]# uptime
  7:51pm  up 2 days, 5:43,  2 users,  load average: 8.13, 5.90, 4.94

The last part of the output shows the average number of processes in the run queue over the past 1, 5, and 15 minutes.
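The same load averages are available from Python, which would let a supervisor stop adding workers on a busy node as suggested earlier. A minimal sketch; the busy threshold (defaulting to the core count) is an assumption.

```python
#!/usr/bin/env python3
"""Sketch: read 1/5/15-minute load averages and flag a busy node."""
import os


def node_is_busy(threshold=None):
    """True if the 1-minute load average exceeds the threshold
    (defaulting to the number of CPU cores)."""
    load1, load5, load15 = os.getloadavg()
    limit = threshold if threshold is not None else os.cpu_count()
    return load1 > limit
```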