microsoft / aerial_wildlife_detection

Tools for detecting wildlife in aerial images using active learning
MIT License

Celery worker raising IsADirectoryError upon running a task #53

Closed ryan-ntt closed 2 years ago

ryan-ntt commented 2 years ago

Hi all,

I'd appreciate some help configuring the Celery worker service. Celery appears to pass the project working directory as the pidfile path to a pidfile-checking method in multi.py, which then raises an error. The full traceback is below:


Traceback (most recent call last):
  File "/home/azureuser/anaconda3/envs/aide/bin/celery", line 8, in <module>
    sys.exit(main())
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/__main__.py", line 15, in main
    sys.exit(_main())
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/bin/celery.py", line 213, in main
    return celery(auto_envvar_prefix="CELERY")
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/bin/base.py", line 133, in caller
    return f(ctx, *args, **kwargs)
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/bin/multi.py", line 480, in multi
    return cmd.execute_from_commandline(args)
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/bin/multi.py", line 271, in execute_from_commandli>
    return self.call_command(argv[0], argv[1:])
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/bin/multi.py", line 278, in call_command
    return self.commands[command](*argv) or EX_OK
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/bin/multi.py", line 148, in _inner
    return fun(self, *args, **kwargs)
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/bin/multi.py", line 166, in _inner
    return fun(self, cluster, sig, **kwargs)
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/bin/multi.py", line 303, in stopwait
    return cluster.stopwait(sig=sig, **kwargs)
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/apps/multi.py", line 448, in stopwait
    return self._stop_nodes(retry=retry, on_down=callback, sig=sig)
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/apps/multi.py", line 452, in _stop_nodes
    nodes = list(self.getpids(on_down=on_down))
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/apps/multi.py", line 494, in getpids
    if node.pid:
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/apps/multi.py", line 260, in pid
    return Pidfile(self.pidfile).read_pid()
  File "/home/azureuser/anaconda3/envs/aide/lib/python3.8/site-packages/celery/platforms.py", line 168, in read_pid
    with open(self.path) as fh:
IsADirectoryError: [Errno 21] Is a directory: '/home/azureuser/aerial_wildlife_detection'
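
The last frame of the traceback can be reproduced outside Celery. The sketch below is not Celery's code: `read_pid_naive` and `resolve` are hypothetical helpers that only show two facts on Linux, namely that `open()` on a directory raises this exception and that an empty relative path resolves to the working directory itself. Whether an empty pidfile value is what produced the directory path in this report is an assumption, not something the thread confirms.

```python
import os
import tempfile


def read_pid_naive(path):
    """Hypothetical stand-in for a pidfile reader: open the path and parse an int."""
    with open(path) as fh:
        return int(fh.read().strip())


def resolve(path, workdir):
    """Resolve a possibly empty relative path against a working directory."""
    return os.path.normpath(os.path.join(workdir, path))


with tempfile.TemporaryDirectory() as workdir:
    # A real pidfile path reads fine.
    pidfile = os.path.join(workdir, "aide.pid")
    with open(pidfile, "w") as fh:
        fh.write("1234\n")
    good = read_pid_naive(resolve("aide.pid", workdir))

    # An empty path resolves to the directory itself, and open() rejects it.
    try:
        read_pid_naive(resolve("", workdir))
        error_name = None
    except IsADirectoryError as exc:
        error_name = type(exc).__name__

print(good, error_name)
```

If this matches what is happening, the value handed to the pidfile option in each service command would be the first thing to inspect.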

Our environment file /etc/default/celeryd_aide is as follows:

CELERYD_NODES="aide@%h"
CELERY_BIN="/home/azureuser/anaconda3/envs/aide/bin/celery"
CELERY_APP="celery_worker"
CELERYD_CHDIR="/home/azureuser/aerial_wildlife_detection"
CELERYD_USER="aide_celery"
CELERYD_GROUP="aide"
CELERYD_LOG_LEVEL="INFO"
CELERYD_PID_FILE="/var/run/celery/%n.pid"
CELERYD_LOG_FILE="/var/log/celery/celeryd_aide.log"
CELERYBEAT_PID_FILE="/tmp/celeryd_aide_beat.pid"
CELERYBEAT_PID_FILE="/var/run/celery/beat.pid"
CELERYBEAT_LOG_FILE="/var/log/celery/celeryd_aide_beat.log"
CELERYD_OPTS=""
CELERY_CREATE_DIRS=1
CELERYBEAT_CHDIR="/home/azureuser/aerial_wildlife_detection"
CELERYBEAT_OPTS="-s /tmp"

# AIDE environment variables
AIDE_MODULES=LabelUI,AIController,AIWorker,FileServer
PYTHONPATH=/home/azureuser/aerial_wildlife_detection

The systemd service file is as follows:

[Unit]
Description=Celery Service for AIDE AIWorker
After=network.target
After=rabbitmq-server.service
After=redis.service
After=postgresql.service

[Service]
Type=forking
User=aide_celery
Group=aide
EnvironmentFile=/etc/default/celeryd_aide
WorkingDirectory=/home/azureuser/aerial_wildlife_detection
ExecStart=/bin/sh -c '${CELERY_BIN} -A $CELERY_APP multi start $CELERYD_NODES     --pidfile=${CELERYD_PID_FILE} --logfile=${CELERYD_LOG_FILE}     --loglevel="${CELERYD_LOG_LEVEL}" $CELERYD_OPTS'
ExecStop=/bin/sh -c '${CELERY_BIN} multi stopwait $CELERYD_NODES     --pidfile= --loglevel="${CELERYD_LOG_LEVEL}"'
ExecReload=/bin/sh -c '${CELERY_BIN} -A $CELERY_APP multi restart $CELERYD_NODES     --pidfile=${CELERYD_PID_FILE} --logfile=${CELERYD_LOG_FILE}     --loglevel="${CELERYD_LOG_LEVEL}" $CELERYD_OPTS'
Environment=AIDE_CONFIG_PATH=/home/azureuser/aerial_wildlife_detection/config/settings.ini
Environment=AIDE_MODULES=LabelUI,AIController,AIWorker,FileServer
Environment=PYTHONPATH=/home/azureuser/aerial_wildlife_detection
Restart=always

[Install]
WantedBy=multi-user.target

I've done some searching for this issue to see if it is a bug with Celery itself but haven't found anything decisive. My main experience with Celery is through Django and I often have very simple Celery app definitions. My OS is Ubuntu 20.04.3 and I'm using commit 087aa40bf4938346b4bb5c98038d4027e70c8b53.

Package versions:

absl-py==0.15.0
amqp==5.0.6
antlr4-python3-runtime==4.8
appdirs==1.4.4
bcrypt==3.2.0
billiard==3.6.4.0
black==21.4b2
bottle==0.12.19
cachetools==4.2.4
celery==5.1.2
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.7
click==7.1.2
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.2.0
cloudpickle==2.0.0
cryptography==35.0.0
cycler==0.11.0
Cython==0.29.24
detectron2==0.6+cu111
future==0.18.2
fvcore==0.1.5.post20211023
google-auth==2.3.3
google-auth-oauthlib==0.4.6
grpcio==1.41.1
gunicorn==20.1.0
hydra-core==1.1.1
idna==3.3
importlib-resources==5.4.0
iopath==0.1.9
kiwisolver==1.3.2
kombu==5.2.0
Markdown==3.3.4
matplotlib==3.4.3
msgpack==1.0.2
mypy-extensions==0.4.3
netifaces==0.11.0
numpy==1.21.4
oauthlib==3.1.1
omegaconf==2.1.1
opencv-python==4.5.4.58
pathspec==0.9.0
Pillow==8.4.0
portalocker==2.3.2
prompt-toolkit==3.0.22
protobuf==3.19.1
psycopg2-binary==2.9.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycocotools==2.0.2
pycparser==2.20
pydot==1.4.2
pyparsing==3.0.4
python-dateutil==2.8.2
pytz==2021.3
PyYAML==6.0
redis==3.5.3
regex==2021.11.2
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
six==1.16.0
tabulate==0.8.9
tensorboard==2.7.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
termcolor==1.1.0
toml==0.10.2
torch==1.9.0+cu111
torchvision==0.10.0+cu111
tqdm==4.62.3
typing-extensions==3.10.0.2
urllib3==1.26.7
vine==5.0.0
wcwidth==0.2.5
Werkzeug==2.0.2
yacs==0.1.8
zipp==3.6.0

Any pointers would be much appreciated.

bkellenb commented 2 years ago

Hello!

Thank you for opening the issue. I believe I have been able to replicate the error; it lies in the way Celery is invoked in the systemd service: /bin/sh -c '${CELERY_BIN} … cannot work because ${CELERY_BIN} is a Python file and not a shell script. I checked my logs and found the same error, but had not noticed it before because, curiously enough, the Celery daemon worked nonetheless in my tests.

The quickest solution for you would be to do the following:

  1. Log in as root: sudo -s
  2. Execute these lines, one by one:
    sed -i "s/\/bin\/sh -c '/\/home\/azureuser\/anaconda3\/envs\/aide\/bin\/python /g" /etc/systemd/system/aide-worker.service
    sed -i "s/'$//g" /etc/systemd/system/aide-worker.service
    systemctl daemon-reload
    service aide-worker stop
    service aide-worker start

Sanity check: your systemd service file should now look like this:

[Unit]
Description=Celery Service for AIDE AIWorker
After=network.target
After=rabbitmq-server.service
After=redis.service
After=postgresql.service

[Service]
Type=forking
User=aide_celery
Group=aide
EnvironmentFile=/etc/default/celeryd_aide
WorkingDirectory=/home/azureuser/aerial_wildlife_detection
ExecStart=/home/azureuser/anaconda3/envs/aide/bin/python ${CELERY_BIN} -A $CELERY_APP multi start $CELERYD_NODES     --pidfile=${CELERYD_PID_FILE} --logfile=${CELERYD_LOG_FILE}     --loglevel="${CELERYD_LOG_LEVEL}" $CELERYD_OPTS
ExecStop=/home/azureuser/anaconda3/envs/aide/bin/python ${CELERY_BIN} multi stopwait $CELERYD_NODES     --pidfile= --loglevel="${CELERYD_LOG_LEVEL}"
ExecReload=/home/azureuser/anaconda3/envs/aide/bin/python ${CELERY_BIN} -A $CELERY_APP multi restart $CELERYD_NODES     --pidfile=${CELERYD_PID_FILE} --logfile=${CELERYD_LOG_FILE}     --loglevel="${CELERYD_LOG_LEVEL}" $CELERYD_OPTS
Environment=AIDE_CONFIG_PATH=/home/azureuser/aerial_wildlife_detection/config/settings.ini
Environment=AIDE_MODULES=LabelUI,AIController,AIWorker,FileServer
Environment=PYTHONPATH=/home/azureuser/aerial_wildlife_detection
Restart=always

[Install]
WantedBy=multi-user.target

Essentially, /bin/sh -c should be replaced with the path to the correct Python executable, and the single quotes that wrapped each command should be removed.
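
For anyone who would rather preview the change before editing the unit file in place, the two sed substitutions can be mirrored on plain text. This is a minimal sketch, not part of the AIDE installer: `rewrite_exec_line` and `rewrite_unit` are hypothetical names, and the interpreter path is copied from the sed command above, so it would need adjusting on another machine.

```python
import re

# Interpreter path copied from the sed command in this thread (machine-specific).
PYTHON = "/home/azureuser/anaconda3/envs/aide/bin/python"


def rewrite_exec_line(line, python=PYTHON):
    """Swap the "/bin/sh -c '" wrapper for the interpreter and drop a trailing quote."""
    line = line.replace("/bin/sh -c '", python + " ")
    return re.sub(r"'$", "", line)


def rewrite_unit(text, python=PYTHON):
    """Apply the rewrite to every line of a unit file's text."""
    return "\n".join(rewrite_exec_line(line, python) for line in text.split("\n"))


before = "ExecStop=/bin/sh -c '${CELERY_BIN} multi stopwait $CELERYD_NODES'"
after = rewrite_exec_line(before)
print(after)
```

Lines without the wrapper, such as Restart=always, pass through unchanged, so the function can be run over the whole file and the result compared against the sanity-check listing above.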

Also, running journalctl -xe after restarting the worker daemon process should show no more "is a directory" errors.
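
If the journal is long, the check can be done on captured text instead of by eye. The helper below is a hypothetical sketch: `directory_errors` is not an existing tool, and the sample lines are illustrative except for the error line, which is taken from the traceback in this issue. How the journal text is captured (for example by redirecting journalctl output to a file) is left open.

```python
def directory_errors(log_text):
    """Return the log lines that mention an 'is a directory' error, case-insensitively."""
    return [line for line in log_text.splitlines() if "is a directory" in line.lower()]


sample_log = "\n".join([
    "Started Celery Service for AIDE AIWorker.",
    "IsADirectoryError: [Errno 21] Is a directory: '/home/azureuser/aerial_wildlife_detection'",
    "Stopped Celery Service for AIDE AIWorker.",
])

hits = directory_errors(sample_log)
print(len(hits))
```

An empty result for entries logged after the restart would suggest the fix took effect.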

The rest looks fine (although note that environment variable CELERYBEAT_PID_FILE is specified twice in your environment file).
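
Duplicate assignments like that one are easy to miss in a longer environment file. As a rough sketch, a few lines of Python can list them; `duplicate_keys` is a hypothetical helper, and it assumes the simple KEY=value layout used in the file above, with # comments and blank lines ignored.

```python
from collections import Counter


def duplicate_keys(env_text):
    """Return variable names assigned more than once, in order of first appearance."""
    keys = []
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.append(line.split("=", 1)[0].strip())
    counts = Counter(keys)
    return [key for key in dict.fromkeys(keys) if counts[key] > 1]


sample = "\n".join([
    'CELERYBEAT_PID_FILE="/tmp/celeryd_aide_beat.pid"',
    'CELERYBEAT_PID_FILE="/var/run/celery/beat.pid"',
    'CELERYBEAT_LOG_FILE="/var/log/celery/celeryd_aide_beat.log"',
])

print(duplicate_keys(sample))
```

Running it on the contents of /etc/default/celeryd_aide would flag CELERYBEAT_PID_FILE, leaving it to the operator to decide which of the two paths to keep.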

Please let me know if this helped. I will update the installer script with the next commit.

bkellenb commented 2 years ago

Fixed installer in commit e9b37dd.

ryan-ntt commented 2 years ago

Thanks for looking into this!

I've updated the service files to execute via python rather than the celery binary. The tasks are running now, which is great. I've noticed some subsequent unhandled errors in the worker logs; I'll dig into them further and work out whether it's user error before opening another issue.

Really appreciate the help.