ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.76k stars 5.74k forks source link

[Serve] OSError: [Errno 9] Bad file descriptor when using syslog in Ray Serve #34497

Open RAbraham opened 1 year ago

RAbraham commented 1 year ago

What happened + What you expected to happen

Hi, When I configure syslog for Ray Serve, it works during initialization of the Ray Serve Deployment but gives the above error when an endpoint is called.

We need to setup syslog for centralized logging for SOC2 certification.

Versions / Dependencies

Ray 2.3.0 Ray Serve: 2.3.0 ubuntu:20.04 Python 3.8.10

Reproduction script

systemctl start rsyslog && systemctl enable rsyslog

apt-get install -y python-is-python3 pip install ray==2.3.0 pip install "ray[serve]==2.3.0"


- Update rsyslog configiuration

`vim /etc/rsyslog.conf` and copy below
```shell
# /etc/rsyslog.conf configuration file for rsyslog
#
# For more information install rsyslog-doc and see
# /usr/share/doc/rsyslog-doc/html/configuration/index.html
#
# Default logging rules can be found in /etc/rsyslog.d/50-default.conf

#################
#### MODULES ####
#################

module(load="imuxsock") # provides support for local system logging
#module(load="immark")  # provides --MARK-- message capability

# provides UDP syslog reception
module(load="imudp")
input(type="imudp" port="514")

# provides TCP syslog reception
module(load="imtcp")
input(type="imtcp" port="514")

$template RemInputLogs, "/var/log/remotelogs/%FROMHOST-IP%/%PROGRAMNAME%.log"
*.* ?RemInputLogs

# provides kernel logging support and enable non-kernel klog messages
module(load="imklog" permitnonkernelfacility="on")

###########################
#### GLOBAL DIRECTIVES ####
###########################

#
# Use traditional timestamp format.
# To enable high precision timestamps, comment out the following line.
#
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat

# Filter duplicated messages
$RepeatedMsgReduction on

#
# Set the default permissions for all log files.
#
$FileOwner syslog
$FileGroup adm
$FileCreateMode 0640
$DirCreateMode 0755
$Umask 0022
$PrivDropToUser syslog
$PrivDropToGroup syslog

#
# Where to place spool and state files
#
$WorkDirectory /var/spool/rsyslog

#
# Include all config files in /etc/rsyslog.d/
#
$IncludeConfig /etc/rsyslog.d/*.conf

It should show

...
     Active: active (running) 

Just check

ss -antpl | grep 514

should show something like


LISTEN 0      25                0.0.0.0:514        0.0.0.0:*    users:(("rsyslogd",pid=916831,fd=7))                   
LISTEN 0      25                   [::]:514           [::]:*    users:(("rsyslogd",pid=916831,fd=8))   

Next, run the following commands, it should show some issue in docker but it's fine

ufw allow 514/tcp
ufw allow 514/udp

Open another terminal with

docker exec -it rsyslog-container /bin/bash

Running the following logger command should show the text in the tailed logs in the other terminal

logger "RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR"

Create a file called simple_logging.py

# simple_logging.py

import logging
import logging.handlers
import os

def get_logger(queue_name=__name__):
    logger = logging.getLogger(queue_name)
    logger.setLevel(logging.DEBUG)

    syslog_address = (
        os.environ.get("SYSLOG_HOST", "localhost"),
        int(os.environ.get("SYSLOG_PORT", "514")),
    )
    syslog_handler = logging.handlers.SysLogHandler(address=syslog_address)
    logger.addHandler(syslog_handler)

    return logger

if __name__ == '__main__':
    import datetime
    l = get_logger("rajiv_bht42")
    msg = f"FROM simple_logging_util ================================ APA {datetime.datetime.utcnow()} ==========================="
    print(msg)
    l.info(msg)

Run python simple_logging.py

This will print to stdout and to the tailed logs. so get_logger works

Now, we call get_logger from simple_logging.py within a ray serve environment

@serve.deployment(route_prefix="/") @serve.ingress(app) class ServeTrial: def init(self): _logger = get_logger("attribute_prediction_api") self.logger = _logger msg = f"From Init ================================ Syslog Trial {datetime.datetime.utcnow()} ===========================" self.logger.info(msg) pass

@app.get("/probe")
async def probe(self, request: Request) -> JSONResponse:

    msg = f"From Probe ================================ Syslog Trial {datetime.datetime.utcnow()} ==========================="
    self.logger.info(msg)
    return JSONResponse(dict(result="healthy"), 200)

work_api = ServeTrial.bind()


- Start Ray 

`ray stop; ray start --head `

- Start the app
`serve run app:work_api`

You should see  in the tailed logs:
`From Init ================================ Syslog Trial <some date> ===========================`

__So it works at initialization__

- Open a third terminal with 
```shell
docker exec -it rsyslog-container /bin/bash

Run


curl -X GET  http://localhost:8000/probe

You should see it error out like:


(HTTPProxyActor pid=11130) INFO 2023-04-17 21:48:04,265 http_proxy 172.19.0.2 http_proxy.py:373 - GET / 200 18.9ms
(ServeReplica:ServeTrial pid=11198) --- Logging error ---
(ServeReplica:ServeTrial pid=11198) Traceback (most recent call last):
(ServeReplica:ServeTrial pid=11198)   File "/usr/lib/python3.8/logging/handlers.py", line 940, in emit
(ServeReplica:ServeTrial pid=11198)     self.socket.sendto(msg, self.address)
(ServeReplica:ServeTrial pid=11198) OSError: [Errno 9] Bad file descriptor
(ServeReplica:ServeTrial pid=11198) Call stack:
(ServeReplica:ServeTrial pid=11198)   File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
(ServeReplica:ServeTrial pid=11198)     self._bootstrap_inner()
(ServeReplica:ServeTrial pid=11198)   File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
(ServeReplica:ServeTrial pid=11198)     self.run()
(ServeReplica:ServeTrial pid=11198)   File "/usr/lib/python3.8/threading.py", line 870, in run
(ServeReplica:ServeTrial pid=11198)     self._target(*self._args, **self._kwargs)
(ServeReplica:ServeTrial pid=11198)   File "/usr/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
(ServeReplica:ServeTrial pid=11198)     self._run_once()
(ServeReplica:ServeTrial pid=11198)   File "/usr/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
(ServeReplica:ServeTrial pid=11198)     handle._run()
(ServeReplica:ServeTrial pid=11198)   File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
(ServeReplica:ServeTrial pid=11198)     self._context.run(self._callback, *self._args)
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/ray/util/tracing/tracing_helper.py", line 498, in _resume_span
(ServeReplica:ServeTrial pid=11198)     return await method(self, *_args, **_kwargs)
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/ray/serve/_private/replica.py", line 184, in handle_request
(ServeReplica:ServeTrial pid=11198)     return await self.replica.handle_request(query)
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/ray/serve/_private/replica.py", line 513, in handle_request
(ServeReplica:ServeTrial pid=11198)     result, success = await self.invoke_single(request)
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/ray/serve/_private/replica.py", line 453, in invoke_single
(ServeReplica:ServeTrial pid=11198)     result = await method_to_call(*args, **kwargs)
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/ray/serve/api.py", line 222, in __call__
(ServeReplica:ServeTrial pid=11198)     await self._serve_app(
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 276, in __call__
(ServeReplica:ServeTrial pid=11198)     await super().__call__(scope, receive, send)
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 122, in __call__
(ServeReplica:ServeTrial pid=11198)     await self.middleware_stack(scope, receive, send)
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 162, in __call__
(ServeReplica:ServeTrial pid=11198)     await self.app(scope, receive, _send)
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
(ServeReplica:ServeTrial pid=11198)     await self.app(scope, receive, sender)
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
(ServeReplica:ServeTrial pid=11198)     await self.app(scope, receive, send)
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 718, in __call__
(ServeReplica:ServeTrial pid=11198)     await route.handle(scope, receive, send)
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 276, in handle
(ServeReplica:ServeTrial pid=11198)     await self.app(scope, receive, send)
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 66, in app
(ServeReplica:ServeTrial pid=11198)     response = await func(request)
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 237, in app
(ServeReplica:ServeTrial pid=11198)     raw_response = await run_endpoint_function(
(ServeReplica:ServeTrial pid=11198)   File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 163, in run_endpoint_function
(ServeReplica:ServeTrial pid=11198)     return await dependant.call(**values)
(ServeReplica:ServeTrial pid=11198)   File "/./app.py", line 24, in probe
(ServeReplica:ServeTrial pid=11198)     self.logger.info(msg)
(ServeReplica:ServeTrial pid=11198) Message: 'From Probe ================================ APA 2023-04-17 21:48:04.258921 ==========================='
(ServeReplica:ServeTrial pid=11198) Arguments: ()
(ServeReplica:ServeTrial pid=11198) INFO 2023-04-17 21:48:04,263 ServeTrial ServeTrial#bjLqRw replica.py:518 - HANDLE __call__ OK 11.0ms

Issue Severity

High: It blocks me from completing my task.

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.