numaproj / numaflow-python

Numaflow Python SDK
Apache License 2.0

feat: UDF container should crash on exceptions #160

Closed kohlisid closed 5 months ago

kohlisid commented 5 months ago

Requirement

On any exception raised in the user code or during its execution, make the UDF container restart.
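A minimal sketch of the behaviour this requirement describes, not the SDK's actual implementation. The names `run_or_crash`, `kill_current_process`, and `on_fatal` are hypothetical and only illustrate the idea: wrap the user handler, log the failure, then terminate the process so that the container runtime restarts it.

```python
import asyncio
import logging
import os
import signal

logger = logging.getLogger(__name__)


def kill_current_process() -> None:
    """Hypothetical helper: hard-kill this process so the container restarts."""
    os.kill(os.getpid(), signal.SIGKILL)


async def run_or_crash(handler, *args, on_fatal=kill_current_process):
    """Await a user handler; on any failure, log it and trigger on_fatal.

    `on_fatal` is injectable (an assumption of this sketch) so the wrapper
    can be exercised in tests without killing the test process.
    """
    try:
        return await handler(*args)
    except asyncio.CancelledError:
        # Task cancellation is part of normal shutdown, not a user-code failure.
        raise
    except BaseException as err:
        # BaseException is caught on purpose: sys.exit() raises SystemExit,
        # which a plain `except Exception` would not see.
        logger.critical("Killing process: Got exception %r", err)
        on_fatal()
        raise
```

Injecting `on_fatal` is a design choice for testability; a real servicer would more likely call the kill helper directly.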

Use Case

Flow

For all UDF types apart from Reduce and Reducestreamer, we do not expect the Numa container to crash when the UDF container crashes.

Testing

Testing logs

UDF container logs

2024-05-01 07:44:26 CRITICAL UDFError, re-raising the error
Traceback (most recent call last):
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/pynumaflow/reducestreamer/servicer/task_manager.py", line 198, in __invoke_reduce
    _ = await new_instance(keys, request_iterator, output, md)
  File "/app/example.py", line 39, in handler
    sys.exit(1)
SystemExit: 1
CRITICAL:pynumaflow._constants:Traceback (most recent call last):
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/pynumaflow/reducestreamer/servicer/task_manager.py", line 198, in __invoke_reduce
    _ = await new_instance(keys, request_iterator, output, md)
  File "/app/example.py", line 39, in handler
    sys.exit(1)
SystemExit: 1
2024-05-01 07:44:26 CRITICAL 1
CRITICAL:pynumaflow._constants:1
2024-05-01 07:44:26 CRITICAL Reduce Streaming Error: AbortError('Locally aborted.')
Traceback (most recent call last):
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/pynumaflow/reducestreamer/servicer/task_manager.py", line 210, in process_input_stream
    async for request in request_iterator:
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/pynumaflow/reducestreamer/servicer/async_servicer.py", line 27, in datum_generator
    async for d in request_iterator:
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 607, in _async_message_receiver
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 132, in read
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 82, in grpc._cython.cygrpc.RPCState.raise_for_termination
grpc._cython.cygrpc.AbortError: Locally aborted.
CRITICAL:pynumaflow._constants:Reduce Streaming Error: AbortError('Locally aborted.')
Traceback (most recent call last):
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/pynumaflow/reducestreamer/servicer/task_manager.py", line 210, in process_input_stream
    async for request in request_iterator:
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/pynumaflow/reducestreamer/servicer/async_servicer.py", line 27, in datum_generator
    async for d in request_iterator:
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 607, in _async_message_receiver
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 132, in read
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 82, in grpc._cython.cygrpc.RPCState.raise_for_termination
grpc._cython.cygrpc.AbortError: Locally aborted.
INFO:pynumaflow._constants:Killing process: Got exception SystemExit(1)
2024-05-01 07:44:26 INFO     Killing process: Got exception SystemExit(1)
Killed
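The traceback above comes from a test handler that calls `sys.exit(1)`. As a small illustration of why that case needs special care (the function names here are hypothetical, not taken from the PR), `SystemExit` derives from `BaseException` rather than `Exception`, so a generic `except Exception` block would let it pass through:

```python
import sys


def handler():
    # Mirrors the failing user code in the traceback: exit from inside the UDF.
    sys.exit(1)


def classify(fn):
    """Report which except clause ends up catching the error raised by fn."""
    try:
        fn()
    except Exception:
        return "caught by except Exception"
    except BaseException as err:
        return f"caught only by except BaseException: {err!r}"
    return "no error"


if __name__ == "__main__":
    print(classify(handler))
```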

Numa Container logs

2024/04/30 07:56:22 failed ReduceFn stream.Recv(): Retryable: SystemExit(1)
2024/04/30 07:56:22 failed ReduceFn stream.Recv(): Retryable: SystemExit(1)
{"level":"panic","ts":"2024-04-30T07:56:22.400626709Z","logger":"numaflow.ReduceUDF-processor","caller":"pnf/pnf.go:155","msg":"Got an error while invoking ApplyReduce{error 26 0  gRPC client.ReduceFn failed, Retryable: SystemExit(1)}","pipeline":"even-odd-sum","vertex":"compute-sum","stacktrace":"github.com/numaproj/numaflow/pkg/reduce/pnf.(*ProcessAndForward).invokeUDF\n\t/Users/skohli/numa/numaflow/pkg/reduce/pnf/pnf.go:155"}
panic: Got an error while invoking ApplyReduce{error 26 0  gRPC client.ReduceFn failed, Retryable: SystemExit(1)}

goroutine 208 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x0?, {0x0?, 0x0?, 0xc0007a6c40?})
        /Users/skohli/go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0007e89c0, {0x0, 0x0, 0x0})
        /Users/skohli/go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:262 +0x24e
go.uber.org/zap.(*SugaredLogger).log(0xc0007a9318, 0x4, {0x0?, 0x0?}, {0xc0009b3ee0?, 0x0?, 0x0?}, {0x0, 0x0, 0x0})
        /Users/skohli/go/pkg/mod/go.uber.org/zap@v1.26.0/sugar.go:316 +0xec
go.uber.org/zap.(*SugaredLogger).Panic(...)
        /Users/skohli/go/pkg/mod/go.uber.org/zap@v1.26.0/sugar.go:159
github.com/numaproj/numaflow/pkg/reduce/pnf.(*ProcessAndForward).invokeUDF(0xc000520d20, {0x2a9dda0, 0xc0007aa0a0}, 0x0?, 0xc0003e13c0, {0x4047601468?, 0xc0006b6630?})
        /Users/skohli/numa/numaflow/pkg/reduce/pnf/pnf.go:155 +0x39a
created by github.com/numaproj/numaflow/pkg/reduce/pnf.(*ProcessAndForward).AsyncSchedulePnF in goroutine 1
        /Users/skohli/numa/numaflow/pkg/reduce/pnf/pnf.go:131 +0x1f9

TODOs

Done:

1) Multiproc restart will work after #161
2) Reduce sends context cancelled on restart which doesn't make the numa container restart as it retries directly on that error.
3) Async reduce error Unit test cases terminating even after mocking the method

codecov[bot] commented 5 months ago

Codecov Report

Attention: Patch coverage is 90.60403%, with 14 lines in your changes missing coverage. Please review.

Project coverage is 94.65%. Comparing base (cb37054) to head (0f0daec).

| Files | Patch % | Lines |
|---|---|---|
| ...numaflow/reducestreamer/servicer/async_servicer.py | 41.17% | 10 Missing :warning: |
| pynumaflow/mapstreamer/servicer/async_servicer.py | 90.90% | 0 Missing and 1 partial :warning: |
| pynumaflow/reducer/servicer/async_servicer.py | 92.30% | 1 Missing :warning: |
| pynumaflow/reducestreamer/servicer/task_manager.py | 75.00% | 1 Missing :warning: |
| pynumaflow/sourcer/servicer/async_servicer.py | 96.00% | 0 Missing and 1 partial :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #160      +/-   ##
==========================================
- Coverage   94.98%   94.65%   -0.34%
==========================================
  Files          52       52
  Lines        1956     2002      +46
  Branches      120      119       -1
==========================================
+ Hits         1858     1895      +37
- Misses         69       78       +9
  Partials       29       29
```


yhl25 commented 5 months ago

What is the status of this PR? Is it ready for review?