tinybirdco / signatures-POC

This repository is dedicated to generating dummy data and a sample dashboard that mirrors a company that manages digital document signatures in real-time.
MIT License

Unhandled Exception in Parallelized Data Processing Routine #5

Open JoeKarlsson opened 1 year ago

JoeKarlsson commented 1 year ago

While running the parallelized data processing routine (the process_data_parallel function) in the data_processing.py script, an unhandled exception occurs and halts the entire operation. The existing error handling mechanisms don't seem to work.

Steps to Reproduce

Import process_data_parallel from data_processing.py.
Run process_data_parallel(input_data, num_threads=4) where input_data is a data frame with 1 million rows.

Expected Behavior

The function should process data on all available threads without any errors and return a processed data frame.

Actual Behavior

Throws an unhandled IndexError and halts the process.

Environment

Python version: 3.8
Library versions: Pandas 1.3.3, Numpy 1.21.2
OS: Linux Ubuntu 20.04

Possible Solutions

Conventional: Add try/except blocks within each thread to catch and log exceptions for later debugging. That is the old-school approach, though, and it doesn't help the run continue with the other tasks.

Contrarian/Proactive: Implement a fallback mechanism that reroutes the failed tasks to a dedicated single thread, which could execute a more robust, although slower, data processing function.

New Technology: Utilize Python’s concurrent.futures with a custom exception handler wrapped around each future.

Quality Product: For mission-critical data pipelines, consider moving to a more robust data processing framework like Apache Flink, which has mature, checkpoint-based fault tolerance.
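To make the concurrent.futures option more concrete, here is a minimal sketch of a per-future exception handler. It is not the code in data_processing.py: process_chunk, attach_handler, record_failure and the toy list-based chunks are all hypothetical stand-ins, and process_chunk deliberately raises an IndexError on short input to mimic the failure reported here.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    # Hypothetical stand-in for the real per-chunk work.
    # Raises IndexError when the chunk has fewer than three items.
    return chunk[0] + chunk[2]


def attach_handler(future, label, on_error):
    # Wrap a future with a done-callback that forwards any exception
    # to on_error instead of letting it go unnoticed.
    def _callback(fut):
        exc = fut.exception()
        if exc is not None:
            on_error(label, exc)

    future.add_done_callback(_callback)
    return future


failures = []


def record_failure(label, exc):
    # Hypothetical handler: remember which chunk failed and why.
    failures.append((label, type(exc).__name__))


chunks = [[1, 2, 3], [4], [5, 6, 7]]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [
        attach_handler(pool.submit(process_chunk, chunk), i, record_failure)
        for i, chunk in enumerate(chunks)
    ]
# Leaving the with-block waits for all workers, so every callback has run.

succeeded = {i: f.result() for i, f in enumerate(futures) if f.exception() is None}
```

With this shape, one bad chunk ends up in failures while the other chunks still produce results in succeeded, so the run as a whole does not halt.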

Note:

Implementing the contrarian solution would let the function absorb similar errors in the future without halting the whole operation, which would make it more robust.
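A rough sketch of what that fallback could look like, under stated assumptions: fast_process and robust_process are hypothetical placeholders for the real fast and slow processing functions, and the chunks are plain lists rather than data frame slices. Tasks that raise IndexError on the parallel pool are collected and rerun on a dedicated single-worker executor.

```python
from concurrent.futures import ThreadPoolExecutor


def fast_process(chunk):
    # Hypothetical fast path: assumes every chunk has at least three items.
    return chunk[0] + chunk[2]


def robust_process(chunk):
    # Hypothetical slow path: tolerates short or empty chunks.
    first = chunk[0] if len(chunk) > 0 else 0
    third = chunk[2] if len(chunk) > 2 else 0
    return first + third


def process_with_fallback(chunks, num_threads=4):
    results = [None] * len(chunks)
    failed = []

    # First pass: run the fast function in parallel, noting which chunks fail.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(fast_process, chunk) for chunk in chunks]
        for i, future in enumerate(futures):
            try:
                results[i] = future.result()
            except IndexError:
                failed.append(i)

    # Second pass: reroute the failed chunks to a single dedicated thread.
    with ThreadPoolExecutor(max_workers=1) as fallback:
        retried = fallback.map(robust_process, [chunks[i] for i in failed])
        for i, value in zip(failed, retried):
            results[i] = value

    return results, failed


results, rerouted = process_with_fallback([[1, 2, 3], [4], [5, 6, 7], []])
```

Returning the rerouted indices alongside the results keeps the failures visible for later debugging, even though the run itself completes.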

JoeKarlsson commented 1 year ago

I'll take a look at this...