Closed linhr closed 1 year ago
Patch coverage: 71.27
% and project coverage change: +1.88
:tada:
Comparison is base (
6b3cad3
) 65.12% compared to head (a8cc8d1
) 67.00%.:exclamation: Current head a8cc8d1 differs from pull request most recent head 2e86172. Consider uploading reports for the commit 2e86172 to get more accurate results
:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
@Nicole00 Thanks for your review! I've addressed the comments accordingly. Let me know if there is more feedback on this PR.
cc @wey-gu
Thank you @linhr for the contribution, it's great to have you in the NebulaGraph community!
What type of PR is this?
What problem(s) does this PR solve?
Issue(s) number
N/A
Description
We have found a few edge cases when the connector does not handle failures gracefully. This PR tries to make the implementation more robust, based on our observations running the connector in production in the past few months.
How do you solve it?
We have the following changes to the connector:
NebulaBatchExecutor.executeBatch()
method. This prevents Java heap memory to grow out-of-bound when there are a large number of correlated errors.We introduce three new options for the Table API connector:
max-retries
: Maximum number of retries in the execution. The default value is3
.retry-delay-ms
: The delay between retries, in milliseconds. The default value is1000
.failure-handler
: The failure handling mode when execution retries are exhausted. The valid values are (1)ignore
, which skips the batch that fails; (2)fail
, which fails the task. When the task fails, Flink may restart the task depending on how the job is configured. This also allows for detection and manual intervention when there is an unrecoverable error (due to corrupted data, unhealthy NebulaGraph cluster, bad network connection, etc.). The default value isignore
, which is consistent with the current implementation. So the option backward-compatible.The DataStream connector can be configured similarly using
ExecutionOptions
.Special notes for your reviewer, ex. impact of this fix, design document, etc:
N/A