Add Tracing Feature for Curve

wu-hanqing commented 1 year ago

Description: Curve currently has logging and metrics, both of which can be used to analyze performance and locate problems. They improve the observability of the system, but their granularity is coarse and does not allow precise analysis of how long a request takes at each stage. Tracing links the invocation relationships between services and records invocation time per request, preserving essential information and connecting dispersed log events. This helps us better understand system behavior and assists in debugging and troubleshooting performance issues.
Expected Outcome: Design and implement a tracing solution, introduce it into CurveBS, and use it to analyze the latency of IO requests. The implementation should be extensible so that it can be applied to other modules.
Recommended Skills: C++, OpenTracing

kriti-sc commented 1 year ago

Hi @wu-hanqing, I would like to pick this up as part of GSOC 2023. Since the idea is there, should I focus on implementation in the proposal?

wu-hanqing commented 1 year ago

> Hi @wu-hanqing, I would like to pick this up as part of GSOC 2023. Since the idea is there, should I focus on implementation in the proposal?

Hi @kriti-sc, I am glad you are interested in this project. However, right now there is only an idea and no clear design plan. Should we first discuss a plan, such as which framework or library to use, and write demos to verify it?

kriti-sc commented 1 year ago

Hi @wu-hanqing. I agree with you. Since the idea is already there, I am working on an approach to resolve this issue.

caoxianfei1 commented 1 year ago

@kriti-sc OK, feel free to try it.

Ziy1-Tan commented 1 year ago

I want to try it.

wu-hanqing commented 1 year ago

> I want to try it.

Of course, please note the timeline, and feel free to raise any ideas or questions you may have.

zzzz-vincent commented 1 year ago

Hi all- just found this project and started to look into it. However, I realized that OpenTracing has been archived? Any reason for using this library and do you have any other libraries in mind? I am thinking about OpenCensus but would like to hear your thoughts. Thanks.

wu-hanqing commented 1 year ago

> Hi all- just found this project and started to look into it. However, I realized that OpenTracing has been archived? Any reason for using this library and do you have any other libraries in mind? I am thinking about OpenCensus but would like to hear your thoughts. Thanks.

Hi, OpenTracing and OpenCensus have been merged into OpenTelemetry, so you can try that.

kriti-sc commented 1 year ago

Hi all, I will be withdrawing from this feature. I outline my initial thoughts below.

The goal is to enable tracing in CurveBS and use the trace data to analyze the latency of IO requests. There are 3 components to building the solution:

Instrumentation

This step would add instrumentation to the codebase so that an IO request can be traced as it flows through the system. I intend to use OpenTelemetry for it, as that is the standard today. OpenTracing (which you mention in the issue) has been subsumed by OpenTelemetry. OpenTelemetry provides an API and SDK for C++, so I intend to use those. The following are the different pieces of gathering trace data:

Trace: A trace represents the entire execution path of the request. In the case of an IO request in Curve, a trace would start when the Curve IO call is first made by the user/client. The trace would end when the IO request has been completed and responded to by Curve. Thus, a trace will be started when the IO request makes the first Curve API call and a corresponding unique trace ID will be generated.
Span: A span represents a single unit of work within the execution path of the request. A trace may contain multiple spans. For example, to service an IO request, multiple components of Curve are involved and multiple function calls are made within Curve. Each function call will be one span that holds a reference to the trace it is part of. A span will be started when a function starts and will end just before the function returns, so each span will contain the start time and end time of the function call.
Context Propagation: A function usually calls other functions, so function calls are nested. To understand the execution path of a request, it is important to capture this nesting: where the current function was called from and the state of the call stack at that point. This is achieved using context propagation. Relevant telemetry data is stored as context in the calling function and propagated to the callee before every function call. The callee gathers its own telemetry data, adds it to the context, and the context is propagated back to the caller when the call returns. The contexts will be the spans corresponding to each function.

These three pieces of information put together are called a trace and give us an entire picture of the execution path of a request, along with how long each step took. Custom metrics can be added as well.
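To make the trace, span, and context propagation ideas above concrete, here is a minimal single-threaded sketch using only the C++ standard library. It is not the OpenTelemetry C++ API; `SpanRecord`, `Context`, `ScopedSpan`, `FinishedSpans`, `HandleWrite`, and `WriteChunk` are all hypothetical names invented for illustration. A span starts when the object is constructed and ends when it goes out of scope, and the "current context" carries the trace ID and parent span ID from caller to callee.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not the OpenTelemetry C++ API.
struct SpanRecord {
    uint64_t trace_id = 0;
    uint64_t span_id = 0;
    uint64_t parent_id = 0;  // 0 means "root span of the trace"
    std::string name;
    std::chrono::steady_clock::time_point start;
    std::chrono::steady_clock::time_point end;
};

// The context that is propagated from caller to callee.
struct Context {
    uint64_t trace_id = 0;  // 0 means "no active trace"
    uint64_t span_id = 0;   // 0 means "no active span"
};

// Finished spans are kept in memory here; a real tracer would export them.
// Not thread-safe: this sketch assumes a single thread.
inline std::vector<SpanRecord>& FinishedSpans() {
    static std::vector<SpanRecord> spans;
    return spans;
}

// The context of the span that is currently executing on this thread.
inline Context& CurrentContext() {
    thread_local Context ctx;
    return ctx;
}

inline uint64_t NextId() {
    static std::atomic<uint64_t> next{0};
    return ++next;
}

// RAII span: starts on construction, ends just before the scope exits.
class ScopedSpan {
 public:
    explicit ScopedSpan(std::string name) : saved_(CurrentContext()) {
        rec_.name = std::move(name);
        // Reuse the caller's trace, or start a new trace if there is none.
        rec_.trace_id = saved_.trace_id != 0 ? saved_.trace_id : NextId();
        rec_.span_id = NextId();
        rec_.parent_id = saved_.span_id;
        rec_.start = std::chrono::steady_clock::now();
        // Make this span the current context for any nested calls.
        CurrentContext().trace_id = rec_.trace_id;
        CurrentContext().span_id = rec_.span_id;
    }
    ~ScopedSpan() {
        rec_.end = std::chrono::steady_clock::now();
        FinishedSpans().push_back(rec_);
        CurrentContext() = saved_;  // hand control back to the caller's span
    }
    ScopedSpan(const ScopedSpan&) = delete;
    ScopedSpan& operator=(const ScopedSpan&) = delete;

 private:
    Context saved_;
    SpanRecord rec_;
};

// Hypothetical stages of an IO request.
void WriteChunk() { ScopedSpan span("WriteChunk"); }

void HandleWrite() {
    ScopedSpan span("HandleWrite");
    WriteChunk();  // becomes a child span of "HandleWrite"
}
```

Calling `HandleWrite()` records two spans that share one trace ID, with the `WriteChunk` span pointing at the `HandleWrite` span as its parent. Crossing a process boundary (for example an RPC between Curve components) would additionally require serializing the context into the request, which this sketch does not cover.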

Collecting instrumentation data

The trace data is collected on the servers where the application is running. It is then transported to a central system, where the trace data from all the servers is brought together in a single place. For this purpose, we will use Jaeger Agent and Jaeger Collector. Jaeger Agent will be deployed on the application servers, where it will collect the trace data and send it to the Jaeger Collector. Jaeger Collector will be the central system that collects and processes the trace data from all the application servers. Jaeger Agent and Collector both support OpenTelemetry formats.

Analyzing instrumentation data

For analysis and visualization, we will use Jaeger again, particularly the Jaeger Query feature.

Some implementation considerations by @wu-hanqing:

It needs to be extensible so that it can be easily applied to other modules.
The impact on performance needs to be evaluated. If the impact is significant, it must be possible to turn tracing on or off dynamically.
Deployment of related components (OpenTelemetry/Jaeger). If there is sufficient time, it is best to integrate the deployment process into curveadm.
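One possible way to meet the "dynamically turn on or off" consideration is a process-wide atomic flag that each span checks once when it starts. The following is only a sketch under that assumption, using the C++ standard library; `TracingEnabled`, `RecordedSpanCount`, `MaybeSpan`, and `DoIo` are hypothetical names, not part of Curve or OpenTelemetry.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical runtime switch; it could be flipped by a config reload or an
// admin command. Tracing is off by default in this sketch.
inline std::atomic<bool>& TracingEnabled() {
    static std::atomic<bool> enabled{false};
    return enabled;
}

// Counts spans that were actually recorded; stands in for a real exporter.
inline std::atomic<uint64_t>& RecordedSpanCount() {
    static std::atomic<uint64_t> count{0};
    return count;
}

// A span that costs one relaxed atomic load when tracing is disabled.
class MaybeSpan {
 public:
    MaybeSpan()
        : active_(TracingEnabled().load(std::memory_order_relaxed)) {
        // The decision is made once, so a span that started while tracing
        // was enabled still finishes even if the flag is cleared meanwhile.
        if (active_) start_ = std::chrono::steady_clock::now();
    }
    ~MaybeSpan() {
        if (!active_) return;
        auto elapsed = std::chrono::steady_clock::now() - start_;
        (void)elapsed;  // a real tracer would export the duration here
        RecordedSpanCount().fetch_add(1, std::memory_order_relaxed);
    }
    MaybeSpan(const MaybeSpan&) = delete;
    MaybeSpan& operator=(const MaybeSpan&) = delete;

 private:
    bool active_;
    std::chrono::steady_clock::time_point start_;
};

// Hypothetical IO path instrumented with the switchable span.
void DoIo() { MaybeSpan span; }
```

With the flag cleared, `DoIo()` skips the clock reads and the recording step entirely; calling `TracingEnabled().store(true)` enables tracing for subsequent calls without a restart. Whether this overhead is acceptable on the CurveBS IO path would still need to be measured.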

Ziy1-Tan commented 1 year ago

Design docs and PR, Welcome to continue :)

UniverseParticle commented 1 year ago

I want to try it. Please assign it to me.

Cyber-SiKu commented 1 year ago

@UniverseParticle Have you encountered any difficulties?

wuhongsong commented 1 year ago

It's difficult, and it will be a hard issue in the Curve summer coding camp.
