wu-hanqing opened this issue 1 year ago
Hi @wu-hanqing, I would like to pick this up as part of GSOC 2023. Since the idea is there, should I focus on implementation in the proposal?
Hi @kriti-sc, I am glad you are interested in this project. However, right now there is only an idea without a clear design plan. Should we first discuss a plan, such as which framework/library to use, and write demos to verify it?
Hi @wu-hanqing. I agree with you. Since the idea is already there, I am working on an approach to resolve this issue.
@kriti-sc Ok, feel free to try it.
I want to try it.
Of course, please note the timeline, and feel free to raise any ideas or questions you may have.
Hi all- just found this project and started to look into it. However, I realized that OpenTracing has been archived. Is there a reason for using this library, and do you have any other libraries in mind? I am thinking about OpenCensus but would like to hear your thoughts. Thanks.
Hi, OpenTracing and OpenCensus have been merged into OpenTelemetry, so you can try that instead.
Hi all, I will be withdrawing from this feature. I outline my initial thoughts below.
The goal is to enable tracing in CurveBS and, using the trace data, analyze the latency of IO requests. There are three components to building the solution:
This step introduces instrumentation into the codebase to trace an IO request as it flows through the system. I intend to use OpenTelemetry, as that is the standard today; OpenTracing (which the issue mentions) has been subsumed by OpenTelemetry. OpenTelemetry provides a C++ API and SDK, so I intend to use that. The following are the pieces involved in gathering trace data:
Trace: A trace represents the entire execution path of a request. For an IO request in Curve, the trace starts when the user/client first makes the Curve IO call and ends when Curve has completed the request and responded. A unique trace ID is generated when the trace is started.
Span: A span represents a single unit of work within the execution path of a request, and a trace may contain multiple spans. For example, servicing an IO request involves multiple Curve components and multiple function calls within Curve; each function call becomes one span associated with the original trace. A span is started when its function begins and ended just before the function returns, so each span records the start and end times of that call and holds a reference to the trace it belongs to.
Context Propagation: A single function usually makes several further function calls, so calls are nested. To understand the execution path of a request, it is important to capture this nesting: where the current function was called from and the state of the call stack at that point. This is achieved with context propagation. The caller stores the relevant telemetry data (its active span) as context and propagates it to the callee; the callee records its own telemetry against that context, and the context flows back to the caller when the call returns. Context propagation therefore happens around every instrumented function call, with the context being the span of each function.
Put together, these pieces give us a complete picture of the execution path of a request, along with how long each step took; custom attributes and metrics can be added as well. A sketch of what this could look like in code follows.
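To make the above concrete, here is a minimal sketch (not actual Curve code) of how the instrumentation could look with the OpenTelemetry C++ API: a parent span for the IO request and a child span for a nested call, tied together through the active-span context. The function names, tracer name, and attribute are hypothetical placeholders.

```cpp
// Minimal sketch: tracing one IO request with a nested span via the
// OpenTelemetry C++ API. All names below are placeholders, not Curve code.
#include "opentelemetry/trace/provider.h"

namespace trace_api = opentelemetry::trace;

void ReadChunk()  // hypothetical callee
{
  auto tracer = trace_api::Provider::GetTracerProvider()->GetTracer("curve_client");
  // With a span already active in the caller, this span is created as its
  // child, capturing the nested call structure.
  auto span  = tracer->StartSpan("ReadChunk");
  auto scope = tracer->WithActiveSpan(span);  // make it the active span in this scope
  // ... the actual work being timed would happen here ...
  span->End();  // records the end timestamp
}

void HandleIORequest()  // hypothetical entry point of an IO request
{
  auto tracer = trace_api::Provider::GetTracerProvider()->GetTracer("curve_client");
  auto span   = tracer->StartSpan("IORequest");  // starts the trace; a trace ID is generated
  span->SetAttribute("io.size", 4096);           // example of a custom attribute
  auto scope  = tracer->WithActiveSpan(span);    // in-process context propagation
  ReadChunk();                                   // child span attaches to this one
  span->End();
}
```

Activating a span with `WithActiveSpan` is what drives in-process context propagation: any span started while that scope is alive is parented to it automatically, so the nesting of function calls is reconstructed in the trace.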
The trace data is collected on the servers where the application is running and then transported to a central system that brings together the trace data from all the servers. For this purpose, we will use the Jaeger Agent and Jaeger Collector. The Jaeger Agent will be deployed on the application servers, collect the trace data, and send it to the Jaeger Collector. The Jaeger Collector will be the central system that receives the trace data from all application servers and processes it. Both the Agent and the Collector support OpenTelemetry formats.
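As a sketch of how the export side could be wired up (the endpoint, processor choice, and use of OTLP here are assumptions rather than a settled design), the OpenTelemetry C++ SDK can ship spans over OTLP/gRPC to whichever collector is deployed, for example a Jaeger Collector listening on its OTLP port:

```cpp
// Minimal sketch: initialize an OTLP/gRPC export pipeline with the
// OpenTelemetry C++ SDK. Endpoint and processor choice are assumptions.
#include "opentelemetry/exporters/otlp/otlp_grpc_exporter_factory.h"
#include "opentelemetry/sdk/trace/simple_processor_factory.h"
#include "opentelemetry/sdk/trace/tracer_provider_factory.h"
#include "opentelemetry/trace/provider.h"

void InitTracing()
{
  opentelemetry::exporter::otlp::OtlpGrpcExporterOptions opts;
  opts.endpoint = "localhost:4317";  // assumed collector address; adjust per deployment

  auto exporter  = opentelemetry::exporter::otlp::OtlpGrpcExporterFactory::Create(opts);
  auto processor =
      opentelemetry::sdk::trace::SimpleSpanProcessorFactory::Create(std::move(exporter));
  std::shared_ptr<opentelemetry::trace::TracerProvider> provider =
      opentelemetry::sdk::trace::TracerProviderFactory::Create(std::move(processor));

  // Register globally so instrumented code can fetch tracers from this provider.
  opentelemetry::trace::Provider::SetTracerProvider(provider);
}
```

In a real deployment a batching span processor would likely replace the simple one, but the pipeline shape (exporter -> processor -> tracer provider) stays the same.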
For analysis and visualization, we will use Jaeger again, particularly the Jaeger Query service.
Some implementation considerations by @wu-hanqing:
Design docs and a PR are welcome; feel free to continue :)
I want to try it. Please assign me.
@UniverseParticle Have you encountered any difficulties?
It's difficult, and it will be a hard issue in the Curve summer coding camp.