Closed avahahn closed 1 month ago
There are a bunch of files which do not have any change in them besides a rearrange / alignment. Those files should not form part of this PR.
Well, they can (and probably should) be part of the PR, as I assume they form preparatory work, but should be done as separate commits first.
There are a bunch of files which do not have any change in them besides a rearrange / alignment. Those files should not form part of this PR.
Well, they can (and probably should) be part of the PR, as I assume they form preparatory work, but should be done as separate commits first.
They are just alignment changes, things like
- something
- .call()
+ something.call()
or
- something(StructName{property});
+ something( StructName { property } );
I don't see how these are prep works 🙁 , though I might be missing something.
I don't see how these are prep works 🙁 , though I might be missing something.
Well, if they're not being subsequently touched then leave them alone full stop...
Great job! I wonder if
nxt_otel_send_trace()
is asynchronous or synchronous?
I would like to see some document describing the overall architecture of this... especially as a lot of it is effectively hidden away in a black box...
Great job! I wonder if
nxt_otel_send_trace()
is asynchronous or synchronous?
that function call is synchronous. It grabs the span at the pointer, and sends it to the channel and returns.
The span will then be handled asynchronously on the other end of the channel, but NGINX no longer needs to know about it.
Great job! I wonder if
nxt_otel_send_trace()
is asynchronous or synchronous?that function call is synchronous. It grabs the span at the pointer, and sends it to the channel and returns.
The span will then be handled asynchronously on the other end of the channel, but NGINX no longer needs to know about it.
The sender which calls nxt_otel_send_trace()
synchronously runs in the threads of the router process,
where does the receiver which receives span and handles it asynchronously run?
Great job! I wonder if
nxt_otel_send_trace()
is asynchronous or synchronous?that function call is synchronous. It grabs the span at the pointer, and sends it to the channel and returns.
The span will then be handled asynchronously on the other end of the channel, but NGINX no longer needs to know about it.
To be clear, the other side of this channel is tokio?
I think what @hongzhidao is getting at is whether this will cause Unit to block.
IIRC the plan was to have a thread pool rust side that would deal with the actual sending of the telemetry data and thus allowing Unit to get back to handling client requests ASAP.
Great job! I wonder if
nxt_otel_send_trace()
is asynchronous or synchronous?that function call is synchronous. It grabs the span at the pointer, and sends it to the channel and returns. The span will then be handled asynchronously on the other end of the channel, but NGINX no longer needs to know about it.
The sender which calls
nxt_otel_send_trace()
synchronously runs in the threads of the router process, where does the receiver which receives span and handles it asynchronously run?
when the nxt_otel_init() is called, we spawn a new thread from the rust code with the receiver passed to it, and in that thread the function called runtime
will loop over messages coming from the channel. As long as the channel is open, and no shutdown message is received, that function loops in a separate thread.
I think what @hongzhidao is getting at is whether this will cause Unit to block.
Correct.
I think what @hongzhidao is getting at is whether this will cause Unit to block.
Correct.
No, it will not block.
I think what @hongzhidao is getting at is whether this will cause Unit to block.
Correct.
No, it will not block.
That would be great.
Btw, you could test it with some performance tool likewrk
.
@avahahn
I think it would be a big help if you were to squash these two commits together, then see how it should be split up.
E.g.
So we end up with a logical set of commits that progress the incorporation of this feature. Each commit should make sense and be reviewable on its own.
That would significantly improve the review process.
Well, they can (and probably should) be part of the PR, as I assume they form preparatory work, but should be done as separate commits first.
I agree, if we just get it linted and merged we wont have to deal with it again.
I would like to see some document describing the overall architecture of this... especially as a lot of it is effectively hidden away in a black box...
In this case it is simple: the Span object has a custom destructor (implementation of the Drop trait) defined in the library we are using that calls the linked tracer object to send off the span before it is deallocated.
@avahahn
I think it would be a big help if you were to squash these two commits together, then see how it should be split up.
E.g.
- Add core OTEL stuff (C & Rust)
- Add build infrastructure
- Add tracepoints
- Wire it up to the config system
So we end up with a logical set of commits that progress the incorporation of this feature. Each commit should make sense and be reviewable on its own.
That would significantly improve the review process.
Does Add tracepoints
mean trace http request? If yes, it's good to separate core OTEL stuff and http OTEL stuff.
- Add tracepoints
- Wire it up to the config system
That would significantly improve the review process.
Does
Add tracepoints
mean trace http request? If yes, it's good to separate core OTEL stuff and http OTEL stuff.
So I was meaning these things
+#if (NXT_HAVE_OTEL)
+ nxt_otel_test_and_call_state(task, r);
+#endif
But I think I know what you're referring to and yes.
B.t.w we could probably make them a little nicer by wrapping them in macro/inline function, so e.g.
+#if (NXT_HAVE_OTEL)
+ nxt_otel_test_and_call_state(task, r);
+#endif
would become something like (and variations of)
+NXT_OTEL_TRACEPOINT(task, r);
Especially if we're going to be sprinkling these things all over the place...
Hell maybe at some point we'd want to do something like static keys.
- Add tracepoints
- Wire it up to the config system
That would significantly improve the review process.
Does
Add tracepoints
mean trace http request? If yes, it's good to separate core OTEL stuff and http OTEL stuff.So I was meaning these things
+#if (NXT_HAVE_OTEL) + nxt_otel_test_and_call_state(task, r); +#endif
But I think I know what you're referring to and yes.
@ac000,
Is it clear to you when you review nxt_otel.c
? I mean putting core and http OTEL stuff together.
nxt_otel.c
looks like pure OTEL stuff but doesn't contain the caller like http request.
@hongzhidao
@ac000, Is it clear to you when you review
nxt_otel.c
? I mean putting core and http OTEL stuff together.nxt_otel.c
looks like pure OTEL stuff but doesn't contain the caller like http request.
Are you thinking things should be moved into or out of nxt_otel.c
?
Just so we're on the same page, could you give a specific example?
Though for me the current way the PR is presented, it's hard to answer overall architectural questions like that...
The naming looks like a otel library wrapped in Unit to be used in other places if I didn’t look into internal source code, but it contains http stuff. It also contains conf validation, it’s clear that we usually put all validation stuff into nxt_conf_validation.c, I’ve commented this point. I’d like to see something description about the intention of the file.
The naming looks like a otel library wrapped in Unit to be used in other places if I didn’t look into internal source code,
Right, that's basically what it is.
If I've got it right, it's basically a rust wrapper around https://opentelemetry.io/docs/languages/rust/ with a C interface on-top.
but it contains http stuff. It also contains conf validation, it’s clear that we usually put all validation stuff into nxt_conf_validation.c, I’ve commented this point
Yes, you are quite right about that.
I’d like to see something description about the intention of the file.
I'd expect nxt_otel.c
to be the C interface to the above mentioned rust wrapper...
Essentially I see nxt_otel.c
as being analogous to say what nxt_openssl.c
is to OpenSSL...
Wait, what!?
diff --git ./auto/sources ./auto/sources
andrew@kappa unit:otel-clean $ git diff --stat master..
auto/help | 2 +
auto/options | 2 +
auto/otel | 27 +
auto/sources | 4 +
auto/summary | 1 +
configure | 4 +
docs/unit-openapi.yaml | 85 +++
src/nxt_conf_validation.c | 31 ++
src/nxt_h1proto.c | 18 +
src/nxt_http.h | 5 +
src/nxt_http_request.c | 23 +-
src/nxt_main.c | 6 +-
src/nxt_main.h | 1 -
src/nxt_otel.c | 356 ++++++++++++
src/nxt_otel.h | 57 ++
src/nxt_router.c | 41 +-
src/otel/Cargo.lock | 1724 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
src/otel/Cargo.toml | 22 +
src/otel/src/lib.rs | 291 ++++++++++
tools/unitctl/unit-openapi/README.md | 7 +
tools/unitctl/unitctl/src/cmd/applications.rs | 20 +-
tools/unitctl/unitctl/src/cmd/edit.rs | 13 +-
tools/unitctl/unitctl/src/cmd/execute.rs | 15 +-
tools/unitctl/unitctl/src/cmd/listeners.rs | 11 +-
tools/unitctl/unitctl/src/cmd/mod.rs | 2 +-
tools/unitctl/unitctl/src/cmd/save.rs | 44 +-
tools/unitctl/unitctl/src/cmd/status.rs | 11 +-
tools/unitctl/unitctl/src/main.rs | 8 +-
tools/unitctl/unitctl/src/wait.rs | 7 +-
29 files changed, 2756 insertions(+), 82 deletions(-)
Why are we messing around under tools/unitctl
?
Most of this seems to come from
commit 675ce164a631f88541d554d91b9f877ccc4f8860
Author: Ava Hahn <a.hahn@f5.com>
Date: Mon Sep 9 12:19:04 2024 -0700
Phase 2 OpenTelemetry Extensions
* migrate opentelemetry tracer lifetime into router process
* extend unit configuration API with validators for otel config
* update openapi spec
* regenerate unitctl openapi package
* add method and path fields to opentelemetry span
* spans export in configurable batches on background thread
Co-authored-by: Gabor Javorszky <g.javorszky@f5.com>
Signed-off-by: Ava Hahn <a.hahn@f5.com>
auto/otel | 4 -
docs/unit-openapi.yaml | 85 +++++
src/nxt_conf_validation.c | 34 +-
src/nxt_http_request.c | 11 +-
src/nxt_main.c | 14 +-
src/nxt_main.h | 3 -
src/nxt_otel.c | 73 +++-
src/nxt_otel.h | 28 +-
src/nxt_router.c | 41 ++-
src/otel/Cargo.lock | 1051 +++++++++++++++++++++++++++++++++++++++++++--------------
src/otel/Cargo.toml | 15 +-
src/otel/src/lib.rs | 227 ++++++++++---
tools/unitctl/unit-openapi/README.md | 7 +
tools/unitctl/unitctl/src/cmd/applications.rs | 20 +-
tools/unitctl/unitctl/src/cmd/edit.rs | 13 +-
tools/unitctl/unitctl/src/cmd/execute.rs | 15 +-
tools/unitctl/unitctl/src/cmd/listeners.rs | 11 +-
tools/unitctl/unitctl/src/cmd/mod.rs | 2 +-
tools/unitctl/unitctl/src/cmd/save.rs | 44 ++-
tools/unitctl/unitctl/src/cmd/status.rs | 11 +-
tools/unitctl/unitctl/src/main.rs | 8 +-
tools/unitctl/unitctl/src/wait.rs | 7 +-
22 files changed, 1312 insertions(+), 412 deletions(-)
All the tools/unitctl
stuff needs to go into their own commit(s)...
All the
tools/unitctl
stuff needs to go into their own commit(s)...
Yep, that's what I was going for with this comment: https://github.com/nginx/unit/pull/1431#pullrequestreview-2308788605
Every single change in the unitctl package is just code alignment, and I get that they may be prep work (https://github.com/nginx/unit/pull/1431#issuecomment-2355611262), they shouldn't form part here 🙂
All the
tools/unitctl
stuff needs to go into their own commit(s)...Yep, that's what I was going for with this comment: #1431 (review)
Every single change in the unitctl package is just code alignment, and I get that they may be prep work (#1431 (comment)), they shouldn't form part here 🙂
This would imply it's more than just that...
* update openapi spec
* regenerate unitctl openapi package
Which seems to include some name changes...
Let me quote from our CONTRIBUTING.md
- Create atomic commits. A commit should do just one thing, i.e. you
shouldn't mix refactoring with functional code changes. Do the
refactoring in one or more commits first.
Ideally you should rebase locally and force push new commits up.
This could be be expanded out to something like
Each commit should do one thing, and one thing only. If you find
yourself, in the commit message, adding phrases like "Also do [...]" or
"While in here [...]", then that's a sign that the change should have
been split into multiple commits. If your change includes some
refactoring of code to make your change possible, then that refactoring
should be a separate commit, done first. That means this preparatory
commit won't have any functional changes, and hence should be a no-op.
It also means that your main commit, with the change that you actually
care about, will be smaller and easier to review.
Each commit must stand on its own in terms of what it provides, and how
it works. Lots of changes are just a single commit, but for something a
bit more involved, it's not uncommon to have a pull request contain
multiple commits. Make each commit as simple as possible, and not any
simpler. We'd much rather see 10 simple commits than 2 more complicated
ones. If you stumble across something that needs fixing while making an
unrelated change, then please make that change as a separate commit,
explaining why it's being made.
Each commit in a series must be buildable, it's not enough that the end
result is buildable. See reason 3 in the introduction for why that's the
case.
No fixup commits! Sometimes people post a change and errors are pointed
out in the commit, and the author then does a followup fix for that
error. This isn't acceptable, please squash fixup commits into the
commit that introduced the problem in the first place. This is done by
amending the fix into the original commit that caused the issue. You can
do that with git rebase -i and arrange the commit order such that the
fixup is right after the original commit, and then use 's' (for squash)
to squash the fixup into the original commit. Don't forget to edit the
commit message while doing that, as git will combine the two commit
messages into one. Or you can do it manually. Once done, force push your
rewritten git history. See reasons 1-3 in the introduction series for
why that is.
Now I didn't just make that up... That's from the liburing CONTRIBUTING.md.
Just saying...
@avahahn
Can I kindly ask you to present these patches in a coherent manner.
As it currently stands it would be easier to review it if was all one big patch, which is how I'm currently reviewing it (git diff master..)
If we don't follow our own advice how can we expect others to?
Oopsie, looks like an errant change...
diff --git ./auto/ssltls ./auto/ssltls
index 6512d330..d8b79558 100644
--- ./auto/ssltls
+++ ./auto/ssltls
@@ -212,4 +212,4 @@ if [ $NXT_POLARSSL = YES ]; then
$echo
exit 1;
fi
-fi
+fi
\ No newline at end of file
Another newline issue...
diff --git ./auto/otel ./auto/otel
new file mode 100644
index 00000000..11387a1b
--- /dev/null
+++ ./auto/otel
@@ -0,0 +1,25 @@
+
+# Copyright (C) NGINX, Inc.
+
+
+nxt_found=no
+NXT_HAVE_OTEL=NO
+
+NXT_OTEL_LIB_LOC=src/otel/target/debug/libotel.a
+
+cat << END >> $NXT_AUTO_CONFIG_H
+#ifndef NXT_HAVE_OTEL
+#define NXT_HAVE_OTEL 1
+#endif
+END
+
+NXT_LIB_AUX_LIBS="$NXT_LIB_AUX_LIBS $(pkgconf openssl --cflags --libs || "-lssl -lcrypto") $NXT_OTEL_LIB_LOC"
+
+cat << END >> $NXT_MAKEFILE
+
+$NXT_OTEL_LIB_LOC:
+ pushd $NXT_BUILD_DIR/src/otel/
+ cargo build
+ cd ../../
+
+END
\ No newline at end of file
What editor are you using? I've also noticed some erratic indenting in your patches...
The .editorconfig
does cover these files
[{auto/**,*.toml}]
charset = utf-8
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true
But maybe your editor doesn't use it...
Can I kindly ask you to present these patches in a coherent manner.
As it currently stands it would be easier to review it if was all one big patch, which is how I'm currently reviewing it (git diff master..)
If we don't follow our own advice how can we expect others to?
I am working on that after I take care of low hanging fruit like nits and style preferences.
What editor are you using? I've also noticed some erratic indenting in your patches...
Emacs primarily
Please help me understand what otel does in http? The otel state is allocated when a new request is made.
First thing is that when the traceparent header is received it is parsed to get the span ID and parent ID. Spans may come from many other services and all end up in one unified hierarchy in the collector service.
Next the span is initialized. We make one span per request. It is either linked to the parent declared in the Traceparent header or the span id and parent may be fully autogenerated.
Once all header fields are done parsing they are added to the span. At that point it is guaranteed that any traceparent header has been parsed and that the span is fully initialized.
After the body has been processed the size is added to the span.
Finally as soon as we are sure there is no more data to parse the span is finished and sent to the background thread that sends all spans in batches to the tracer.
What's the difference between the two ways?
What would the benefit of that be over the current state machine? It seems like a slower design that would be more error prone. The current state machine for otel uses pre-existing routines in Unit like the header field parsing and generates the span as soon as it can be in small amounts.
Either way might work but I don't understand why you would prefer it all to be done at once at the end of request processing. I would worry that the request data doesnt live long enough to be copied during the request closing, and I would worry that concentrating all the work at the end would add latency to request closing.
@avahahn
What's the currently expected way to build this?
$ ./configure ... --otel
$ make
...
AR build/lib/libnxt.a
CC build/src/nxt_main.o
LD build/sbin/unitd
/usr/bin/ld: cannot find src/otel/target/debug/libotel.a: No such file or directory
collect2: error: ld returned 1 exit status
make: *** [build/Makefile:2044: build/sbin/unitd] Error 1
$
There is
#ifndef NXT_HAVE_OTEL
#define NXT_HAVE_OTEL 1
in build/include/nxt_auto_config.h
There is seemingly no
src/otel/target/debug/libotel.a:
pushd $NXT_BUILD_DIR/src/otel/
cargo build
cd ../../
target in build/Makefile
Please help me understand what otel does in http? The otel state is allocated when a new request is made.
First thing is that when the traceparent header is received it is parsed to get the span ID and parent ID. Spans may come from many other services and all end up in one unified hierarchy in the collector service.
Next the span is initialized. We make one span per request. It is either linked to the parent declared in the Traceparent header or the span id and parent may be fully autogenerated.
Once all header fields are done parsing they are added to the span. At that point it is guaranteed that any traceparent header has been parsed and that the span is fully initialized.
After the body has been processed the size is added to the span.
Finally as soon as we are sure there is no more data to parse the span is finished and sent to the background thread that sends all spans in batches to the tracer.
What's the difference between the two ways?
What would the benefit of that be over the current state machine? It seems like a slower design that would be more error prone. The current state machine for otel uses pre-existing routines in Unit like the header field parsing and generates the span as soon as it can be in small amounts.
Either way might work but I don't understand why you would prefer it all to be done at once at the end of request processing. I would worry that the request data doesnt live long enough to be copied during the request closing, and I would worry that concentrating all the work at the end would add latency to request closing.
Is it right that we add serval spans but only passed them to Rust engine once? If yes, what's the good to add spans in different places but not add them when they are prepared to send?
Is it right that we add serval spans but only passed them to Rust engine once?
we make one span, and add several events to it.
Is it right that we add serval spans but only passed them to Rust engine once?
we make one span, and add several events to it.
The two functions are defined in rust part?
nxt_otel_add_event_to_trace
nxt_otel_send_trace
Will both of them be communicating with rust?
Trying to manually build the libotel part
Compiling otel v0.1.0 (/home/andrew/src/unit/src/otel)
error[E0308]: mismatched types
--> src/lib.rs:199:35
|
199 | assert_eq!(traceparent.len(), TRACEPARENT_HEADER_LEN);
| ^^^^^^^^^^^^^^^^^^^^^^ expected `usize`, found `u8`
|
help: you can convert a `u8` to a `usize`
|
199 | assert_eq!(traceparent.len(), TRACEPARENT_HEADER_LEN.into());
| +++++++
error[E0308]: mismatched types
--> src/lib.rs:204:9
|
201 | ptr::copy_nonoverlapping(
| ------------------------ arguments to this function are incorrect
...
204 | TRACEPARENT_HEADER_LEN,
| ^^^^^^^^^^^^^^^^^^^^^^ expected `usize`, found `u8`
|
note: function defined here
--> /builddir/build/BUILD/rustc-1.81.0-src/library/core/src/intrinsics.rs:3036:21
help: you can convert a `u8` to a `usize`
|
204 | TRACEPARENT_HEADER_LEN.into(),
| +++++++
error[E0308]: mismatched types
--> src/lib.rs:207:14
|
207 | *buf.add(TRACEPARENT_HEADER_LEN) = b'\0' as _;
| --- ^^^^^^^^^^^^^^^^^^^^^^ expected `usize`, found `u8`
| |
| arguments to this method are incorrect
|
note: method defined here
--> /builddir/build/BUILD/rustc-1.81.0-src/library/core/src/ptr/mut_ptr.rs:1039:25
help: you can convert a `u8` to a `usize`
|
207 | *buf.add(TRACEPARENT_HEADER_LEN.into()) = b'\0' as _;
| +++++++
For more information about this error, try `rustc --explain E0308`.
error: could not compile `otel` (lib) due to 3 previous errors
$ cargo --version
cargo 1.81.0 (2dbb1af80 2024-08-20)
$ rustc --version
rustc 1.81.0 (eeb90cda1 2024-09-04) (Fedora 1.81.0-1.fc40
Is it right that we add serval spans but only passed them to Rust engine once?
we make one span, and add several events to it.
The two functions are defined in rust part?
nxt_otel_add_event_to_trace nxt_otel_send_trace
Will both of them be communicating with rust?
nxt_otel_add_event_to_trace
adds events to a span we started with nxt_otel_get_or_create_trace
nxt_otel_send_trace
is used when we've finished decorating the span with information as we had access to them, and at that point the span from Unit is deemed "complete", and can be sent off to the collector.
Here "event" refers to additional properties of the request: its http verb, the headers used, the path, the body size (if there's any), etc... Event should probably be called Attribute (there's a set_attribute method on the span), but the function of the add event fn is to collect the properties of the single request we have a span for.
Is it right that we add serval spans but only passed them to Rust engine once?
we make one span, and add several events to it.
The two functions are defined in rust part?
nxt_otel_add_event_to_trace nxt_otel_send_trace
Will both of them be communicating with rust?
nxt_otel_add_event_to_trace
adds events to a span we started withnxt_otel_get_or_create_trace
nxt_otel_send_trace
is used when we've finished decorating the span with information as we had access to them, and at that point the span from Unit is deemed "complete", and can be sent off to the collector.Here "event" refers to additional properties of the request: its http verb, the headers used, the path, the body size (if there's any), etc... Event should probably be called Attribute (there's a set_attribute method on the span), but the function of the add event fn is to collect the properties of the single request we have a span for.
Thanks for the explanation.
can be sent off to the collector.
What does collector mean here? Does it mean the thread running rust or the otel server outside of Unit?
What does collector mean here? Does it mean the thread running rust or the otel server outside of Unit?
There are 4 parts to most OpenTelemetry functionality:
This code implementation is mainly part 1, and merely configuring part 2.
Parts 3 and 4 are vendor supplied pieces that we hook up. This is what I used for local testing: https://github.com/grafana/docker-otel-lgtm
After "fixing" the compile errors
$ cargo build
Compiling otel v0.1.0 (/home/andrew/src/unit/src/otel)
Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.69s
$ make
AR build/lib/libnxt.a
LD build/sbin/unitd
$
Hmm...
$ ls -oh /opt/unit/sbin/unitd
-rwxr-xr-x 1 andrew 2.4M Sep 18 14:08 /opt/unit/sbin/unitd
$ ls -oh build/sbin/unitd
-rwxr-xr-x 1 andrew 92M Sep 20 16:54 build/sbin/unitd
Crikey!
$ strip build/sbin/unitd
$ ls -oh build/sbin/unitd
-rwxr-xr-x 1 andrew 21M Sep 20 16:56 build/sbin/unitd
Rust eh?1
... and /opt/unit/sbin/unitd
wasn't even stripped...
$ strip /opt/unit/sbin/unitd
$ ls -oh /opt/unit/sbin/unitd
-rwxr-xr-x 1 andrew 580K Sep 20 17:01 /opt/unit/sbin/unitd
Now I know libotel is statically linked. But even the libc .a is only 6.6MB
(Just observing...)
How do you configure this in Unit?
I'd expect to see an example in the commit message...
Parts 3 and 4 are vendor supplied pieces that we hook up. This is what I used for local testing: https://github.com/grafana/docker-otel-lgtm
Ugh. My plan is to get it running then observe it under wireshark...
Closing for now, as some large work has to be done in order to address some of these comments. I am excited to put it back up once everything has been taken care of!
This PR contains an OpenTelemetry module that is compiled during the configure phase and statically linked during compilation of Unit. The OpenTelemetry module is a Rust crate that encapsulates the logic needed to generate and send spans and traces, as well as managing batching and background threads.
During operation, Unit will initialize the OpenTelemetry tracer when it is configured by the user in the router thread. If configured, spans will be generated and stored in batches during HTTP request decoding and processing. deconfiguration and reconfiguration will reset and reinitialize the OpenTelemetry tracer.
The following changes are made to the code:
regenerate unitctl openapi package
Things to do before merge: