Closed: nasushkov closed this issue 2 years ago
Can you eliminate Docker from the equation? Most of the networking issues involving Docker are related to Docker itself, so, eliminating it would help ensure the problem is on the collector side.
You can use this as a reference: https://github.com/jpkrohling/opentelemetry-collector-deployment-patterns/tree/main/pattern-4-load-balancing
@jpkrohling thanks for getting back. I'll try to eliminate Docker from my setup, as you suggested above. Nevertheless, I already checked your reference and it seems to be a little outdated at first glance:
```
Error: failed to get config: cannot unmarshal the configuration: error reading exporters configuration for "loadbalancing": 1 error(s) decoding:
- 'protocol.otlp' has invalid keys: insecure
2022/02/16 11:46:12 collector server run finished with error: failed to get config: cannot unmarshal the configuration: error reading exporters configuration for "loadbalancing": 1 error(s) decoding:
```
It seems that `insecure: true` should be nested under a `tls` key.
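For reference, a corrected exporter section might look something like this (the backend hostnames here are placeholders, not from the original thread):

```yaml
exporters:
  loadbalancing:
    protocol:
      otlp:
        # `insecure` moved under `tls` in newer collector versions
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - collector-1:4317
          - collector-2:4317
```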
```
Error: unknown flag: --metrics-addr
2022/02/16 14:29:20 collector server run finished with error: unknown flag: --metrics-addr
```
So, any ideas on how I can configure this port? Right now I can't start two collectors, as they conflict on 8888.
Yes, you can use something like this in your configuration file, close to where the pipelines are defined:
```yaml
service:
  telemetry:
    metrics:
      address: ":8988"
```
@jpkrohling awesome, now it works. I managed to test my setup without Docker and the problem is gone. So you are right, there is an issue on the Docker side. What I have in mind is that docker-compose does not really wait for the actual readiness of the services listed in `depends_on`; it just starts them in the order provided. Can that be the issue, or maybe you have some ideas about what can go wrong?
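A possible mitigation for the `depends_on` ordering problem (a sketch, not from the original thread): give the backing collector a healthcheck and use the long-form `depends_on` with `condition: service_healthy`. The image name and probe command below are illustrative assumptions; note the stock collector images are distroless, so a `wget`-based probe may not be available there.

```yaml
services:
  collector-backend:
    image: otel/opentelemetry-collector-contrib:0.43.0
    healthcheck:
      # Assumes the collector's health_check extension is enabled on port 13133.
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:13133"]
      interval: 5s
      timeout: 2s
      retries: 10
  collector-lb:
    image: otel/opentelemetry-collector-contrib:0.43.0
    depends_on:
      collector-backend:
        condition: service_healthy
```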
Besides that, I have some questions regarding the architecture:
Can I use multiple instances of LB to eliminate a single point of failure (i.e. LB)?
Yes. For highly elastic services, you might have consistency problems: one load balancer might have a different list of backends until it refreshes it again. The result is that trace IDs might end up on different backends for a short period of time. If this is critical to you, keep the TTL low. If that's still not acceptable, let me know.
What was the reason to use gRPC for communication between LB and downstream collectors? Can it be a performance issue (TCP as a transport layer)?
gRPC is the default transport for OTLP. I don't see it as being a source of performance issues. Do you have a specific problem in mind?
> Yes. For highly elastic services, you might have consistency problems: one load balancer might have a different list of backends until it refreshes it again. The result is that trace IDs might end up on different backends for a short period of time. If this is critical to you, keep the TTL low. If that's still not acceptable, let me know.
So, it means that I can schedule 2 or more LB-s and it will work except that I can have some temporary problems when scaling up/down, right?
> gRPC is the default transport for OTLP. I don't see it as being a source of performance issues. Do you have a specific problem in mind?
We prefer UDP over TCP as a transport protocol for telemetry (at least we use UDP on our agents). In particular, we had a hard time in the past with TCP due to its overhead and head-of-line (HOL) blocking. However, as far as I understand, OTLP does not support UDP at the moment, so I just wonder whether it can become a bottleneck, especially in the case of downstream collector failures. In that case we can lose some packets and HOL blocking can happen.
> So, it means that I can schedule 2 or more LB-s and it will work except that I can have some temporary problems when scaling up/down, right?
Correct
> We prefer UDP over TCP as a transport protocol for telemetry
That's not supported with OTLP. You can have your load balancers be configured to accept data via UDP with the Jaeger receiver, for instance, but the communication between the load balancers and the backing collectors is going to use either HTTP or gRPC, as only the OTLP exporter is supported.
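A sketch of that shape for one load-balancer instance (hostnames and ports are illustrative; the Jaeger receiver's `thrift_compact` protocol listens on UDP 6831 by default):

```yaml
receivers:
  jaeger:
    protocols:
      # Accept spans over UDP from Jaeger agents/clients
      thrift_compact:
        endpoint: 0.0.0.0:6831

exporters:
  loadbalancing:
    # Fan out to the backing collectors over OTLP/gRPC
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - collector-1:4317
          - collector-2:4317

service:
  pipelines:
    traces:
      receivers: [jaeger]
      exporters: [loadbalancing]
```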
Ok, I think the next step would be to deploy this setup in k8s and test it under load. Also, would you mind if we open a PR in the future to support the Jaeger exporter, in case we've got any issues with TCP?
Regarding the issue, I'll close it for now. Thanks for your support!
> Also, would you mind if we open a PR in the future to support the Jaeger exporter, in case we've got any issues with TCP?
If you get into concrete problems, do open an issue and we can certainly discuss it!
> Yes. For highly elastic services, you might have consistency problems: one load balancer might have a different list of backends until it refreshes it again. The result is that trace IDs might end up on different backends for a short period of time. If this is critical to you, keep the TTL low. If that's still not acceptable, let me know.

> So, it means that I can schedule 2 or more LB-s and it will work except that I can have some temporary problems when scaling up/down, right?

> Correct
This thread may be a bit dated, but I am interested in the rationale that two or more LB instances are able to maintain the integrity of trace-ID awareness, disregarding the short-term issues. What is the mechanism that keeps each LB instance informed about the others' trace-ID routing, since the physical LB in front of the collector LB instances may send spans to different collector LB instances? See the flow I am describing below:
```
Physical LB --> Collector LB instance 1 --> load balancing to multiple collector backends
Physical LB --> Collector LB instance 2 --> load balancing to multiple collector backends
```
Update: I listened to Juraci's talk again, and it seems that each LB collector instance calculates a hash of the trace ID to determine which backend to send to, so I suppose multiple LB instances would maintain integrity by using the same algorithm, without the need to share state.
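The idea above can be sketched as follows. Note this is a simplified stand-in, not the exporter's actual code: the real loadbalancing exporter uses a consistent-hashing ring (so that adding or removing a backend remaps as few trace IDs as possible), while the modulo mapping below only illustrates the key property that identical algorithm plus identical backend list yields identical routing with no shared state.

```python
import hashlib

def pick_backend(trace_id: str, backends: list[str]) -> str:
    """Deterministically map a trace ID onto one backend.

    Any LB instance running this same function over the same
    (sorted) backend list will pick the same backend, which is
    why no state sharing between LB instances is needed.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]

backends = ["collector-1:4317", "collector-2:4317", "collector-3:4317"]
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

# Two independent "LB instances" with the same backend list agree:
assert pick_backend(trace_id, backends) == pick_backend(trace_id, list(backends))
```

The catch mentioned earlier in the thread still applies: during a refresh window the two instances may briefly hold *different* backend lists, and then the same trace ID can land on different backends until the lists converge.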
Describe the bug
I'm trying to make a proof of concept for a tail-based sampling solution, which is somewhat described here. I've built two distributions and placed each in a separate Docker image. One is built with the load-balancing exporter and the other with the tail-based sampling processor. I test this setup locally with a simple docker-compose file (described below).
I send spans to the load-balancing collector (UDP port 6832) and I can see them being exported, as the logging exporter shows. However, I don't see them being received by the tail-based sampling collector down the pipeline. What am I doing wrong here?
Steps to reproduce
What did you expect to see?
Spans are shown in the Jaeger UI
What did you see instead?
I don't see any spans and they were not received by the tail-sampling collector.
What version did you use?
Version: v0.42.0, v0.43.0 for the `tailsamplingprocessor`
What config did you use?
Load balancer configuration:
Tail-based sampling configuration:
Here is my docker-compose file:
Environment OS: macOS Big Sur (11.3.1)
Additional context
I use the following Dockerfile to build stuff:
Makefile with a build-prod script: