open-telemetry / opentelemetry-go

OpenTelemetry Go API and SDK
https://opentelemetry.io/
Apache License 2.0
5.12k stars 1.03k forks source link

Issues with current design #19

Closed cstockton closed 5 years ago

cstockton commented 5 years ago

First will there be an official process to make appeals against the projects design decisions? I am worried that collectively open-telemetry is implementing the former projects designs under what I believe to be a false premise it can "always change things later". I feel such changes will be impossible once the library is in wide spread use.

I have a number of concerns with the actual implementation here but it will be a large effort for me to enumerate them. For now I would like to start with some higher level issues that would resolve many of my code related concerns.

Exporters should not exist in this project (fundamental issue with OpenTelemetry as a whole)

If I could choose a single issue to focus on this would be it, as I feel it creates most of the design issues I have to begin with. This is an issue with the greater OpenTelemetry project as a hole. As far as "openness" OpenTelemetry is a step backwards from OpenTracing. It's placing OpenTelemetry as a gatekeeper of implementations, in fact it defines no formal specifications at all which I reported an issue about https://github.com/open-telemetry/opentelemetry-specification/issues/20 and discuss in this gist. Instead of regurgitating these points in detail here I would ask anyone interested to please read this conversation, to summarize:

OpenTracing - Each SDK provides an interface in opentracing- which trace providers may implement in their own repositories to ship traces, for example

For tracing to be useful you must have participation from many teams across your organization. It's not uncommon for a request to span many languages (java, C#, Go, Python, Nodejs, ...) at a medium sized company. Thus if you want to provide a new tracer and integrate well with an existing ecosystem you must commit the dev cycles to implement all of these languages. There really is no reason to continue on this path with the "next generation" convergence efforts of "OpenTelemetry" that will leave a footprint on the ecosystem for years.

OpenTelemetry - Each SDK provides no interface as a "standard" design principle (not that I've seen anything resembling a standard design principle as I've echo'd here, in gitter, etc). This project seems to provide an ABI only, which is a step backwards from the API of OpenTracing. In addition it seems there is a drive to have the default OpenTelemetry SDK's all talk to a central "OpenTelemetry" agent which will be bundled with exporters further diminishing "openness". Push for an agent after your SDK's reduces the per-language burden we see above with OpenTracing but inserts OpenTelemetry as the gatekeeper of tracing implementations. At the very worst the "maintainers / governing body / process" could be captured by commerical interests to impede competition. At best it slowly grows to be a more arduous and painful process as layers of "exporters" increase bloat, complexity and reduce maintainability until it's no longer worth the efforts.

MyProposal - Stop writing code. Collectively define a structural specification that will slowly produce the initial version for a wire format for carrying trace events. Separate trace backends from the clients at the protocol level. Now you can resume writing code.

Trace vendors implement 1 server for every language

Open telemetry can provide reference implementations of clients that ship to these trace servers, which is what this repository would become. But this leaves room for those who want to implement their own clients because the chosen balance of ergonomics, performance, was not right for them. As it stands now you have a monumental effort beyond writing a client for a well defined wire-format. If the maintainers of this project disagree with my stance on plugins, great! I don't care. I can write my own client. As it stands I won't have that option without also implementing the glue for the tracer which defeats the purpose of investing resources in these convergence efforts. I really would like to urge the project owners to pump the brakes a little and consider the amount of effort that could spared by taking a different approach. Or at least provide some strong technical reasons to march on with this paradigm that expand beyond "this is what we had".

Testing Frameworks

Unit tests are not just for testing results, they also lock in your public API. Using a large unit tests abstraction that accepts interfaces everywhere means that type changes are possible without breaking unit tests. It also imposes additional cognitive load for potential contributors who are unfamiliar with the chosen libraries. In short they only provide a tool to circumvent engineering rigor, relieving authors of both the responsibility to think about struct composition and the benefits of "testing" the said structures to begin with. I strongly believe the projects quality would suffer substantially by using them.

Plugins

I am not sure the design considerations behind the scenes that led to requiring a plugin for tracing via "stdout" but I very strongly object. Plugins should be entirely out of the question as an API requirement.

I could continue on, but I feel this is a decent start for a technical debate. The most important argument against plugins is that they may still be provided for proprietary tracing solutions with careful design, you define the Tracer interface and write a Tracer which is backed by a plugin. Knowing this, I think it's easy to prove there is no real reason to circumvent an API to provide an ABI.

Dependencies

First, I feel this project should depend on only the standard library, maybe a couple packages from golang.org/x if well justified. I strongly believe this should be a non-negotiable requirement because it will force a healthier design than what the current tracing libraries of today have. As it stands now this library depends on a massive (257) amount of packages go.sum and this project is just getting started. I understand that part of this is because there is a fundamental problem with the design of OpenTelemetry as a whole providing implementations rather than specifications, but there is still plenty of opportunity to improve here.

jmacd commented 5 years ago

@cstockton I don't know if this helps your concern, but the prototype code that we merged in Go was very incomplete, and I've filed issue 20 about separating the API from the SDK properly, which is a requirement from OpenTracing. I am working on this separation.

In the terminology that's developing, an "exporter" is different than a "plugin". I think I agree with your reasoning. We can have an SDK that has practically no dependencies except standard libraries. An SDK could write to stdout or it could send spans to a server.

The OpenCensus style of library put a lot more logic into the client, which allows various kinds of exporter to be implemented in the process. I'm more interested in a client that quickly encodes data for another process to process and export data, which is why we have to separate the SDK and API here and in the other repos. Still, the goal is to end up like OpenTracing, where there's an API that you can use without knowing what the SDK provides. If you're objecting to the current state of the code, please wait for issue 20 to be resolved, after that and once the trace data format begins to crystalize, we can have an SDK that does very little other than encode bytes and write them somewhere.

tigrannajaryan commented 5 years ago

@cstockton I am not an OpenTelemetry Admin or a member of TSC, the following is just my personal opinion.

My understanding is that the primary current goal for OpenTelemetry is merging of OpenCensus and OpenTracing in backwards compatible manner. The motivation is to have a clear support and migration path for existing users of OpenCensus and OpenTracing. This is already a pretty difficult goal, anything on top of it should likely be avoided at this stage.

My opinion is that any improvements should be left for later. Attempting to merge these 2 projects and at the same time come up with significant new improvements and conceptual changes will make the project a lot more difficult and possibly result in a failure. I think that would be wrong and I support gradual change and improvement approach.

I understand your suggestion to design a better wire protocol and can see the value in the particular approach suggested (shipping span events instead of complete spans). I can also see downsides to the suggested approach of shipping events, so it is not so clear-cut (I can expand on this if interested).

I applaud your effort to come up with better solutions than what we already have, however I do not think process-wise what you propose helps with the project goals. I think a better process would be for you and other interested parties to work on your own proposal in parallel to current efforts by the rest of OpenTelemetry community, come up with design documents and sample implementation demonstrating the benefits, present it to OpenTelemetry community and aim to make it part of future OpenTelemetry API (V2).

I am personally of the opinion that existing wire protocols and memory representation of span data is not well-suited to certain high-performance use cases and I plan on following my own advice and building my case for a new high-perf proto that I will present to the community if my experiments are successful.

This brings me to the other related topic on which you commented - API compatibility - so I'll answer it here.

What I wrote above is precisely why I think it is important to have a well-defined API compatibility story and guarantees. I do not think that we will be able to come up with the "best" OpenTelemetry in the first version and that we will need to have a V2. If this is true then in my opinion it is very important that today we send a clear message to all OpenTelemetry adopters (especially to third-party library authors) that when V2 is introduced it will provide zero or almost zero-effort backward compatibility story.

Strong compatibility story de-risks OpenTelemetry adoption for third-party library authors and for end users and at the same time provides a path for radical improvements in the future (so that "can always change things later" is a technical possibility).

iredelmeier commented 5 years ago

@cstockton: without getting too the in weeds here - although I will add that I agree with many of your points! - any chance you'd mind filing an RFC (or two!) here?

cstockton commented 5 years ago

What I wrote above is precisely why I think it is important to have a well-defined API compatibility story and guarantees. I do not think that we will be able to come up with the "best" OpenTelemetry in the first version and that we will need to have a V2. If this is true then in my opinion it is very important that today we send a clear message to all OpenTelemetry adopters (especially to third-party library authors) that when V2 is introduced it will provide zero or almost zero-effort backward compatibility story.

This is where my fundamental disagreement is. Merging OpenTracing and OpenCensus is going to cause massive churn for the ecosystem already- more so on the OpenTracing end as it seems that project is essentially being deleted in favor of the OpenCensus design. I can not find any benefits to go through these efforts with plans to "change everything to v2 later" - we are forcing churn in 2 communities (OpenTracing OpenCensus -> OpenTelemetry) and again for OpenTelemetry V1 -> V2. Why not do our best to come out with a V1 that has a foundation we can build upon for years? Maybe it really is impossible, but I don't understand why we can't slow down a bit and try. Both ecosystems work today, what is with the sense of urgency here? :/

@iredelmeier - I've started an issue here https://github.com/open-telemetry/opentelemetry-specification/issues/20 would changing this into a RFC in your repo yield a different outcome? Currently my issues have been dismissed under the premise "we can't change anything because of compatibility" while assuring me the same argument won't be valid in the future after N months of user adoption and thousands of lines of code are written. You'll have to forgive me if I'm skeptical about investing further time to voice my concerns to OpenTelemetry.

I'll leave it to the OpenTelemetry team to close my issues if they would like. I'll review code at some point for any security related concerns / major bugs, but refrain from commenting on the design decisions I fervently disagree with to keep progress moving forward for everyone. Thanks.

bhs commented 5 years ago

@cstockton thank you for raising this.

This issue/thread seems almost by-definition unresolvable given the number of meaty potential tangents above and the constraints of GitHub issues as a medium (maybe I'm arguing for the "threading" of an RFC PR as @iredelmeier has been advocating for). As such, I apologize in advance for only addressing a few of the points above... for the many aspects I'm not addressing, it's not because I think they're trivial (they are not).

All of that said, without further ado:

The sense of urgency: the current state (of OpenTracing, OpenCensus, and OpenTelemetry concurrently co-developing) is an unstable equilibrium for the larger ecosystem. This won't be resolved until there's a viable migration path from OpenTracing and OpenCensus code to OpenTelemetry. One blocker there could be backwards-incompatibility (which has been raised already in this thread). It's worth mentioning that an inadequate OpenTelemetry API could also be a blocker and should be taken seriously. So, the urgency is about getting to a stable equilibrium where OpenTelemetry is the clear best-choice of the three, as the unstable situation today will cause friction and anxiety; but the actual API (as opposed to the SDK) does require careful thought and scrutiny since poor execution there will be the ultimate counterweight to adoption.

Separation of API and SDK: I think @jmacd addressed this well already, but it's important to reiterate that this has been a high-level goal for the OpenTelemetry project since the beginning... e.g., the Loose coupling of components section of the blog post announcing the intention to merge these projects.

tedsuo commented 5 years ago

@cstockton on vacation currently, sorry for the rambling reply. To address your core concerns:

I really hope this feels like "being heard" and not like everyone is telling you no... you raise good points, and also are showing places where the messaging is clearly not landing - we should not be having some of these convos in github issues; the answers should be clear and obvious. The main criticism that lands (for me, at least) is there are not nearly enough high level design requirements being written down, and the data format needs to be put front and center.

Finally, I'm happy to jump on a vidchat when I come back from vacation, and discuss some of these topics live. I would also suggest coming to the OpenTelemetry Spec and community meetings if you can make it to them, and voicing your concerns there. I plan on bringing up the data-format-first and other issues you raise here in tomorrows meeting, on your behalf if you cannot attend.

cstockton commented 5 years ago

@tedsuo Thanks a lot for this response, most of your points deserve taking some time to digest and think about. One thing that is unclear to me is the API/SDK separation, perhaps this is a terminology thing and I've failed to convey one of my biggest concerns or I'm failing to understand your roadmap. I have two choices in the current design:

1) Import a vendors binary blob into my code as a plugin and have that ship my traces 2) Use the "builtin" binary blob that ships to the OpenTelemetry agent via an API not suitable for widespread adoption as a neutral wire format which then imports all of the vendor binary blobs.

But I strongly feel that I should be able to import opentelemetry-go and ship traces anywhere using one vendor neutral wire format. Without that the community isn't truly open, the biggest players are just bundling their transports together. To be honest, I would have been happy with any first revision of a wire format. Comma separated streams of JSON- sounds good. Binary stream of gzipped excel spread sheets, sure, why not. At least then the SME's for kafka, rmq, insert cool tech here can begin writing agents, aggregators, queuing mechanisms and more cool tech here to fit all kinds of use cases and begin a strong feedback loop for this newly emerging wire format.

Most importantly our (non-contributors) hands are untied and we have something to build upon.

I respect and accept that my vision may not reflect yours. I also am in no place to continue to contest it (I'm not doing the work here!) - this was just a final attempt to convey it. I'm happy to agree to disagree and let everyone get back to work. Overall I'm thrilled and look forward to the projects progress.

jmacd commented 5 years ago

I think we can close this based on the discussion here, since these two issues appear to be very similar: https://github.com/open-telemetry/opentelemetry-go/issues/23#issuecomment-511989757