openzipkin / zipkin

Zipkin is a distributed tracing system
https://zipkin.io/
Apache License 2.0

Zipkin v3: the more maintenance friendly version #3605

Closed codefromthecrypt closed 9 months ago

codefromthecrypt commented 10 months ago

Feature

Hi, back again for a while. My company Tetrate is sponsoring some of my time to help restore zipkin to a maintainable state, notably in a way that doesn't break schemas and has more straightforward paths to CVE updates and whatnot.

This primarily includes the following, which would be addressed in separate issues or PRs, and ticked off when done.

Rationale

As years have passed, most, possibly all, of the core maintainers do not run zipkin sites anymore. This places people in a bad position of merging change out of duty, not need. Certain key areas have had neglect even when community members raise PRs on them. We need the maintainer team to be made up of site owners again, and removing hurdles is, I would conjecture, a pragmatic way to get there.

Base layer to not use alpine

While I don't have a specific base layer to suggest, I think that in the time since we chose alpine, many more choices of base layer have appeared that are small and support java better than Alpine does. By changing base layer, it is true we may increase the base image and break emotional thresholds like 30, 100MB all in. This could result in some pointing like.. wow look how big java is etc. So, we have to be grown up and be ok with that kind of concern. The conjectured value is that by switching, key aspects of our base layer including SSL and Java setup are more commoditized. Those who use it will have an easier time with CVE maintenance because we don't run into issues that need to be resolved sometimes by Alpine themselves. This tunes for a new set of maintainers, site owners who maybe never had a concern like this in the first place.

Elasticsearch driver change

Many issues have accumulated around Elasticsearch support, and while our custom driver (directly integrated with armeria) is more performant and has fewer dependencies, it has proven to have a couple of issues. One is that people have less experience with how to update it. Another is that reviewers have opted out of merging fixes for fear of breaking things. That changes are integration tested in CI isn't the point. In general, there are both technical and cultural reasons this is no longer a good match. A pragmatic way out is to accept the cons of Elasticsearch's SDK, including any bloat, compatibility reduction or performance reduction it causes. The conjecture is that if we use the normal driver, site owners who are not yet maintainers can become maintainers, as well as ask ES experts for advice in directly applicable ways. In other words, we become less reliant on historical people like @anuraaga @minwoox and myself for what seems to be a very common concern of ES upgrade maintenance.

Example Scenario

An expected result is that the zipkin v3 release has compatibility with the same schemas as v2 had, limited to any constraints added by the default Elasticsearch Client. The docker images, which are optional as some layer their own, are likely an ubuntu slim distribution, which is in all likelihood larger. The exec jars are both larger due to the ES client dependency tree. However, all the artifacts can use latest Elasticsearch version backends. Zipkin v3 is effectively centered on this change of ES support, and in order to do that we need to rewrite that component. Everything else is the same.

Prior Art

Zipkin as a skywalking extension

Most notably, this is an alternative to the V3 informally introduced in this PR which was to switch zipkin to be effectively a skywalking extension. In hindsight, that required more discussion and stake with end users vs historical admins. Notably, it breaks all the schemas and would put patch duty requirement on skywalking, which is unlike norms of zipkin which has a history of limited to no dependence on other projects except notably armeria.

In any case, this can restart in the contrib org like kafka if desired, but I would caution about making core changes here that only serve this goal. At least until zipkin is stable again.

Effectively, this issue re-owns version 3 for continuity and gives time for the community to refocus on compatibility and potentially a new set of site owning committers. Later, that team can decide if working with ES was a goal vs changing to zipkin being a skywalking dependent ecosystem. Regardless, we thank the skywalking team for interesting alternatives to the maintenance problem and hope to stay friends.

Copying of folks

Notably pinging folks who may not have been active recently, but should know about this idea. I will in parallel start re-acclimating with Java etc, to hit the ground running next Monday. But yeah, if there's a swell of folks saying please don't do this stuff, I'll trash any WIP. I have timeboxed the next two weeks to sort this out.

Docker image

Our base image was originally created by @abesto and then @anuraaga and I had a hand in its recent incarnation. @llinder though deserves mountains of credit for keeping it up to date these last years. This plan swaps it out for something lean, but ubuntu not alpine.

Elasticsearch

@anuraaga @minwoox and I worked at length on the current ES client which used to be okhttp. @zeagord also had a hand in this, and thanks @llinder for recent maintenance. @xeraa gave a lot of insight over the years as well. This plan goes to the normal Java SDK and ends the custom one, while retaining all the integration tests that we can pass with it (likely all except things sensitive to old ES versions)

V3 is for Skywalking

@jcchavezs championed the initiative led by skywalking team (@mrproliu @wu-sheng and I think @hanahmily) to make zipkin a skywalking extension (core server uses skywalking, and while there are schema breaks, manifest the same APIs and lens UI). @basvanbeek @shakuzen and @anuraaga possibly @jeqo were notified about this work, and may have strong feelings about an alternative. I want to ping them directly rather than walk around this topic.

Historical admins

This also dances on the topic of historical admins of zipkin no longer being site owners. Zipkin's culture was defined by its majority being site owners, at one point only me being a non-site owner. While I will raise another issue about this dilemma, I'll tag the list @openzipkin/core meanwhile, so folks can start thinking about this. In the issue I raise, it will really be about trying to harness non-vendor site owners again, so that stakeful decisions lead the way as they once did.

codefromthecrypt commented 10 months ago

TL;DR; I was wrong to blame alpine for maintenance issues. We have a lack of maintenance issue that's likely not made easier with alpine. We also have a lack of transparency on CVE prioritization (no trivy), and maybe need better guidance on alternatives (roll your own image which you control CVEs on) and how to get help (look at the normal docker-alpine image). If no one has focus for these tasks, they become sporadic and CVE issues are only one type of problem inherent in that situation. In other words, whether we keep or drop alpine, most problems are the same. We need to open an issue to get stakeful maintainers and routine releases again.


In this issue description, I used the word conjecture when discussing two problem areas. I conjectured that both were a root cause of maintenance issues, but I didn't know that to be the case. I'm beginning to believe lack of maintenance is a vicious cycle, as people can easily become unfamiliar with the parts. If releases were routine and always updated to latest everything.. and change merged quickly as it once was, things wouldn't be pressing as they are now. Basically, it goes right back to site-owner involvement or routine dedicated curation, even if only once a month. Blaming things on alpine distracts from this real root issue.

I'll discuss the alpine topic to illustrate this. Right now, this image is not re-cut when alpine versions are. Instead, it is done sporadically, mostly thanks to duty of @llinder. If we used a different image and didn't update frequently, we would also have CVEs bubble up. CVEs aren't about alpine, though there's certainly a case about lack of familiarity both with the image and also with how to know if a workaround is in progress.

Case in point https://github.com/openzipkin/docker-alpine/pull/34 about CVE-2023-2975. This was not transparently detected, rather by a trivy setup internal to my company. That's problem one: we should be transparent about things like this, and the whole thing is even more important as my company isn't monitoring zipkin anymore! So, there really should be trivy setup and if not, it doesn't matter what base layer you use, there's a chance of CVE going unnoticed.

Now, onto the specific CVE, which I can tell was quite annoying. I'll assume we had to fix it, but in truth that's questionable to do out of band, as we intentionally keep our java executable straightforward due to prior work with Tyro bank. A lot of places flat out will not depend on an image outside their control for reasons exactly like this.. anyway back to the point. In this case the issue was trying to fix something between alpine versions, which is why it was annoying. Whoever felt this was urgent would be amiss without the following info.. alpine has a docker-alpine image. So, they tracked the same issue and resolved it in a formal alpine release 5 days after we made a workaround! I didn't look, but it is likely our zipkin image wasn't redone until roughly then anyway. In other words, I think with better docs on the base layer about where to look for CVE help (upstream) and setting up trivy (to know what problems are) things would be better. However, this only matters if someone routinely turns cranks and can work with sites to prioritize CVEs. In other words, a prioritization could have eliminated some urgency and annoyance. This sort of devops leadership is a job that needs to be filled, and if it results in routine releases, CVEs will be back to non-issues as they once were: every time we did a release, and it was frequent, all deps of all kinds were updated, or it was documented why not.

codefromthecrypt commented 9 months ago

I'm closing this out because over the last three weeks I can say for sure the number one issue in zipkin is not a technical one. There's no single image or library to blame for what was unintentional neglect by committers. I found pull requests not progressed in three years, and the attempt to move to ES 8 was almost complete, it just didn't hook up the test image. Armeria wasn't the issue, it was lack of sustained attention from maintainers, from what is likely a lack of stake in things that are difficult or annoying to complete.

TL;DR; There's no reason to chase down and undo the work we did to get small docker images and performant elasticsearch. If there is technical budget for time, spend it more wisely probably on cassandra 5 storage attached indices or similar. What would be better than that would be connecting to sites and seeing who can help move zipkin forward in a sustainable way, doing as we have done in keeping things minimal, coherent, current and tested.