openzipkin / zipkin-gcp

Reporters and collectors for use in Google Cloud Platform
https://cloud.google.com/trace/docs/zipkin
Apache License 2.0

Better check() method for StackdriverSender #131

Closed meltsufin closed 5 years ago

meltsufin commented 5 years ago

Currently, the StackdriverSender.check method is just a placeholder.

  public CheckResult check() {
    return CheckResult.OK;
  }

Follow-up to https://github.com/spring-cloud/spring-cloud-gcp/issues/1752.

elefeint commented 5 years ago

I'll give this a try. Is there a way to assign issues to non-project-members?

codefromthecrypt commented 5 years ago

@elefeint no idea. instead I just made you and mike project members :)

@saturnism also has admin in case someone else needs assign power, and anyone on the core team does too, but not everyone watches this repo. You can ping on https://gitter.im/openzipkin/zipkin if something goes unnoticed.

elefeint commented 5 years ago

Even better. Thank you!

codefromthecrypt commented 5 years ago

I'm going to have to take this over because I'm literally losing time constantly and it is driving me bonkers. Thanks for the offers, folks.

codefromthecrypt commented 5 years ago

I'll start by scraping around and seeing if there are any other tools that do a pre-flight check to stackdriver.

codefromthecrypt commented 5 years ago

I'm tempted to check IAM for current roles, but that could give a false positive, as there are other reasons things fail even though routinely it is access. Maybe an empty batch will suffice; this is how we check some other connections.
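For reference, the empty-batch pattern mentioned here (used by some other Zipkin senders) boils down to: run a no-op send and translate any exception into a failed check. This is a hedged, generic sketch; `probe` is a stand-in for the sender's real RPC, not the actual API:

```java
import java.util.concurrent.Callable;

// Generic sketch of an "empty batch" health check: healthy iff a no-op send
// succeeds. The probe stands in for sending an empty span list over the
// real transport.
final class EmptyBatchCheck {
  static boolean check(Callable<Void> probe) {
    try {
      probe.call(); // e.g. send an empty batch of spans
      return true;
    } catch (Exception e) {
      return false; // any transport/auth failure marks the check unhealthy
    }
  }
}
```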

elefeint commented 5 years ago

The problem is, Stackdriver won't fail for an empty batch. There seems to be a gating empty-span check before the request reaches the real functionality. Sending a single span with a missing trace ID may be deterministic: if the API works, you get a message about a missing trace ID; but if the API does not work, it fails for another reason.

codefromthecrypt commented 5 years ago

@elefeint gotcha. what if we used the read api instead?

codefromthecrypt commented 5 years ago

or does that require less perms...

elefeint commented 5 years ago

It's completely fine for you to work on this; I was just trying to help because you had higher priority work.

I am testing Stackdriver behavior with the GCP client library (since it cheerfully fails when needed). When Stackdriver API is disabled, and I send a malformed request, I get the validation error message and not the "resource requested" error message. So malformed requests are not a good indicator of permissions/disabled API/other infrastructural issues.

Would it be terrible to send a real trace? It seems like the only way to confirm reliably.

Reading is a completely different API; that one seems to be HTTP only.

elefeint commented 5 years ago

ugh, "resource exhausted", not "resource requested"

codefromthecrypt commented 5 years ago

If we sent a real trace that would be OK, but the health check runs before any real sending. So one way is to use a read API to get a trace, then resend that one? Is that what you mean?

elefeint commented 5 years ago

And yeah, different permissions for read/write - https://cloud.google.com/trace/docs/iam

codefromthecrypt commented 5 years ago

OK, so one compromise could be to "catch" a trace: basically have it only do the empty span until then, which at least should catch endpoint problems. Then after we catch one, send that repeatedly?

elefeint commented 5 years ago

Yeah, so there is no trace context yet at the time the check happens. So this trace would have to be a disconnected one, with a randomly generated ID.

codefromthecrypt commented 5 years ago

If random is OK; I just don't want to interfere with queries. If a dummy trace is sent, could we just use a 1970 timestamp or something?

elefeint commented 5 years ago

Empty span will trigger a gating validation error first even if API is disabled. I haven't tested whether input validation or permissions will fail first.

Why not current timestamp? 1970 might look weird when a user is troubleshooting their app.

codefromthecrypt commented 5 years ago

If it is a dummy span, then we'd see it in the trace list, though. I think health checks usually have no side effects, which is why I was thinking about this.

catching a span at least wouldn't cause garbage.. that's my brain anyway..

elefeint commented 5 years ago

Yep, pseudo-random is fine: https://cloud.google.com/trace/docs/reference/#generating_trace_id_and_span_id
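Per that doc, a Stackdriver trace ID is a 32-character hexadecimal string (128 bits). A minimal sketch of generating one pseudo-randomly (the helper name is illustrative, not part of the sender):

```java
import java.util.concurrent.ThreadLocalRandom;

final class TraceIds {
  // Generates a 32-character lowercase-hex trace ID from 128 random bits.
  static String randomTraceId() {
    ThreadLocalRandom r = ThreadLocalRandom.current();
    return String.format("%016x%016x", r.nextLong(), r.nextLong());
  }
}
```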

elefeint commented 5 years ago

I'd prefer to avoid a fake trace entry, too. But how to do that while still catching all issues... What do you mean by "send it repeatedly"? Maybe I did not understand about catching a span.

codefromthecrypt commented 5 years ago

What I mean is that the health check is usually called on an interval, so anything we do has the side effect of repetition. It could be every second.

codefromthecrypt commented 5 years ago

Right now, for example, we know the health check is "dummied out". As soon as we enable this, whoever's health check pollers will all of a sudden invoke it. That's why I bring it up, basically.

codefromthecrypt commented 5 years ago

On the server side I mean the /health endpoint. On the client side (Sleuth) I don't think it is plumbed to anything by default, though.

elefeint commented 5 years ago

Oh, I did not realize. I thought it was only a startup check.

In principle, you could use a check for IAM permissions as you suggested in the beginning, combined with a call to services.get to find out whether the Stackdriver API is enabled. These two would be the most common configuration pitfalls.

elefeint commented 5 years ago

I don't know that I'd want to do that every second either, though.

codefromthecrypt commented 5 years ago

Is there any rate limit or billing implication you are aware of on the IAM api?

elefeint commented 5 years ago

Looked it up in my personal project, default quota for IAM Read requests per minute is 6000, but for service management API it's only 120 requests per minute.

The healthcheck could be opt-out, gated by a property, I suppose, for quota-conscious users.

codefromthecrypt commented 5 years ago

If there's a service-level failure we could also fail completely, e.g. crash and force a restart if Stackdriver is disabled or perms are missing.

Fail fast, in other words.


codefromthecrypt commented 5 years ago

I wouldn't do this in the Sender, as we don't want to crash real apps... but it could be legit in the server.


elefeint commented 5 years ago

Both the IAM and Service Management APIs are free.

I'm off to sleep; if there is any research I can do on the cloud side, throw questions in here and I'll pick them up in the morning.

codefromthecrypt commented 5 years ago

Thanks, and good night @elefeint. I will send some thoughts, and I appreciate you parachuting in.

saturnism commented 5 years ago

a thought - can we do a lazy health check? i.e., assume healthy. Upon the first real failure of sending a real span, turn it unhealthy?

codefromthecrypt commented 5 years ago

My current thinking is to put in a workaround while we sort this out: keep an expiring reference to a throwable caught during span writes, expiring every second. This way whoever calls check (server or client) doesn't cause additional load and gets some signal of failure with a human-readable message.
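A rough sketch of that workaround (class and method names are illustrative, not the real sender code): the write path records any throwable, and the check reports it until the recorded failure expires.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch: span writes record failures; the check reads them with
// no remote call. A failure older than EXPIRY_NANOS is treated as healthy
// again, so one bad send doesn't poison the check forever.
final class LastErrorCheck {
  static final long EXPIRY_NANOS = 1000000000L; // one second

  private final AtomicReference<Throwable> lastError =
      new AtomicReference<Throwable>(); // no diamond, to stay 1.6-friendly
  private volatile long lastErrorAtNanos;

  void onSendError(Throwable t) { // called from the span write path
    lastErrorAtNanos = System.nanoTime();
    lastError.set(t);
  }

  Throwable currentError() { // called by check(); null means healthy
    Throwable t = lastError.get();
    if (t == null) return null;
    if (System.nanoTime() - lastErrorAtNanos > EXPIRY_NANOS) {
      lastError.compareAndSet(t, null); // expire the stale failure
      return null;
    }
    return t; // check() can surface t.getMessage() to the poller
  }
}
```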

codefromthecrypt commented 5 years ago

@elefeint I've unassigned myself, as clearly you are on this. I sort of misunderstood the crickets as meaning nothing was going to happen. Sorry I jumped to conclusions.

I have some unsolicited 2p about service design, because it won't be odd for things like client-side LBs to want to do simple health checks (e.g. we do polling client-side checks in AWS).

First is a standalone health check API, which can hold all the tribal knowledge of things that could go wrong. If this API is designed simply enough, even if it is gRPC, you could invoke it with curl, since sending an empty message to a gRPC endpoint via curl surprisingly works.

Second, consider a long-poll health API, even though more clients probably do polling. This could be gated on a header or some other condition. LINE recently implemented this; you can see a side effect in Armeria itself: https://github.com/line/armeria/releases/tag/armeria-0.90.0

I don't mean to tell folks how to design services; it's just that something like this could cut triage time regardless of whose "fault" it is, and reduce reliance on metrics/logging config. Ideally it lets customers self-service a check that everything is green without incurring billable traffic, causing side effects, or needing a sequence of commands.

elefeint commented 5 years ago

All source code needs to be Java 1.6 compatible? So no lambdas or Clock interface? // realizing how spoiled I've been with Java 8 compilation before

And I agree completely with the need for a uniform health check API. Some of these services grew... very organically.

codefromthecrypt commented 5 years ago

@elefeint so the answer about Java 1.6 is like this..

If you have users running JRE 1.6, then yes :) Our policy is not to escalate the compile requirements. Note this usually has more to do with APIs than syntax: retrolambda can rewrite syntax, and I think nanoTime or similar is all that's needed to time things out. The server can rely on 1.8.

Here's a note in our Brave gRPC readme reminding folks why we didn't escalate the compile level to 1.7, which I think is where gRPC is currently at. We probably need a note in the sender's pom about the policy if it's not there already. If Stackdriver doesn't work before 1.16, or there are no users using that, it could be fine. The problem is, I don't think we've been putting gRPC versions into analytics.

    <!-- grpc < 1.15 supports java 6 https://github.com/grpc/grpc-java/issues/3961 -->
    <main.java.version>1.6</main.java.version>