Closed by meltsufin 5 years ago
I'll give this a try. Is there a way to assign issues to non-project-members?
@elefeint no idea. instead I just made you and mike project members :)
@saturnism has admin also so in case someone else needs assign power, also anyone in the core team do, too, but not everyone watches this repo. you can ping here if something goes unnoticed https://gitter.im/openzipkin/zipkin
Even better. Thank you!
I'm going to have to take this over because I'm literally losing time constantly and it is driving me bonkers. Thanks for the offers, folks.
I'll start by scraping around and seeing if there are any other tools that do a pre-flight check to stackdriver.
I'm tempted to check IAM for current roles, but that would be a false positive, as there are other reasons things fail even though routinely it is access. Maybe an empty batch will suffice; this is how we check some other connections.
The problem is, Stackdriver won't fail for an empty batch. There seems to be a gating empty span check before it gets to the functionality. Sending a single span with a missing trace ID may be deterministic -- if the API works, you get a message about a missing trace ID. But if the API does not work, it fails for another reason.
@elefeint gotcha. what if we used the read api instead?
or does that require less perms...
It's completely fine for you to work on this; I was just trying to help because you had higher priority work.
I am testing Stackdriver behavior with the GCP client library (since it cheerfully fails when needed). When Stackdriver API is disabled, and I send a malformed request, I get the validation error message and not the "resource requested" error message. So malformed requests are not a good indicator of permissions/disabled API/other infrastructural issues.
Would it be terrible to send a real trace? It seems like the only way to confirm reliably.
Reading is a completely different API; that one seems to be HTTP only.
ugh, "resource exhausted", not "resource requested"
if we sent a real trace that would be ok but the health check predates sending. so one way is to use a read api to get a trace, then resend that one? is that what you mean?
And yeah, different permissions for read/write - https://cloud.google.com/trace/docs/iam
ok so one compromise could be to "catch" a trace: basically have it only do an empty span until then, which at least should catch endpoint problems. Then after we catch one, send that repeatedly?
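A minimal sketch of that "catch a trace" compromise, with hypothetical names (this is not the actual StackdriverSender code): the sender hands the probe the first span it successfully writes, and until then the probe falls back to an empty payload.

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical "caught span" health probe. Until a real span has been seen,
 * the probe payload is an empty batch; afterwards it replays the first
 * captured span bytes, avoiding fabricated trace entries.
 */
public class CaughtSpanProbe {
  private final AtomicReference<byte[]> caught = new AtomicReference<byte[]>();

  /** Called from the real send path after a successful write. Keeps the first span only. */
  public void onSpanSent(byte[] encodedSpan) {
    caught.compareAndSet(null, encodedSpan);
  }

  /** Payload the health check would send: empty batch until a span is caught. */
  public byte[] probePayload() {
    byte[] span = caught.get();
    return span != null ? span : new byte[0];
  }
}
```

Whether replaying the same span repeatedly is safe server-side (dedup, retention) would still need checking.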
Yeah, so there is no trace context yet at the time the check happens. So this trace would have to be a disconnected one, with a randomly generated ID.
if random is ok, I just don't want to interfere with query.. if a dummy trace is sent.. could we just use a 1970 timestamp or something?
Empty span will trigger a gating validation error first even if API is disabled. I haven't tested whether input validation or permissions will fail first.
Why not current timestamp? 1970 might look weird when a user is troubleshooting their app.
if it is a dummy span, then we'd see that in the trace list though. I think usually health checks have no side-effect which is why I was thinking about this.
catching a span at least wouldn't cause garbage.. that's my brain anyway..
Yep, pseudo-random is fine: https://cloud.google.com/trace/docs/reference/#generating_trace_id_and_span_id
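Following the linked doc, generating a pseudo-random 128-bit trace ID is straightforward; a sketch in Java (illustrative, not the project's actual implementation):

```java
import java.security.SecureRandom;

public class TraceIds {
  private static final SecureRandom RANDOM = new SecureRandom();

  /** Returns a pseudo-random 128-bit trace ID as 32 lower-hex characters. */
  public static String newTraceId() {
    // %016x zero-pads each long to 16 hex chars, so the result is always 32 chars.
    return String.format("%016x%016x", RANDOM.nextLong(), RANDOM.nextLong());
  }
}
```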
I'd prefer to avoid a fake trace entry, too. But how to do that while still catching all issues... What do you mean by "send it repeatedly"? Maybe I did not understand about catching a span.
what I mean is that health-check is usually called on interval, so anything we do has the side-effect of repetition. could be every second
right now, for example, we know the health check is "dummied out". As soon as we enable this, whoever has health check pollers will all of a sudden invoke this. That's why I bring it up, basically.
on the server side I mean the /health endpoint. On the client side (Sleuth) I don't think it is plumbed to anything by default, though.
Oh, I did not realize. I thought it was only a startup check.
In principle, you could use a check for IAM permission as you suggested in the beginning, combined with a call to services.get to find out whether the Stackdriver API is enabled. These two would be the most common configuration pitfalls.
I don't know that I'd want to do that every second either, though.
Is there any rate limit or billing implication you are aware of on the IAM api?
Looked it up in my personal project, default quota for IAM Read requests per minute is 6000, but for service management API it's only 120 requests per minute.
The healthcheck could be opt-out, gated by a property, I suppose, for quota-conscious users.
if there's a service-level failure we could also fail completely, e.g., crash and force a restart if you have disabled Stackdriver or don't have perms.
fail fast, in other words.
I wouldn't do this on the "Sender" as we don't want to crash real apps... but it could be legit in the server.
Both IAM and Service Management APIs are free.
I'm off to sleep; if there is any research I can do on the cloud side, throw questions in here and I'll pick them up in the morning.
thanks and good night @elefeint I will send some thoughts and appreciate you parachuting in
a thought - can we do a lazy health check? i.e., assume healthy. Upon the first real failure of sending a real span, turn it unhealthy?
my current thinking is to put in a workaround while we sort it out: keep an expiring reference to a throwable caught during span writes, expiring every second. This way the one calling check (server or client) doesn't cause additional load and gets some signal of failure with a human-readable message.
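A minimal sketch of that expiring-reference workaround, with hypothetical names (not the actual zipkin-gcp code): span writes record their last failure, and the health check reads it without issuing any RPC.

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical passive health check: the send path records failures, and
 * check() reports the last one until it expires. No network I/O on check.
 */
public class LastFailureCheck {
  private static final long EXPIRY_NANOS = 1_000_000_000L; // one second

  private final AtomicReference<Throwable> lastFailure = new AtomicReference<Throwable>();
  private volatile long failureAtNanos;

  /** Called from the span-writing path when a send fails. */
  public void onSendError(Throwable t) {
    failureAtNanos = System.nanoTime(); // written before the volatile set below
    lastFailure.set(t);
  }

  /** Called by the /health endpoint. Never performs network I/O. */
  public String check() {
    Throwable t = lastFailure.get();
    if (t == null) return "OK";
    if (System.nanoTime() - failureAtNanos > EXPIRY_NANOS) {
      lastFailure.compareAndSet(t, null); // expire the stale failure
      return "OK";
    }
    return "DOWN: " + t.getMessage();
  }
}
```

The expiry means a transient blip self-heals within a second, while sustained failures keep surfacing a human-readable message to pollers at no extra backend load.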
@elefeint I've unassigned myself as clearly you are on this. I sort of misunderstood the crickets as meaning nothing was going to happen. Sorry I jumped to conclusions.
I have some unsolicited 2p about service design, because it won't be odd for things like client-side LBs to want to do simple health checks (ex: we do polling client-side checks in AWS).
First is a standalone health check API, which can hold all the tribal knowledge of things that could go wrong. If this API is designed simply enough, even if it is gRPC, you could invoke it with curl, since sending an empty message to a gRPC endpoint with curl surprisingly works.
Secondly, consider a long-poll health api even if probably more clients do polling. this could be done based on a header or otherwise condition. LINE recently implemented this, you can see a side-effect in armeria itself. https://github.com/line/armeria/releases/tag/armeria-0.90.0
I don't mean to tell folks how to design services; just that something like this could cut triage time regardless of whose "fault" it is, and reduce reliance on metrics/logging config, etc. Ideally it allows customers to self-service check all-green without incurring billable traffic, causing side effects, or needing a sequence of commands.
All source code needs to be Java 1.6 compatible? So no lambdas or Clock interface? // realizing how spoiled I've been with Java 8 compilation before
And I agree completely with the need for a uniform health check API. Some of these services grew... very organically.
@elefeint so the answer about Java 1.6 is like this..
If you have users running JRE 1.6, then yes :) Our policy is to not escalate the compile needs. Note this usually has more to do with APIs than syntax: retrolambda can redo syntax, and I think nanoTime or similar is all that's needed to time out things. The server can rely on 1.8.
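For illustration, a deadline check that stays within Java 6 source compatibility is enough for timeouts: System.nanoTime() instead of java.time.Clock, no lambdas, and no underscore numeric literals (those are Java 7+). Names here are hypothetical, not the sender's actual code.

```java
/**
 * Java 6-compatible timeout sketch. nanoTime is monotonic, so this is
 * unaffected by wall-clock skew, unlike System.currentTimeMillis().
 */
public class Deadline {
  private final long deadlineNanos;

  public Deadline(long timeoutMillis) {
    this.deadlineNanos = System.nanoTime() + timeoutMillis * 1000000L;
  }

  /** True once the timeout has elapsed. */
  public boolean expired() {
    return System.nanoTime() - deadlineNanos >= 0;
  }
}
```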
Here's a note in our Brave gRPC readme reminding folks why we didn't escalate the compile level to 1.7, which I think is where gRPC is currently at. We probably need a note in the sender's pom about the policy if it's not there. If Stackdriver doesn't work before 1.16, or there are no users using that, it could be fine. Problem is I don't think we've been putting gRPC versions into analytics.
<!-- grpc < 1.15 supports java 6 https://github.com/grpc/grpc-java/issues/3961 -->
<main.java.version>1.6</main.java.version>
Currently, the StackdriverSender.check method is just a placeholder. Follow-up to https://github.com/spring-cloud/spring-cloud-gcp/issues/1752.