hypertrace / hypertrace-ingester

Streaming jobs for Hypertrace

Support zipkin format instead of (or along with) jaeger #95

Open jcchavezs opened 3 years ago

jcchavezs commented 3 years ago

Currently, debugging and reproducing issues in the pipeline is hard and problematic. If we need to understand why a span's data turned into certain values, the most basic thing to do is try to reproduce the error. At the moment we have no way to do that: there is no way to obtain the original span that was ingested, neither at the end of the pipeline nor in storage. This makes debugging almost impossible and brings much more complex proposals to the table, like enabling a debug mode in prod.

The current approach to overcome this is to deploy a Jaeger instance along with Hypertrace and download the data using the Jaeger download button. While this feels like a solution (an already cumbersome one), it isn't, because what you download isn't exactly what you ingested; the formats are different: Jaeger ingests the Zipkin format (like other Zipkin forks) or Thrift, and neither of them is what you download from the Jaeger UI (a JSON payload). To overcome this temporarily I created the tool https://github.com/jcchavezs/jaeger2zipkin, which converts a downloaded Jaeger trace into a Zipkin trace so it can be re-ingested, but that isn't 100% reliable because there are two transformations in the middle.

What would truly enable full debuggability is being able to download the same format we ingest. For that we would need to support ingesting Zipkin data (which is what most of our agents use for reporting) and also keep the raw payload in the messages along the Kafka pipeline. We don't even need to understand the Zipkin format (e.g. we don't need to serialize/deserialize the raw payload at every step in the pipeline); we just need to store it and serve it in the API so it can be downloaded and re-ingested.
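The pass-through idea above can be sketched as follows. This is a minimal illustration, not the actual Hypertrace code: the `RawSpan`/`enrich` names and fields are assumptions; the point is only that the original bytes ride along untouched while the structured fields are transformed.

```python
# Hypothetical sketch: carry the raw ingested payload opaquely alongside the
# internal span so every pipeline stage can forward it without parsing it.
from dataclasses import dataclass, field

@dataclass
class RawSpan:
    trace_id: str
    span_id: str
    attributes: dict = field(default_factory=dict)
    raw_payload: bytes = b""  # original Zipkin JSON bytes, never deserialized downstream

def enrich(span: RawSpan) -> RawSpan:
    # Stages transform the structured fields but pass raw_payload through untouched.
    span.attributes["enriched"] = True
    return span

original = b'[{"traceId":"abc","id":"1","name":"GET /users"}]'
span = RawSpan("abc", "1", raw_payload=original)
out = enrich(span)
assert out.raw_payload == original  # bytes survive the pipeline verbatim
```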

This, along with a download button in the UI, would allow us to ask users to send us the original data so we can debug the issues locally, which is what other DT solutions do.

Ping @kotharironak @tim-mwangi @JBAhire @rish691 @buchi-busireddy @sanjay-nagaraj

buchi-busireddy commented 3 years ago

@jcchavezs Doesn't including the original span as a blob in every span drastically increase the Kafka message size? Are you suggesting enabling that conditionally or always? I thought that if we finish the export feature you logged earlier, we would be able to get the original span back when the trace is exported. Isn't that so?

jcchavezs commented 3 years ago

Yes, it might roughly double the size of the payload. I'm not sure how big that is (it would be good to have numbers on the average message size in the platform), but at least the increase is deterministic.
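A back-of-envelope sketch of why the increase is deterministic: the attached blob is exactly the ingested payload, so the overhead is bounded by the payload's own size. The sizes below are illustrative, not measured from the platform.

```python
# Illustrative only: assume the internal encoding of a span is roughly the
# same size as the raw Zipkin JSON it came from.
raw_payload = b'{"traceId":"abc","id":"1","name":"GET /users","duration":1200}'
internal_span = raw_payload  # assumption: comparable size, not actual encoding

message_size = len(internal_span) + len(raw_payload)
overhead = len(raw_payload) / message_size
assert overhead == 0.5  # the blob can at most double the message size
```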

Regarding the export function: the whole point of having it is that we can access the original message. If we translate Jaeger -> enrichment -> Hypertrace -> Zipkin, it is somewhat pointless, because when debugging a problem we would be using corrupted data to replicate the bug.


buchi-busireddy commented 3 years ago

I definitely like the idea and see the necessity; I'm just thinking of different ways to implement it. Thinking out loud, another major option to consider is storing the raw spans in a KV store and accessing them for the export feature. By retaining only the most recent spans and capping the KV store, we can make sure we don't replicate the full data and, at the same time, don't increase the Kafka message size drastically. There could be other ideas too. I'd recommend a feature/design doc for this so that we can finalize the approach and implement it. What do you think?
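The capped KV store idea can be sketched like this. It is only an in-memory illustration of the retention policy (names and cap are hypothetical); a real deployment would use an external store with its own eviction/TTL mechanism.

```python
# Sketch: keep only the most recent raw spans, evicting the oldest once the
# entry cap is hit, so storage stays bounded regardless of ingest volume.
from collections import OrderedDict

class RecentSpanStore:
    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._store = OrderedDict()  # span_id -> raw payload bytes

    def put(self, span_id: str, raw_payload: bytes) -> None:
        self._store[span_id] = raw_payload
        self._store.move_to_end(span_id)  # mark as most recent
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the oldest entry

    def get(self, span_id: str):
        return self._store.get(span_id)

store = RecentSpanStore(max_entries=2)
store.put("a", b"span-a")
store.put("b", b"span-b")
store.put("c", b"span-c")  # cap reached: "a" is evicted
assert store.get("a") is None
assert store.get("c") == b"span-c"
```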

findingrish commented 3 years ago

So we ingest data in Zipkin format into the platform and convert it into RawSpan (our internal format) at entry.

> To overcome this temporarily I created the tool https://github.com/jcchavezs/jaeger2zipkin, which converts a downloaded Jaeger trace into a Zipkin trace so it can be re-ingested, but that isn't 100% reliable because there are two transformations in the middle.

What do we lose in this conversion? Exploring along these lines: can we add some metadata to the RawSpan message so that the Zipkin <-> RawSpan conversion becomes lossless? Then the download button in the UI would just need to invoke this converter.
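One way the metadata idea could work, sketched under assumptions (the field names and `_passthrough` key are illustrative, not the actual RawSpan schema): any Zipkin field the internal model does not understand is stashed in a pass-through dict, so converting back reproduces the original document exactly.

```python
# Sketch of a lossless Zipkin <-> internal round-trip via a passthrough field.
KNOWN_FIELDS = {"traceId", "id", "name"}  # fields the internal model maps

def zipkin_to_internal(zipkin_span: dict) -> dict:
    internal = {k: zipkin_span[k] for k in KNOWN_FIELDS if k in zipkin_span}
    # Stash everything we don't model so nothing is lost in conversion.
    internal["_passthrough"] = {
        k: v for k, v in zipkin_span.items() if k not in KNOWN_FIELDS
    }
    return internal

def internal_to_zipkin(internal: dict) -> dict:
    zipkin = {k: v for k, v in internal.items() if k != "_passthrough"}
    zipkin.update(internal["_passthrough"])
    return zipkin

original = {"traceId": "abc", "id": "1", "name": "GET /users",
            "localEndpoint": {"serviceName": "frontend"}, "duration": 1200}
assert internal_to_zipkin(zipkin_to_internal(original)) == original
```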

I can create a doc to further this discussion.