open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0

Semantic conventions for batch jobs #1347

Open mateuszrzeszutek opened 3 years ago

mateuszrzeszutek commented 3 years ago

What are you trying to achieve?

I want to introduce some semantic conventions for batch jobs, since there are currently no conventions around that. I'm mostly interested in instrumenting Spring Batch applications. There is already a Java spec, JSR-352, that describes a batch job API, and I was thinking of basing the trace spec on it: the concepts of job/step/chunk seem generic and language-agnostic enough (and there doesn't seem to be any other batch job specification). Before diving into details, is there a place for this in the trace semantic conventions?

fbogsany commented 3 years ago

Good timing: we discussed this in the Ruby SIG meeting earlier this week. Background job queues are common in Ruby web applications, with the dominant implementations being Resque and Sidekiq. So far, we've instrumented those systems using the messaging semantic conventions, but they're really not a great fit for background/batch jobs. See https://github.com/open-telemetry/opentelemetry-ruby/pull/547#issuecomment-758889459

Oberon00 commented 3 years ago

Wouldn't this be a case where you just use an INTERNAL span without any attributes? Not every span needs to conform to a semantic convention. What information do you want to have on batch job spans?

fbogsany commented 3 years ago

the concepts of job/step/chunk

^ this bit seems useful.

For Ruby batch job systems, relative to message systems:

  1. "receiver" spans are not particularly useful
  2. The job class name is usually much more interesting as part of the span name than the queue name (e.g. MyJob enqueue vs default send)
  3. The ... enqueue suffix is more in keeping with the domain language than ... send
  4. destination_kind is likely always queue, so probably isn't an interesting thing to specify (and certainly shouldn't be required).

Oberon00 commented 3 years ago

For the "job class name": There are semantic conventions for code locations, see https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/span-general.md#source-code-attributes.

fbogsany commented 3 years ago

For the "job class name": There are semantic conventions for code locations, see https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/span-general.md#source-code-attributes.

That's useful, but doesn't meet the expectations of users re: span names.

mateuszrzeszutek commented 3 years ago

Good timing: we discussed this in the Ruby SIG meeting earlier this week.

Nice! I'll take a look at sidekiq & your instrumentation and try to extract the common parts. At first glance it looks like the job span would be the only common thing, but maybe there's more. And there's the whole queue logic that's not there in Spring Batch/JSR-352.

Wouldn't this be a case where you just use an INTERNAL span without any attributes? Not every span needs to conform to a semantic convention. What information do you want to have on batch job spans?

True, JSR-352 does not expose much information that we could store as attributes, but that does not mean there is none: there's the exit status of a job/step (an arbitrary string) and the job/step execution id (jid in sidekiq?). And, at least for JSR-352/Spring Batch, job steps expose several metrics (read count, write count, ...) that could be used. And probably the most important piece of information is the span name.

mateuszrzeszutek commented 3 years ago

This is what I had in mind for my use case (Spring Batch):

  1. JobLauncher/JobOperator/enqueue span that marks the (possibly) asynchronous start of a batch job processing. Name: Start Job <batch.job.name> Attributes:
    • batch.job.name: the name of the job (Spring Batch), or the job class name (resque/sidekiq), or the task name (celery);
    • batch.job.id: the job execution id (Spring Batch), job['jid'] in case of sidekiq.
  2. Job span that wraps the whole processing of a batch job. Name: Job <batch.job.name> Attributes:
    • batch.job.name;
    • batch.job.id;
    • batch.job.exit_status: a plain string containing the exit status of a job. JSR-352/Spring Batch jobs (and steps) can return an arbitrary user-defined string as the exit status (e.g. error message saying why the job has failed). Not sure how this translates to Ruby/Python frameworks.
  3. Step span that wraps the execution of a step. In JSR-352, a job consists of a series of steps: a step "encapsulates an independent, sequential phase of a batch job". In the simplest possible case, a job that only does one thing has only one step. Name: Job <batch.job.name>.<batch.step.name> Attributes:
    • batch.step.name: the name of the step;
    • batch.step.id: the step execution id;
    • batch.step.exit_status: a plain string containing the exit status of a step. The batch job may take action depending on the result of a step, e.g. send an email and stop further processing in case of failure.
  4. Chunk span. In JSR-352, steps that process numerous items (e.g. read a CSV file and write records into DB) may be split into chunks, which are pretty much equivalent to database transactions. For example, a step that writes thousands of records may be configured to commit every 50 items. There are no attributes that can be set on this span, but an exception can be recorded here.
  5. Item read, process and write spans. This is the lowest level we want to instrument. I think that this image and pseudocode fragment describe what happens here best. Again, there are no attributes here, but it's worth having spans on the item level because of the exception visibility and better trace structure.
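
To make the naming scheme above concrete, here is a minimal sketch of how the first three spans could be named and attributed. It does not use any real tracing API: `SpanSketch`, `start_job_span`, `job_span` and `step_span` are hypothetical helpers invented for illustration, and treating `exit_status` as optional is an assumption. Only the span name patterns and the `batch.*` attribute keys come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SpanSketch:
    """Hypothetical stand-in for a tracer span: just a name and attributes."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


def start_job_span(job_name: str, job_id: str) -> SpanSketch:
    # Span 1: marks the (possibly asynchronous) start/enqueue of a job.
    return SpanSketch(
        name=f"Start Job {job_name}",
        attributes={"batch.job.name": job_name, "batch.job.id": job_id},
    )


def job_span(job_name: str, job_id: str,
             exit_status: Optional[str] = None) -> SpanSketch:
    # Span 2: wraps the whole processing of a batch job.
    attributes = {"batch.job.name": job_name, "batch.job.id": job_id}
    if exit_status is not None:  # assumption: only set once the job has finished
        attributes["batch.job.exit_status"] = exit_status
    return SpanSketch(name=f"Job {job_name}", attributes=attributes)


def step_span(job_name: str, step_name: str, step_id: str,
              exit_status: Optional[str] = None) -> SpanSketch:
    # Span 3: wraps the execution of one step of a job.
    attributes = {"batch.step.name": step_name, "batch.step.id": step_id}
    if exit_status is not None:
        attributes["batch.step.exit_status"] = exit_status
    return SpanSketch(name=f"Job {job_name}.{step_name}", attributes=attributes)


if __name__ == "__main__":
    # Made-up job and step names, purely for illustration.
    for span in (
        start_job_span("importUsers", "42"),
        job_span("importUsers", "42", exit_status="COMPLETED"),
        step_span("importUsers", "loadCsv", "7", exit_status="COMPLETED"),
    ):
        print(span.name, span.attributes)
```

In a real instrumentation these helpers would be replaced by calls to the tracer of the language's OpenTelemetry SDK; the sketch only pins down which name and which attributes each span would carry.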

@fbogsany I believe that the first two spans that I've briefly described here match your use case with both Ruby libs that you've mentioned. I'm not sure about the other three, sidekiq/resque do not seem to have this sort of rigid job structure that spring batch has.

fbogsany commented 3 years ago

I'm not sure about the other three, sidekiq/resque do not seem to have this sort of rigid job structure that spring batch has.

They don't have the Step span, but at Shopify, for example, we have a higher-level job execution framework that provides an equivalent of "chunks", so the Chunk span is relevant there. I'm not sure about the Item read, process and write spans.

weyert commented 3 years ago

Yeah, it doesn't look like Hangfire (.NET) [0] and Bree (Node.js) [1] have the same level of structure as Spring Batch (which I have never used); they look more simplified.

[0] https://docs.hangfire.io/en/latest/background-processing/processing-background-jobs.html
[1] https://jobscheduler.net/