prometheus / client_model

Data model artifacts for Prometheus.
http://prometheus.io
Apache License 2.0
74 stars 73 forks source link

Problem converting to proto3: Go API doesn't indicate omitted fields #13

Closed juliusv closed 5 years ago

juliusv commented 7 years ago

As discussed in https://docs.google.com/document/d/1Qj_m-lbatySAU0vsVfAPG7mduN1-nevrpD85uQhi4sY/edit?usp=sharing, I started converting the existing metrics.proto to proto3. However, the Go code generator has a problem when using proto3: it doesn't represent optional fields as pointers anymore, but as values (https://github.com/golang/protobuf/blob/master/protoc-gen-go/generator/generator.go#L2074-L2084). So you can't see if something is a zero value or omitted. But we need to check whether the client-side timestamp was set: https://github.com/prometheus/common/blob/4402f4e5ea79ec15f3c574773b6a5198fbea215f/expfmt/decode.go#L204

I don't see any good course of action. Treating a zero timestamp as unset is suboptimal.

Does anyone have any ideas or experience with this?

beorn7 commented 7 years ago

This issue is essentially the same with proto2. If an optional field in proto2 is set to its default value, it can (and often will) be encoded as unset on the wire. It's equivalent as per the standard. Or in other words: If somebody really set the timestamp to 0, it could legally (and probably would once it went through the wire) show up as "nil" in the code you referenced.

I vaguely remembered we discussed the issue before and simply decided that you cannot set a timestamp that represents 1970-01-01 00:00:00.000 in Prometheus. (That's less bad than ignoring leap seconds, in case that helps. ;)

juliusv commented 7 years ago

Eek, I do remember discussing something like that, but forgot the details. So we're basically out of luck then, but we consider it ok because the problem existed before? Oh well...

juliusv commented 7 years ago

(I mean, at least there's apparently no way to fix it without breaking things)

brian-brazil commented 7 years ago

If somebody really set the timestamp to 0, it could legally (and probably would once it went through the wire) show up as "nil" in the code you referenced.

This is not the case for timestamp, as there is no default value set. In fact if a default was set this wouldn't be a pointer iirc, as it's not possible to have it unset.

This isn't really a problem for timestamps, but it is for values. We already have cases where we're e.g. exporting Summarys without _sum/_count and I think this is something we need to address.

I wasn't aware that the time series Prometheus ingesting were not always matching what the text format exposed, and I consider that a bug.

juliusv commented 7 years ago

From the proto2 docs at https://developers.google.com/protocol-buffers/docs/proto#optional:

"If the default value is not specified for an optional element, a type-specific default value is used instead: [...] For numeric types, the default value is zero."

So there is still a default value for the default value, so the timestamp issue exists?

It doesn't seem like we can do anything about that and the _sum / _count fields. I didn't know that there were summaries that didn't send those fields.

juliusv commented 7 years ago

Ah right, I guess those were const summaries which were proxied from some other system which didn't provide that data.

brian-brazil commented 7 years ago

Yeah, Dropwizard doesn't have them for example and we've a case in Go too iirc.

beorn7 commented 7 years ago

In the on-wire format of protobuf, a count in a summary that is 0 is indistinguishable from a count that is not set.

In the on-wire format of protobuf, a timestamp that is 0 is indistinguishable from a timestamp that is not set.

If we ever want to signal the "unset" state, we needed a separate field for that, or a special value that signals the unset state. The latter is effectively the case for timestamps ("0" means "unset"). The count of a summary is a new case as our external data model so far doesn't allow summaries without a count.

beorn7 commented 7 years ago

I wasn't aware that the time series Prometheus ingesting were not always matching what the text format exposed, and I consider that a bug.

Prometheus ingestion follows the external data model. What you suggest is a change of the data model and not a bug report.

brian-brazil commented 7 years ago

In the on-wire format of protobuf, a count in a summary that is 0 is indistinguishable from a count that is not set.

This is not true for proto 2, as we haven't set default = 0.

In the on-wire format of protobuf, a timestamp that is 0 is indistinguishable from a timestamp that is not set.

Ditto.

Prometheus ingestion follows the external data model. What you suggest is a change of the data model and not a bug report.

It's not a change of data model, it's a clarification of how we handle things that don't exactly match the data model. For example I already presumed we were ignoring unset fields, and there's nothing in our docs that contradicts that statement.

This is underspecified as-is. The fields are listed as "optional".

beorn7 commented 7 years ago

We learned the proto2 spec the hard way. If an optional value is set to its default value, it is encoded as "unset" on the wire. What I said above is true for proto2.

Our external data model is defined by metrics.proto. The text format is merely a later addition to represent it in a different form. Since there really is no notion of unset optional fields in protobuf (they are set to their default value by definition), our data model has no notion of unset fields.

brian-brazil commented 7 years ago

If an optional value is set to its default value, it is encoded as "unset" on the wire. What I said above is true for proto2.

This statement is true. However it's also not relevant, as we don't set any the default values in the proto.

Since there really is no notion of unset optional fields in protobuf (they are set to their default value by definition), our data model has no notion of unset fields.

You're confusing proto 2 and proto 3. By going to proto 3 we lose the ability to have unset fields, and that's a problem when you consider things like filtered queries as the client will return the correct result but then Prometheus would add unwanted incorrect timeseries.

beorn7 commented 7 years ago

I quote @juliusv from above:

From the proto2 docs at https://developers.google.com/protocol-buffers/docs/proto#optional: "If the default value is not specified for an optional element, a type-specific default value is used instead: [...] For numeric types, the default value is zero."

Thus, there is a default value for each optional value, even in proto2.

The conclusion stands: There is no notion of unset optional fields in proto2.

brian-brazil commented 7 years ago

Thus, there is a default value for each optional value, even in proto2.

This is correct - but default in the sense that if you choose ti use the Get API calls it'll return that default (a 0 value). That's not the same default as one optionally set in the config, which can be any value.

The conclusion stands: There is no notion of unset optional fields in proto2.

This is incorrect. If there is no default in the proto file, you can distinguish based on whether it's sent over the wire and at the API level (depending on language) based on whether the pointer is null or using the HasField method.

See https://developers.google.com/protocol-buffers/docs/reference/python-generated#singular-fields-proto2 for example.

beorn7 commented 7 years ago

If there is no default in the proto file, you can distinguish based on whether it's sent over the wire and at the API level (depending on language) based on whether the pointer is null or using the HasField method.

As said, we have learned the hard way that this is not true. There are compliant encoders that will just not send an optional field over the wire if it is set to its default value. It will appear as "unset" after decoding.

brian-brazil commented 7 years ago

You're confusing two different types of "default". An encoding that fails to send an optional field which fails to send a 0 for an optional field which has no default is not compliant, as that defeats the entire purpose of a proto2 optional.

In any case, we need to support these as optional fields. It would be a breaking change to do otherwise, as something that was representable before would no longer be representable. Thus we we can't do a straight switch to proto3 in 1.0.

Repeated fields that are specified to take a maximum of one value should do the trick.

beorn7 commented 7 years ago

In that case, the (Go) encoder we worked with would have been buggy. Could you point me to the location in the spec where that behavior is specified? My research back then resulted in the opposite (namely that an optional numeric field without explicit default value that is set to 0 can be encoded as unset).

juliusv commented 7 years ago

Indeed, I cannot find anything in the specification going either way. @beorn7's memory of this being a problem in the past is the only thing we have. Do you still remember enough to dig up something (PR / issue / ...) about that?

brian-brazil commented 7 years ago

It sounds indeed like Go might be buggy. The generated proto2 .pb files look to handle this correctly.

I'm working off memory here, primarily of the C++ implementation. Trawling through the docs this isn't well specified. I wonder if the docs got messed up due to proto3 making this aspect go away.

The only relevant line is:

When a message is parsed, if it does not contain an optional element, the corresponding field in the parsed object is set to the default value for that field.

Which says nothing about indicating via the API whether it was parsed. There's also nothing about not emitting values equal to defaults.

brian-brazil commented 7 years ago

I will note the only behaviour I've ever heard of being inconsistent across proto implementations was packed values not being handled by a PHP implementation.

juliusv commented 7 years ago

This issue on protobuf.js mentioning Google's Datastore API points in the direction that it's a bug to omit zero-valued optional fields on the wire: https://github.com/dcodeIO/protobuf.js/issues/179

beorn7 commented 7 years ago

I'm currently trying to retrace my research back then. I see a lot of discussions about this, but they are all kind of speculative. Many claim the difference between proto2 and proto3, but that never really touches the on-the-wire requirement (or non-requirement) of encoding an optional field that only has an implicit default value (i.e. numeric values with 0 as default). The spec seems to miss any explicit statement. (Which might imply why encoders get implemented with the optimization of kicking out optional values if they are set to their default value.) proto3 doesn't have the ambiguity, for obvious reasons, at least that's a change...

The meaning of optional was never really "optional" (in the English sense), so the reason everything is optional in metrics.proto is not that it would really be optional. That was always a problem with proto2 (that optional doesn't really mean "optional").

I'll double-check the behavior of Go, but I'm pretty sure that:

beorn7 commented 7 years ago

Status so far:

beorn7 commented 7 years ago

Found the original incident. It was all discussed on IRC, on 2016-09-16. It happened when the Pushgateway would serialize MetricFamily protobuf to disk and then re-read them. The problems were observed with the Type field (a protobuf enum) and with label values (protobuf strings). (So my first idea was that the default value optimization only affects those types, but with current Go code, I couldn't reproduce.)

As a consequence back then, we did https://github.com/prometheus/common/commit/17828d3823e17483780267633436da7fdd932771 and https://github.com/prometheus/pushgateway/pull/58 (the former would cause a crash for counters, and the latter would show "nil" for empty label values).

However, we never saw the crash with the direct access of the Type enum anywhere else, only in the PGW after persistence. Right now, I cannot imagine why protobufs should be serialized to disk in a different way than onto the wire, but it looks like it.

I'm investigating the latter to see if that explains it in some way. Might also be possible that the version of Go protobuf code PGW used back then had a different behavior from today's.

juliusv commented 7 years ago

Thanks for the investigation. I also couldn't reproduce the behavior with the current Go code. Maybe it was never legal behavior (we cannot be sure without a clear spec).

So I think that means we have to cancel the proto3 migration?

brian-brazil commented 7 years ago

As I suggested above we could switch from proto2 optional to a proto3 repeated, with it specified that we only accept one value in the repeated. That's backwards compatible, and gives us the right semantics.

beorn7 commented 7 years ago

The repeated way seems like the right way. Also, my suggestion for multiple samples in one proto message would go down the same path (make a number of fields repeated).

@juliusv this was my code suggestion, which would easily translate into the proto3 description. Perhaps hit two birds with one stone?

// ...
message Gauge {
  repeated double value = 1; // One per sample.
}

message Counter {
  repeated double value = 1; // One per sample.
}

message Quantile {
  optional double quantile = 1;
  repeated double value    = 2; // One per sample.
}

message Summary {
  repeated uint64   sample_count = 1; // One per sample.
  repeated double   sample_sum   = 2; // One per sample.
  repeated Quantile quantile     = 3;
}

message Untyped {
  repeated double value = 1;
}

message Histogram {
  repeated uint64 sample_count = 1; // One per sample.
  repeated double sample_sum   = 2; // One per sample.
  repeated Bucket bucket       = 3; // Ordered in increasing order of upper_bound, +Inf bucket is optional.
}

message Bucket {
  repeated uint64 cumulative_count = 1; // Cumulative in increasing order. One per sample.
  optional double upper_bound = 2;      // Inclusive.
}

message Metric {
  repeated LabelPair label        = 1;
  optional Gauge     gauge        = 2;
  optional Counter   counter      = 3;
  optional Summary   summary      = 4;
  optional Untyped   untyped      = 5;
  optional Histogram histogram    = 7;
  repeated int64     timestamp_ms = 6; // One per sample.
}
// ...
beorn7 commented 7 years ago

The plot thickens… I just managed to reproduce the default-value issue with the disk persistence of the PGW and current vendoring… will investigate more.

beorn7 commented 7 years ago

Phew… I finally got it. That was quite a ride.

tl;dr: The encoding behavior we could observe in the PGW do not come from protobuf wire format encoding, but from gob encoding — a weird chain of circumstances masked that and created this big and fat red herring. Despite the unclarity in the standard, all what @brian-brazil said is true, at least for the Go implementation.

Conclusion: We can safely assume that proto2 gives us the "unset-detection" behavior we would like to have. However, the Prometheus server is only using it for the timestamp right now and otherwise always uses the default value where an optional value is not set in a protobuf message. That part is underspecified in the exposition format (i.e. we don't specify if "unset" implies "use the implicit default value" or "don't use this valuet"), so I guess we can consider a change of behavior here a bug fix (rather than a breaking change). I suggest to do this all in one go with the move to proto3 and repeated fields, i.e. at the same time we change the code to use the proto3 API, we also change it to detect unset values and act accordingly (like not adding a ..._count series if no count is set). We can even do the multi-sample-in-one-proto-message in that same go.

The long version of the gob encoding issue will be described in a bug filed against the PGW (because the whole investigated exposed a bug in the encoding).

brian-brazil commented 7 years ago

That part is underspecified in the exposition format (i.e. we don't specify if "unset" implies "use the implicit default value" or "don't use this valuet"),

I think this might be implicitly specified due to "optional HELP and TYPE lines first.", as you currently get different behaviour for a summary missing a _count so the TYPE is not fully optional.

I suggest to do this all in one go with the move to proto3 and repeated fields, i.e. at the same time we change the code to use the proto3 API, we also change it to detect unset values and act accordingly

I'd propose we do it in a separate commit, so that we can distinguish bugs from the 2->3 change from the expected effects of this.

beorn7 commented 7 years ago

In case you are interested: Here the gob story: https://github.com/prometheus/pushgateway/issues/90

beorn7 commented 7 years ago

I'd propose we do it in a separate commit, so that we can distinguish bugs from the 2->3 change from the expected effects of this.

Do you suggest to first introduce "unset detection" everywhere and then proto3? Or the other way round? Both touches the same code. Doing it separately might just double the work and the opportunities to introduce bugs.

brian-brazil commented 7 years ago

I don't have a preference for the ordering, it's down to whoever does it really.

juliusv commented 7 years ago

I propose that I fix up my proto3 changes to do the repeated fix for the timestamp, and then we can see about fixing all the rest. I think I probably can't set the _count and _sum to be repeated yet, because that would basically require also fixing the bug around them (at least if we really don't want to fix it in one go).

snarlysodboxer commented 7 years ago

Hey guys, I'm adding exposure of metrics in the Protocol Buffer format to the Node JS client library referenced in your docs. I have it working with the metrics.proto you have in the proto3 branch, but now I realize I'm unsure if that version is ready to use.

Do I need to switch to using your proto2 metrics.proto from the master branch?

beorn7 commented 7 years ago

Yeah, this effort to move to proto3 got stalled. I think it would be great to get there, but there are quite a few loose ends to tie up, as you can see documented in this issue. But there are even more by now:

snarlysodboxer commented 7 years ago

Thanks! I'll switch to proto2 for now.

snarlysodboxer commented 7 years ago

@beorn7 Are you suggesting Prometheus 2.X might drop protobuf support permanently or just temporarily? Is there a place we can track this information?

beorn7 commented 7 years ago

@snarlysodboxer 2.x is still early alpha, so nothing conclusive can be said at the moment. There is busy development ongoing in the dev-2.0 branch, see https://github.com/prometheus/prometheus/tree/dev-2.0 . I'm not really involved, so I'll refrain from any guessing about the final state.

beorn7 commented 5 years ago

I think proto3 conversion of our traditional protobuf based format is not on the agenda for now, given that the Prometheus server doesn't even speak protobuf at all anymore (for scraping, that is).

IIUC OpenMetrics will specify an on-the-wire format similar to the Prometheus text format, but it will define its data model as protobuf. This might or might not use proto2 or proto3, and it will have to find its own notion of unset vs. default. In any case, if there is a future for protobuf in the Prometheus ecosystem, it will not be shaped in this repo but via the OpenMetrics standard.

Thus, I'm closing this as won't fix.