Additional logging available to user of Cloud2 to show when a write has been rejected

8none1 commented 3 years ago

PR:

Subsequent issues:

https://github.com/influxdata/idpe/issues/11644
- Adds error field.

Engineering contacts: @rogpeppe , @8none1

Introduction.

Sometimes a user tries to write data into Cloud 2 which, although accepted by Gateway, is ultimately rejected and not written to the database. This can happen for a number of reasons, but typically it is because of mismatched data types, for example trying to write a string to an int field.

Currently the user would receive a 202 HTTP code to say that their write had been accepted. This could be interpreted as "everything is fine". Then, when they come to read that data, they find it is missing from their results.

This new feature logs "rejected writes" to the _monitoring bucket for that org_id.

This is an example of a line which would be written to the bucket:

rejected_points,bucket=01234f6701e34dd7,field=somefield,gotType=Float,measurement=somemeasurement,reason=type\ conflict\ in\ batch\ write,wantType=Integer count=1i 1627906197091972750

Information logged to _monitoring bucket about dropped points

When a point is dropped for some reason (for example because of a type clash with a previously written point that has the same series), the client writing the point does not get immediate feedback, because the infrastructure inside InfluxData does not necessarily know of problems when the point is first written.

Previously, points like this were just dropped silently, but this is about to change (or has changed, depending on when you are reading this). Now, whenever a point is dropped an entry is added to the organization's _monitoring bucket with some information about the point and why it was dropped.

The entries that are added will have the rejected_points measurement. Here's an example:

rejected_points,bucket=01234f6701e34dd7,field=somefield,gotType=Float,measurement=somemeasurement,reason=type\ conflict\ in\ batch\ write,wantType=Integer count=1i 1627906197091972750

Note that the field value (count) will always be 1 - all the information of interest is in the tags. The bucket and reason tags will always be present; other tags depend on the error in question.
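To make the layout of such a line concrete, here is a minimal Python sketch that splits a rejected_points line into its measurement, tags, field and timestamp. The helper names (parse_rejected_point, _split_unescaped, _unescape) are hypothetical, and the sketch assumes the simple shape shown above: backslash-escaped spaces, commas and equals signs in tags, a single integer field, and no quoted string field values. It is not a general line-protocol parser.

```python
import re

# The example line from above, as a raw string so the backslashes survive.
EXAMPLE_LINE = (
    r"rejected_points,bucket=01234f6701e34dd7,field=somefield,gotType=Float,"
    r"measurement=somemeasurement,reason=type\ conflict\ in\ batch\ write,"
    r"wantType=Integer count=1i 1627906197091972750"
)


def _split_unescaped(text, sep):
    """Split text on sep, skipping separators that are backslash-escaped."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape pair together
            i += 2
        elif ch == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def _unescape(text):
    """Remove the backslash from escaped commas, equals signs and spaces."""
    return re.sub(r"\\([,= ])", r"\1", text)


def parse_rejected_point(line):
    """Parse one rejected_points line (hypothetical helper, simple cases only)."""
    head, field, timestamp = _split_unescaped(line.strip(), " ")
    measurement, *tag_pairs = _split_unescaped(head, ",")
    tags = {}
    for pair in tag_pairs:
        key, value = _split_unescaped(pair, "=")
        tags[_unescape(key)] = _unescape(value)
    field_key, field_value = field.split("=", 1)
    return {
        "measurement": _unescape(measurement),
        "tags": tags,
        # Integer fields carry a trailing "i" in line protocol.
        "fields": {field_key: int(field_value.rstrip("i"))},
        "timestamp": int(timestamp),
    }
```

Calling parse_rejected_point(EXAMPLE_LINE) yields a dictionary whose tags entry holds the six tags, with the reason tag unescaped to plain text.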

A brief description of the tags and what they mean:

bucket: the ID of the bucket that the point was to have been written to (always present).
reason: a brief textual description of why the point was dropped (always present).
field: the field name of the point (present whenever the point had a field).
measurement: the measurement of the point (present whenever the point had a measurement).
gotType: the type of the field value in the dropped point.
wantType: the type that the field value should have been.

Note that not all of the information about the dropped point is written to _monitoring (for example, the point's own tags are not included). This is deliberate, to keep the cardinality of the _monitoring bucket under control.
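One way to look at these entries is a Flux query over the _monitoring bucket, filtered on the rejected_points measurement and optionally on the bucket tag. The Python sketch below only builds the query text; the helper name rejected_points_query and its parameters are hypothetical, and it assumes nothing beyond the standard Flux from(), range() and filter() functions. How the query is then sent (UI, CLI or a client library) is left open.

```python
def rejected_points_query(start="-1h", bucket_id=None):
    """Build a Flux query for rejected_points entries (hypothetical helper).

    start is a Flux duration or timestamp, bucket_id an optional bucket ID
    to narrow the results. Values are interpolated as-is, so only pass
    trusted input.
    """
    lines = [
        'from(bucket: "_monitoring")',
        f"  |> range(start: {start})",
        '  |> filter(fn: (r) => r._measurement == "rejected_points")',
    ]
    if bucket_id is not None:
        # The bucket tag holds the ID of the bucket the point was meant for.
        lines.append(f'  |> filter(fn: (r) => r.bucket == "{bucket_id}")')
    return "\n".join(lines)


if __name__ == "__main__":
    print(rejected_points_query(start="-24h", bucket_id="01234f6701e34dd7"))
```

Because all the interesting information is in the tags, further filters on reason, field or measurement can be added in the same way.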

Here is a more formal description of the entries in the _monitoring bucket, expressed in the CUE language.

// example shows how the above example looks in CUE format.
example: entries: [{
    measurement: "rejected_points"
    tags: {
        reason:      "type conflict in batch write"
        bucket:      "01234f6701e34dd7"
        measurement: "somemeasurement"
        field:       "somefield"
        gotType:     "Float"
        wantType:    "Integer"
    }
    fields: count: int: 1
    timestamp: 1627906197091972750
}]
example: entries: [... #MonitoringBucketEntry]

// #LPEntry maps to a line-protocol entry in canonical form (no duplicate tag or field keys, all keys lexically ordered, with timestamp).
#LPEntry: {
    measurement: string
    tags: [string]:   string
    fields: [string]: #LPFieldValue
    timestamp: int
}

// #LPFieldValue represents a line-protocol field value.
#LPFieldValue: {
    int: int
} | {
    uint: int & >=0
} | {
    float: number
} | {
    bool: bool
} | {
    string: string
}

// #MonitoringBucketEntry constrains the data inside a _monitoring bucket.
#MonitoringBucketEntry:
    #BatchTypeConflict |
    #ExistingTypeConflict |
    #LPEntry        // TODO restrict this more.

// #RejectedPointEntry constrains the "rejected_points" measurement in the
// _monitoring bucket.
#RejectedPointEntry: #LPEntry & {
    measurement: "rejected_points"
    tags: {
        bucket:      #InfluxID
        field:       string
        measurement: string
        reason:      string
        ...
    }
    fields: count: int: 1
}

// #BatchTypeConflict represents an entry for a point that's
// dropped because it has a type that's in conflict with another
// point that has the same series within the same batch write.
#BatchTypeConflict: #RejectedPointEntry & {
    tags: {
        reason:      "type conflict in batch write"
        gotType:     #DroppedPointType
        wantType:    #DroppedPointType
    }
    fields: count: int: 1
}

// #ExistingTypeConflict represents an entry for a point
// that's dropped because it has a type that's in conflict
// with another point that has the same series that's already
// stored inside InfluxDB.
#ExistingTypeConflict: #RejectedPointEntry & {
    tags: {
        reason:      "type conflict with existing data"
        gotType:     #DroppedPointType
        wantType:    #DroppedPointType
    }
    fields: count: int: 1
}

// #InfluxID constrains an organization or bucket ID.
#InfluxID: =~"^[0-9a-f]{16}$"

// #DroppedPointType is one of the possible InfluxDB field types.
#DroppedPointType: "Float" | "Integer" | "UnsignedInteger" | "Boolean" | "String"
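For readers who do not use CUE, the same constraints can be approximated in a few lines of Python. This is a rough sketch, not a replacement for the schema: the function name check_rejected_point and the dictionary shape (measurement, tags, fields, timestamp keys, with fields given as {"count": 1}) are hypothetical, and it assumes IDs are 16 lowercase hexadecimal characters.

```python
import re

# Assumption: organization and bucket IDs are 16 lowercase hex characters.
INFLUX_ID = re.compile(r"^[0-9a-f]{16}$")

# The field types listed in #DroppedPointType.
FIELD_TYPES = {"Float", "Integer", "UnsignedInteger", "Boolean", "String"}

# The reason strings used by #BatchTypeConflict and #ExistingTypeConflict.
TYPE_CONFLICT_REASONS = {
    "type conflict in batch write",
    "type conflict with existing data",
}


def check_rejected_point(entry):
    """Return a list of problems; an empty list means the entry looks valid."""
    problems = []
    if entry.get("measurement") != "rejected_points":
        problems.append("measurement must be rejected_points")
    tags = entry.get("tags", {})
    if not INFLUX_ID.match(tags.get("bucket", "")):
        problems.append("bucket tag must be a 16-character hex ID")
    if "reason" not in tags:
        problems.append("reason tag is required")
    if entry.get("fields") != {"count": 1}:
        problems.append("the only field must be count=1")
    # Type conflicts additionally carry gotType and wantType.
    if tags.get("reason") in TYPE_CONFLICT_REASONS:
        for key in ("gotType", "wantType"):
            if tags.get(key) not in FIELD_TYPES:
                problems.append(f"{key} must be one of the known field types")
    return problems
```

An entry with an unknown reason passes the type checks, which loosely mirrors the catch-all #LPEntry alternative in #MonitoringBucketEntry.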

timhallinflux commented 3 years ago

As part of this, we need a troubleshooting section in the docs. Everything in the current documentation related to writes assumes a very sunny day scenario.

I would like to see a new Troubleshooting section added here: [screenshot: Screen Shot 2021-08-19 at 4 42 47 PM]

We need to describe in detail the potential ways in which writes can fail, what the response codes are for writes, and then include this new content for the exception cases. Write failure scenarios should include:

rate limit failures,
timeouts,
size of payload exceeding limit,
size of HTTP headers,
can writes partially succeed? (under what circumstances are writes rejected at a batch level vs. a partial success...and I believe this leads into this new content).

See the Responses section here: https://docs.influxdata.com/influxdb/cloud/api/#operation/PostWrite (and note that the responses listed in the right-hand column are incomplete; a more complete list is found in the main body panel).

Additionally, we have a new reason writes could fail: a payload that does not conform to the schema of an explicit-schema bucket. That should be described and documented as well.

jstirnaman commented 3 years ago

Currently the user would receive a 202 HTTP code to say that their write had been accepted. This could be interpreted as "everything is fine". Then, when they come to read that data, they find it is missing from their results.

Success code semantics: a successful or unsuccessful write currently returns 204 No Content. 202 would be more appropriate to indicate that the request succeeded, but the write operation could fail asynchronously.
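As a rough illustration of what that means for a client, the Python sketch below turns a write response status into a hint for the caller. It covers only the two codes discussed in this thread (204 and 400); the function name describe_write_status and the wording of the hints are hypothetical, and other status codes are deliberately left to the API reference.

```python
def describe_write_status(status):
    """Map a write response status to a short hint (hypothetical helper)."""
    if status == 204:
        # Accepted is not the same as stored: points can still be dropped
        # later, for example because of a field type conflict.
        return (
            "accepted; points may still be rejected asynchronously, "
            "check rejected_points in the _monitoring bucket"
        )
    if status == 400:
        # The request was refused up front, for example malformed line protocol.
        return "rejected synchronously; inspect the error message in the response body"
    return "not covered by this sketch; see the write API reference"


if __name__ == "__main__":
    for code in (204, 400):
        print(code, describe_write_status(code))
```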

jstirnaman commented 3 years ago

@8none1 Based on your description and the responses I get from OSS and Cloud currently, the API won't return 400 for a malformed line protocol payload, correct? That's contrary to 400 described in https://docs.influxdata.com/influxdb/cloud/api/#operation/PostWrite. For example, this still returns 204:

  --data-raw "
      memhost=host1 used_percent=25.4345351630076819792518000
      memhost=host2 used_percent=25.4345351630076819792518000
      " \

Is there some other condition that can return a 400? I haven't found it yet.

jstirnaman commented 3 years ago

@8none1 Based on your description and the responses I get from OSS and Cloud currently, the API won't return 400 for a malformed line protocol payload, correct? That's contrary to 400 described in https://docs.influxdata.com/influxdb/cloud/api/#operation/PostWrite. For example, this still returns 204:
  --data-raw "
      memhost=host1 used_percent=25.4345351630076819792518000
      memhost=host2 used_percent=25.4345351630076819792518000
      " \
Is there some other condition that can return a 400? I haven't found it yet.

Never mind. I'm wrong. I was able to get a 400. That was a bad example that is actually valid LP and did write.

jstirnaman commented 3 years ago

@8none1 or @rogpeppe Has the rejected_points logging been deployed? I'm trying to test it for documentation, but so far haven't seen any in _monitoring.

philjb commented 3 years ago

Has the rejected_points logging been deployed?

It is not available for customers currently.

8none1 commented 2 years ago

Heads up for folks following this feature. We've quietly enabled it on prod101-us-east-1 AWS to see how it performs in the real world. Depending on how that goes, we can start getting ready to roll it out everywhere.

8none1 commented 2 years ago

Hi folks, we're getting ready to enable this for everyone. What do you need to be able to move this forwards? Let us know how we can help.

jstirnaman commented 2 years ago

Hi folks, we're getting ready to enable this for everyone. What do you need to be able to move this forwards? Let us know how we can help.

@8none1 Has anything changed in the monitoring logging since your last update on 9/28? If not, we just need to resolve some conflicts in the branch and push it out.

jstirnaman commented 2 years ago

Resolved conflicts and fixed a few issues in https://github.com/influxdata/docs-v2/pull/3109

8none1 commented 2 years ago

@8none1 Has anything changed in the monitoring logging since your last update on 9/28? If not, we just need to resolve some conflicts in the branch and push it out.

Yoiks! Sorry for the delay. Nothing has changed.

Really excited to see this out in the wild soon!

influxdata / docs-v2

Additional logging available to user of Cloud2 to show when a write has been rejected #3003
