kube-reporting / metering-operator

The Metering Operator is responsible for collecting metrics and other information about what's happening in a Kubernetes cluster, and providing a way to create reports on the collected data.

AWS Billing partitionSpec mismatch #1002

SapientGuardian commented 4 years ago

I'm trying to get my AWS billing report loaded in, and am running into errors like this:

validation failure list:\nspec.partitions.partitionSpec in body must be of type array: "object"" app=metering component=reportDataSourceWorker error="HiveTable.metering.openshift.io \"reportdatasource-metering-aws-billing\" is invalid: []: Invalid value: map[string]

followed by a dump of the HiveTable that it's trying to apply. I've copied the relevant part below, with the table name redacted:

"databaseName": "metering",
        "external": true,
        "location": "s3a://my-redacted-billing-reports-bucket/cur-report/",
        "managePartitions": true,
        "partitionedBy": []interface {} {
            map[string]interface {} {
                "name": "billing_period_start",
                "type": "string"
            },
            map[string]interface {} {
                "name": "billing_period_end",
                "type": "string"
            }
        },
        "partitions": []interface {} {
            map[string]interface {} {
                "location": "s3a://my-redacted-billing-reports-bucket/cur-report/cur-report/20191001-20191101/d0abd31d-5d9b-4af1-80e9-8596bcc7b6e3/",
                "partitionSpec": map[string]interface {} {
                    "end": "20191101",
                    "start": "20191001"
                }
            }
        },
        "rowFormat": "\\nSERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'\\nWITH SERDEPROPERTIES (\\n    \\" serialization.format \\ " = \\",

Looking through the Go source code, it seemed like partitionSpec really should be an object, not a string array. So I modified hive.crd.yaml from this...

            partitions:
              type: array
              description: |
                A list of partitions that this Hive table should contain.
                Note: this is an optional field.
              items:
                type: object
                required:
                - partitionSpec
                - location
                properties:
                  partitionSpec:
                    type: array
                    description: |
                      PartitionSpec is a map containing string keys and values, where each key
                      is expected to be the name of a partition column, and the value is the
                      value of the partition column.
                  location:
                    type: string
                    description: |
                      Location specifies where the data for this partition is stored.
                      This should be a sub-directory of the "location" field.
                    minLength: 1
                    format: uri

to this...

            partitions:
              type: array
              description: |
                A list of partitions that this Hive table should contain.
                Note: this is an optional field.
              items:
                type: object
                required:
                - partitionSpec
                - location
                properties:
                  partitionSpec:
                    type: object
                    description: |
                      PartitionSpec is a map containing string keys and values, where each key
                      is expected to be the name of a partition column, and the value is the
                      value of the partition column.
                  location:
                    type: string
                    description: |
                      Location specifies where the data for this partition is stored.
                      This should be a sub-directory of the "location" field.
                    minLength: 1
                    format: uri
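
For reference, a partitions entry that validates against the corrected (object) schema would look something like this, reusing the values from the HiveTable dump above (illustration only, not output from the operator):

partitions:
- location: "s3a://my-redacted-billing-reports-bucket/cur-report/cur-report/20191001-20191101/d0abd31d-5d9b-4af1-80e9-8596bcc7b6e3/"
  partitionSpec:
    # map of partition column name -> partition value, as the Go code builds it
    start: "20191001"
    end: "20191101"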

That allowed the HiveTable update to actually go through, but then I started getting errors like this:

time="2019-10-23T14:04:24Z" level=error msg="error syncing HiveTable reportdatasource-metering-aws-billing" app=metering component=hiveTableWorker error="failed to add partition `billing_period_start`=``, `billing_period_end`=`` location s3a://my-redacted-billing-reports-bucket/cur-report/cur-report/20191001-20191101/d0abd31d-5d9b-4af1-80e9-8596bcc7b6e3/ to Hive table \"datasource_metering_aws_billing\": hive: query failed. errmsg=Error while compiling statement: FAILED: ParseException line 1:96 cannot recognize input near '' ',' 'billing_period_end' in constant" hiveTable=reportdatasource-metering-aws-billing logID=hsAWCI9NrM namespace=metering

I couldn't reconcile this behavior with the Go code. aws_usage_hive.go seems to hard-code the names as "start" and "end":

        p := metering.HiveTablePartition{
            Location: location,
            PartitionSpec: hive.PartitionSpec{
                "start": start,
                "end":   end,
            },
        }

But db.go seems to look for keys matching the column names, which I guess are billing_period_start and billing_period_end in this case:

func FmtPartitionSpec(partitionColumns []hive.Column, partSpec hive.PartitionSpec) string {
    var partitionVals []string
    for _, col := range partitionColumns {
        val := partSpec[col.Name]
        // Quote strings
        if strings.ToLower(col.Type) == "string" {
            val = "`" + val + "`"
        }
        partitionVals = append(partitionVals, fmt.Sprintf("`%s`=%s", col.Name, val))
    }
    return strings.Join(partitionVals, ", ")
}
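
To make the mismatch concrete, here is a small, self-contained sketch (hypothetical stand-in types, not the actual hive package) of what that lookup produces for this table:

package main

import (
    "fmt"
    "strings"
)

// Simplified stand-ins for hive.Column and hive.PartitionSpec, just to
// reproduce the FmtPartitionSpec behavior outside the operator.
type Column struct{ Name, Type string }
type PartitionSpec map[string]string

func fmtPartitionSpec(partitionColumns []Column, partSpec PartitionSpec) string {
    var partitionVals []string
    for _, col := range partitionColumns {
        val := partSpec[col.Name] // empty string when the key doesn't match the column name
        if strings.ToLower(col.Type) == "string" {
            val = "`" + val + "`"
        }
        partitionVals = append(partitionVals, fmt.Sprintf("`%s`=%s", col.Name, val))
    }
    return strings.Join(partitionVals, ", ")
}

func main() {
    cols := []Column{{"billing_period_start", "string"}, {"billing_period_end", "string"}}
    // Keys as hard-coded by aws_usage_hive.go: they never match the column names above.
    spec := PartitionSpec{"start": "20191001", "end": "20191101"}
    fmt.Println(fmtPartitionSpec(cols, spec))
    // Prints: `billing_period_start`=``, `billing_period_end`=``
    // which is exactly the empty partition spec seen in the ParseException above.
}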

What's going on here? It seems like there's a mismatch between the HiveTable spec used by aws_usage_hive.go, the Hive CRD, and db.go.

chancez commented 4 years ago

The CRD validation was added much later, so it's entirely possible we messed that up. We also haven't had access to an AWS account with billing data in it until recently, and we've been hitting issues with that (unrelated to any of this, though), so it's been problematic for us to test.

The thing to know is that columns and partition columns are separate, and you can't use regular columns as partition columns. So when you define the partitions, it's like defining another column, but it's a partition column, hence the usage of hive.Column in FmtPartitionSpec. But the actual partitions themselves are a separate list of locations that we can point to in s3/hdfs/etc for specific partition values. For example, if start/end is 02012019 and 03012019, it might point at s3a://my-bucket/report/02012019-03012019 or something.
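
As a rough illustration (hypothetical table and column names, not the DDL the operator actually generates), the Hive side of that looks like:

-- Regular columns hold the data; partition columns are declared separately.
CREATE EXTERNAL TABLE my_report (
    line_item_usage_amount double,
    line_item_unblended_cost double
)
PARTITIONED BY (
    billing_period_start string,
    billing_period_end string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
LOCATION 's3a://my-bucket/report/';

-- Each partition value then maps to its own location under the table's location.
ALTER TABLE my_report ADD IF NOT EXISTS
PARTITION (billing_period_start='02012019', billing_period_end='03012019')
LOCATION 's3a://my-bucket/report/02012019-03012019/';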

It would be easier to debug if you could get the Hive metastore and Hive server logs (just the metastore and hiveserver2 container logs are fine). That said, I see the issues you've pointed out as the most likely candidates for some of these problems.

If I built a custom reporting-operator image, would you be able to try it out to see if it resolves the issue for you? As I mentioned, our AWS billing reports are giving us issues in other ways (preventing us from even scanning the bucket for reports) due to how they were set up, and I've got no real way to test this right now.

SapientGuardian commented 4 years ago

I've nearly made this work, with the following changes:

Partitionspec field names need to match the columns

On line 154 of pkg\operator\aws_usage_hive.go, change this:

p := metering.HiveTablePartition{
            Location: location,
            PartitionSpec: hive.PartitionSpec{
                "start": start,
                "end":   end,
            },
        }

to this:

p := metering.HiveTablePartition{
            Location: location,
            PartitionSpec: hive.PartitionSpec{
                "billing_period_start": start,
                "billing_period_end":   end,
            },
        }

HiveTable CRD needs to accept an object/map, not a string array, for partitionSpec

Change hive.crd.yaml from this...

            partitions:
              type: array
              description: |
                A list of partitions that this Hive table should contain.
                Note: this is an optional field.
              items:
                type: object
                required:
                - partitionSpec
                - location
                properties:
                  partitionSpec:
                    type: array
                    description: |
                      PartitionSpec is a map containing string keys and values, where each key
                      is expected to be the name of a partition column, and the value is the
                      value of the partition column.
                  location:
                    type: string
                    description: |
                      Location specifies where the data for this partition is stored.
                      This should be a sub-directory of the "location" field.
                    minLength: 1
                    format: uri

to this...

            partitions:
              type: array
              description: |
                A list of partitions that this Hive table should contain.
                Note: this is an optional field.
              items:
                type: object
                required:
                - partitionSpec
                - location
                properties:
                  partitionSpec:
                    type: object
                    description: |
                      PartitionSpec is a map containing string keys and values, where each key
                      is expected to be the name of a partition column, and the value is the
                      value of the partition column.
                  location:
                    type: string
                    description: |
                      Location specifies where the data for this partition is stored.
                      This should be a sub-directory of the "location" field.
                    minLength: 1
                    format: uri

Literals need to be quoted with single quotes, not backticks.

On line 73 of pkg\operator\reporting\db.go, change this:

val = "`" + val + "`"

to this:

val = "'" + val + "'"

LOCATION needs to be quoted

On line 55 of db.go, change this:

locationStr = "LOCATION " + partition.Location

to this:

locationStr = "LOCATION '" + partition.Location + "'"

With all of those changes, the partition statement looks plausible. It still fails though, with this:

time="2019-10-23T17:56:05Z" level=error msg="error syncing HiveTable reportdatasource-metering-aws-billing" app=metering component=hiveTableWorker error="failed to add partition `billing_period_start`='20191001', `billing_period_end`='20191101' location s3a://my-redacted-billing-reports-bucket/cur-report/cur-report/20191001-20191101/d0abd31d-5d9b-4af1-80e9-8596bcc7b6e3/ to Hive table \"datasource_metering_aws_billing\": hive: query failed. errmsg=Error while compiling statement: FAILED: SemanticException [Error 10001]: Table not found datasource_metering_aws_billing" hiveTable=reportdatasource-metering-aws-billing

chancez commented 4 years ago

Yep, these are the same changes I was looking at, minus the quote changes, but those seem necessary too since we used the wrong quotes. That last one is probably caused by https://github.com/operator-framework/operator-metering/issues/993, which looks like it's the other one you filed. I may try to combine these PRs and bugs to make it easier to fix.

SapientGuardian commented 4 years ago

If I built a custom reporting-operator image would you be able to try it out to see if it resolves the issue for you?

Yeah, I'm happy to help test with a custom image. You'll need to get the SDK updated like I have in my private fork; that's the only change I'm running with in that image other than the ones I just mentioned.

chancez commented 4 years ago

@SapientGuardian I'll make a separate PR for just that, which we can get merged sooner rather than later, since I actually need to do that for the other bug I was hitting that I mentioned.

chancez commented 4 years ago

PR #1004 is to update the SDK.

chancez commented 4 years ago

I have a reporting-operator image which contains the SDK update, datasource + query name fixes, and partitionSpec handling fixes. The image is quay.io/ecnahc515/origin-metering-reporting-operator:fix-aws-datasources1, and you can specify it in your MeteringConfig like so (plus any other modifications to the MeteringConfig you've made):

apiVersion: metering.openshift.io/v1
kind: MeteringConfig
metadata:
  name: "operator-metering"
spec:
  openshift-reporting:
    spec:
      awsBillingReportDataSource:
        enabled: true
        bucket: "your-bucket"
        prefix: "your-report-prefix"
        region: "us-east-1"

  storage:
    type: "hive"
    hive:
      type: "s3"
      s3:
        bucket: "your-stuff"
        secretName: "your-secret"
        createBucket: false

  reporting-operator:
    spec:
      config:
        aws:
          secretName: your-secret
      image:
        repository: quay.io/ecnahc515/origin-metering-reporting-operator
        tag: fix-aws-datasources1
  presto:
    spec:
      config:
        aws:
          secretName: your-secret
  hive:
    spec:
      config:
        aws:
          secretName: your-secret

chancez commented 4 years ago

I also found another environment with billing data, though not for the clusters I'm using. It should still work for testing these fixes, so I'm working on verifying this myself now as well.

chancez commented 4 years ago

Oh, you'll also need to run the modified metering-ansible-operator since it has the updates to the templates. Follow https://github.com/operator-framework/operator-metering/blob/master/Documentation/manual-install.md#install-with-a-custom-metering-operator-image for how to install with a custom metering-operator image, and use quay.io/ecnahc515/origin-metering-ansible-operator:fix-aws-datasources1 as the image. For example:

export METERING_OPERATOR_IMAGE_REPO=quay.io/ecnahc515/origin-metering-ansible-operator
export METERING_OPERATOR_IMAGE_TAG=fix-aws-datasources1
./hack/openshift-install.sh

chancez commented 4 years ago

Well, I found a panic as part of this. Making progress, though.

SapientGuardian commented 4 years ago

I fired up the fix-aws-datasources1 reporting-operator image. As expected (since I made the same changes), it behaves just like mine does, getting far enough to fail on creating the partition with errmsg=Error while compiling statement: FAILED: SemanticException [Error 10001]: Table not found datasource_metering_aws_billing.

I haven't tested your ansible image yet, though I'm pretty sure I have all the same changes on mine. Assuming it has all of the active PRs merged in (does it?), the one necessary change I have in my fork that I haven't seen a PR for is setting fsGroup in reporting-operator-deployment.yaml so the non-root user can read the projected token:

    spec:
      securityContext:
        fsGroup: 3001

Is the panic you ran into the one about dbName being empty, coinciding with errors saying both that the metering database doesn't exist yet and that it already exists, at the same time? I was holding off on filing an issue for that until the other things got resolved. I'm not actually sure how to fix that one; I couldn't figure out what the real problem was. But I did figure out a workaround: don't enable the awsBillingReportDataSource on a fresh install; wait until everything else is created, then enable it.
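
Concretely, that workaround is just toggling the flag in the MeteringConfig (sketch only, same fields as in your example above):

  openshift-reporting:
    spec:
      awsBillingReportDataSource:
        # leave disabled for the initial install...
        enabled: false
        bucket: "your-bucket"
        prefix: "your-report-prefix"
        region: "us-east-1"
        # ...then set enabled: true and re-apply once everything else is created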

SapientGuardian commented 4 years ago

I used the Hive CLI (beeline) to manually run

ALTER TABLE datasource_metering_aws_billing ADD IF NOT EXISTS PARTITION (`billing_period_start`='20191001', `billing_period_end`='20191101') LOCATION 's3a://my-redacted-billing-bucket/cur-report/cur-report/20191001-20191101/d0abd31d-5d9b-4af1-80e9-8596bcc7b6e3/';

and it worked, so I'm not sure why the reporting operator is getting "table not found". I'm wondering if somehow it's not in the metering database when it tries to run it?
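
One way to check that from beeline (plain HiveQL, nothing metering-specific; just a suggestion):

SHOW DATABASES;
USE metering;
SHOW TABLES LIKE 'datasource_metering_aws_billing';
-- If the table only resolves after USE metering, the operator's session is
-- probably issuing the ALTER TABLE against the default database instead.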

chancez commented 4 years ago

Table not found datasource_metering_aws_billing is, I believe, because the datasource is named incorrectly, which is what requires the ansible operator change. That said, I don't understand why adding the partition manually works, since I'd expect the table name to be off.

I did forget to build with the fsGroup option in the ansible operator, though. Just fixed that.

The panic is that issue, yes. It happens because it's checking the spec, not the status, of the datasource for the database name, which causes a few issues.

I'm continuing to test this this week and next, so I hope to have verified it myself soon. I'm also looking into how we can mock this out in some tests.

SapientGuardian commented 4 years ago

If it's a clue: while the table did exist in Hive, it appeared to be empty (but did have a ton of columns).

Did you push the fsGroup change to the fix-aws-datasources1 tag on the ansible image? I just tried using that, and it seems like it doesn't have it.

chancez commented 4 years ago

How did you attempt to use it? Can I see your MeteringConfig? It should be updated.

chancez commented 4 years ago

Oh, the table-not-found issue is the database schema not being properly selected when we add partitions, I think.
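
If that's it, the fix presumably amounts to either issuing a USE metering first or qualifying the table name with the database, e.g. (sketch, reusing the partition values from above):

ALTER TABLE metering.datasource_metering_aws_billing
ADD IF NOT EXISTS PARTITION (`billing_period_start`='20191001', `billing_period_end`='20191101')
LOCATION 's3a://my-redacted-billing-bucket/cur-report/cur-report/20191001-20191101/d0abd31d-5d9b-4af1-80e9-8596bcc7b6e3/';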

SapientGuardian commented 4 years ago

I used the env vars; this is included in the output of ./hack/upstream-install.sh:

using $METERING_OPERATOR_IMAGE_REPO=quay.io/ecnahc515/origin-metering-ansible-operator to override metering-operator image
using $METERING_OPERATOR_IMAGE_TAG=fix-aws-datasources1 to override metering-operator image tag

Here's my MeteringConfig:

apiVersion: metering.openshift.io/v1
kind: MeteringConfig
metadata:
  name: "operator-metering"
spec:
  disableOCPFeatures: true

  reporting-operator:
    spec:      
      image:
        repository: quay.io/ecnahc515/origin-metering-reporting-operator
        tag: fix-aws-datasources1
      config:
        prometheus:
          url: "http://prometheus-operator-prometheus.monitoring.svc.cluster.local:9090"
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::redacted:role/redacted/redacted-eks-operator-metering
  hive:
    spec:
      image:
        repository: artifactory.redacted.cloud/dsax/origin-metering-hive
        tag: web-identity-token
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::redacted:role/redacted/redacted-eks-operator-metering
  presto:
    spec:
      image:
        repository: artifactory.redacted.cloud/dsax/metering-presto
        tag: latest
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::redacted:role/redacted/redacted-eks-operator-metering
      config:
        connectors:
          s3:
            useInstanceCredentials: false
  storage:
    type: hive
    hive:
      type: "s3"
      s3:
        bucket: "redacted-hive-bucket/"
        createBucket: false
        region: "us-west-2"

  openshift-reporting:
    spec:
      awsBillingReportDataSource:
        enabled: true
        # Replace these with where your AWS billing reports are
        # stored in S3.
        bucket: "redacted-billing-bucket"
        prefix: "cur-report"
        region: "us-west-2"

With your ansible image, the reporting-operator keeps spitting out this error:

time="2019-10-25T18:54:36Z" level=error msg="error syncing ReportDataSource \"metering/aws-billing\", adding back to queue" ReportDataSource=metering/aws-billing app=metering component=reportDataSourceWorker error="could not list retrieve AWS billing report keys: WebIdentityErr: unable to read file at /var/run/secrets/eks.amazonaws.com/serviceaccount/token\ncaused by: open /var/run/secrets/eks.amazonaws.com/serviceaccount/token: permission denied" logID=ZBIpODgcXx

kubectl get deployment/reporting-operator -n metering -o yaml | grep fsGroup returns nothing.

Here's the describe of the ansible pod:

[root@cloud-toolbox operator-metering]$ kubectl describe pod/metering-operator-6d9fc7fcc8-2cwwb -n metering
Name:               metering-operator-6d9fc7fcc8-2cwwb
Namespace:          metering
Priority:           0
PriorityClassName:  <none>
Node:               ip-10-52-199-203.us-west-2.compute.internal/10.52.199.203
Start Time:         Fri, 25 Oct 2019 17:14:19 +0000
Labels:             app=metering-operator
                    pod-template-hash=6d9fc7fcc8
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Running
IP:                 10.52.198.142
Controlled By:      ReplicaSet/metering-operator-6d9fc7fcc8
Containers:
  ansible:
    Container ID:  docker://923a4a30a3aa2caefc7de9d333bb4879e1d0838e6872ee3ab3e885483410f338
    Image:         quay.io/ecnahc515/origin-metering-ansible-operator:fix-aws-datasources1
    Image ID:      docker-pullable://quay.io/ecnahc515/origin-metering-ansible-operator@sha256:ddb03d9b789348ba35fa6411b73eedde9b0a474bce1b01ede584c66194b83a6e

chancez commented 4 years ago

You need to specify the fsGroup in your MeteringConfig:

  reporting-operator:
    spec:
      securityContext:
        fsGroup: 3001

chancez commented 4 years ago

I believe I've got everything working; I'm just verifying everything and making sure some changes are still necessary. I'll post here when I republish the images.

chancez commented 4 years ago

OK, I was finally able to generate a report. I pushed the updated images just now, and I expect it should all work. I'll begin working on opening a PR today and Monday and getting bugs sorted so we can get this backported into the next 4.2.z release as well.

SapientGuardian commented 4 years ago

Seems like we're super close. I ran into one easily fixed issue with the reporting operator: could not list retrieve AWS billing report keys: WebIdentityErr: failed to retrieve credentials\ncaused by: MissingRegion: could not find region configuration, which I solved by adding the AWS_REGION env var to the deployment spec (roughly as sketched below). After that, there are no errors, and the partitioning is happening successfully, but...

While the Hive table does have a bunch of columns, suggesting it did find the usage manifest, the table has no data in it, and the aws-billing ReportDataSource has no times logged for import or metrics. I have about 30 GB worth of csv.gz files in the billing bucket. Suggestions?
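
For reference, the AWS_REGION change mentioned above amounts to roughly this in the reporting-operator Deployment (sketch only; the container name is assumed):

spec:
  template:
    spec:
      containers:
      - name: reporting-operator   # container name assumed
        env:
        - name: AWS_REGION
          value: "us-west-2"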

Edit:

Never mind, I figured out what's going on here.