privacysandbox / aggregation-service

This repository contains instructions and scripts to set up and test the Privacy Sandbox Aggregation Service
Apache License 2.0

Job status is always RECEIVED #53

Closed yanghuang1028 closed 4 months ago

yanghuang1028 commented 4 months ago

Hi team,

Our aggregation service deployed successfully, but after we create a job, its status always stays RECEIVED. Do you have any clues about what could cause this? Our project ID is ecs-1709881683838.

image image

Thanks a lot~~

chasinandrew commented 4 months ago

Based on the screenshot, I assume this is a GCP environment, correct? Does this happen on every job or only intermittently?

Jobs can get stuck with the "RECEIVED" status when the instances within the managed instance group (MIG) are not running or have crashed. If service account onboarding has not been completed, the MIG could be in an unhealthy state. Can you confirm that the service account has been onboarded?
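
Before digging into the MIG, it can help to confirm that a job really is stuck rather than just queued. The following is a minimal sketch, assuming you already have the job's status response as a parsed JSON object with the `job_status` and `request_received_at` fields. The helper names and the 15-minute threshold are illustrative assumptions, not values defined by the Aggregation Service.

```python
import re
from datetime import datetime, timedelta, timezone

# Timestamps in the job response may carry up to nanosecond precision,
# so the fraction is truncated to microseconds before building a datetime.
_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$")


def parse_ts(value: str) -> datetime:
    """Parse a UTC timestamp such as 2024-05-16T01:29:35.184066241Z."""
    match = _TS.match(value)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {value!r}")
    base = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    micros = int((match.group(2) or "0")[:6].ljust(6, "0"))
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


def is_stuck_in_received(job: dict, now: datetime,
                         threshold: timedelta = timedelta(minutes=15)) -> bool:
    """Return True if the job is still RECEIVED after `threshold` (assumed value)."""
    if job.get("job_status") != "RECEIVED":
        return False
    return now - parse_ts(job["request_received_at"]) > threshold
```

If this returns True for every job you submit, the workers are probably not picking up work at all, which points back at the MIG checks below.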

To check the MIG’s health in the cloud console:

  1. Navigate to Compute Engine > Instance Groups
  2. Select your instance group then select the errors tab

To check the MIG's health using the gcloud CLI:

gcloud compute instance-groups managed list-errors <MIG_NAME> --region=<REGION>
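
If you want to run the gcloud check from a script, here is a small sketch that wraps the same `list-errors` command. It assumes the gcloud CLI is installed and authenticated on the machine. The function names are hypothetical, and `--format=json` is added only so the output can be parsed.

```python
import json
import subprocess


def list_errors_command(mig_name: str, region: str) -> list[str]:
    """Build the gcloud invocation for a regional managed instance group."""
    return [
        "gcloud", "compute", "instance-groups", "managed", "list-errors",
        mig_name,
        f"--region={region}",
        "--format=json",  # machine-readable output instead of the default table
    ]


def list_mig_errors(mig_name: str, region: str) -> list:
    """Run gcloud and return the parsed list of MIG errors (empty if none)."""
    result = subprocess.run(
        list_errors_command(mig_name, region),
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout or "[]")
```

An empty list only means the MIG itself reported no errors. Individual worker VMs can still be failing, so their logs are worth checking as well.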

yanghuang1028 commented 4 months ago

Hi @chasinandrew ,

Yes, it is a GCP environment, and this issue happens on every job.

I checked the MIG's health, and there are no errors, only one warning.

image image

BTW, we have 4 worker VM instances, and I found a 403 error in three of them. Some of them seem unstable and keep restarting. Could this be the root cause?

image image

Thank you for the quick reply, it really helps!

chasinandrew commented 4 months ago

No problem! This could be happening because of the unstable VMs. To help us replicate this, could you provide the following info:

  1. Which aggregation service version do you have deployed?
  2. Can you please provide the terraform deployment parameters if they're available?
  3. Can you send the JSON in plaintext or file form with the request and response?
  4. If available, can you send the avro report and output_domain.avro?

yanghuang1028 commented 4 months ago

Hi @chasinandrew ,

  1. We deployed from the latest repo (https://github.com/privacysandbox/aggregation-service), so is the version v2.4.2?

image

  2. The deployment parameters: dev.auto.tfvars.txt
  3. Request and response: request&response.txt
  4. Comments don't support attaching Avro files, so I uploaded them to my GitHub repo: avro report, output_domain.avro

Our Google Cloud link is https://console.cloud.google.com/home/dashboard?project=ecs-1709881683838, but I don't know if you have permission to access it.

Thank you for helping dig into the issue~

chasinandrew commented 4 months ago

Thanks @yanghuang1028! This 403 error can happen when onboarding is incomplete. Can you please fill out this onboarding form to register your domain and service account?

yanghuang1028 commented 4 months ago

@chasinandrew We filled out the form a few weeks ago, and your team sent us an email.

image

Oh, I see. We used a different service account for this deployment. Could you help us update the worker service account? Our new worker service account is sa-worker-aggregation-service@ecs-1709881683838.iam.gserviceaccount.com

BTW, we only registered the domain of our production environment. If we do not register the staging environment's domain, can the aggregation service still process reports from staging (we can currently change Chrome's settings manually to receive reports from the staging environment)? Our staging reporting site is https://adservice-1.stratus.qa.ebay.com/

Thanks again!

hostirosti commented 4 months ago

Hi @yanghuang1028, I recommend communicating this information through our support email alias. I'll hide your previous comment so that the information isn't public.

@chasinandrew please move support conversations around onboarding to email.

Re your question on prod vs. staging: your service account is tied to the site that is onboarded. If the same service account (in the same GCP project) is used to process your reports, you'll be able to process them in both staging and prod. If a different account is used, a separate onboarding request is required.

yanghuang1028 commented 4 months ago

Hi @hostirosti @chasinandrew

Thanks for protecting our private information!

The separate onboarding request is completed, and the job can be processed now. However, the job threw a TRANSACTION_MANAGER_RETRIES_EXCEEDED error during processing.

{
    "job_status": "FINISHED",
    "request_received_at": "2024-05-16T01:19:59.234435Z",
    "request_updated_at": "2024-05-16T01:29:35.184066241Z",
    "job_request_id": "test05",
    "input_data_blob_prefix": "output/output_regular_reports_2024-04-24T02:38:04-07:00.avro",
    "input_data_bucket_name": "tracking_tf_state_bucket",
    "output_data_blob_prefix": "output/summary_report.avro",
    "output_data_bucket_name": "tracking_tf_state_bucket",
    "postback_url": "",
    "result_info": {
        "return_code": "PRIVACY_BUDGET_ERROR",
        "return_message": "com.google.aggregate.adtech.worker.exceptions.AggregationJobProcessException: Exception while consuming privacy budget. Exception message: TRANSACTION_MANAGER_RETRIES_EXCEEDED \n com.google.aggregate.adtech.worker.aggregation.concurrent.ConcurrentAggregationProcessor.consumePrivacyBudgetUnits(ConcurrentAggregationProcessor.java:466) \n com.google.aggregate.adtech.worker.aggregation.concurrent.ConcurrentAggregationProcessor.process(ConcurrentAggregationProcessor.java:329) \n com.google.aggregate.adtech.worker.WorkerPullWorkService.run(WorkerPullWorkService.java:142)\nThe root cause is: com.google.scp.operator.cpio.distributedprivacybudgetclient.TransactionEngine$TransactionEngineException: TRANSACTION_MANAGER_RETRIES_EXCEEDED \n com.google.scp.operator.cpio.distributedprivacybudgetclient.TransactionEngineImpl.proceedToNextPhase(TransactionEngineImpl.java:100) \n com.google.scp.operator.cpio.distributedprivacybudgetclient.TransactionEngineImpl.executeDistributedPhase(TransactionEngineImpl.java:196) \n com.google.scp.operator.cpio.distributedprivacybudgetclient.TransactionEngineImpl.executeCurrentPhase(TransactionEngineImpl.java:138)",
        "error_summary": {
            "error_counts": [],
            "error_messages": []
        },
        "finished_at": "2024-05-16T01:29:35.113618072Z"
    },
    "job_parameters": {
        "output_domain_blob_prefix": "domain/output_local_domain.avro",
        "output_domain_bucket_name": "tracking_tf_state_bucket",
        "attribution_report_to": "https://adservice-1.stratus.qa.ebay.com"
    },
    "request_processing_started_at": "2024-05-16T01:20:00.743721759Z"
}
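
Note that the response above has "job_status": "FINISHED" even though the job failed, so the status alone is not enough to tell whether a job succeeded. The following is a rough sketch of pulling the relevant fields out of such a response. It assumes the response has already been parsed from JSON and that the root cause follows the "The root cause is:" marker, as it does in this message. The function name is hypothetical.

```python
def summarize_result(job: dict) -> dict:
    """Extract the status, return code, and root-cause line from a job response."""
    info = job.get("result_info") or {}
    message = info.get("return_message", "")

    root_cause = None
    marker = "The root cause is:"
    if marker in message:
        # Keep only the first line after the marker; the rest is stack frames.
        tail = message.split(marker, 1)[1].strip()
        root_cause = tail.splitlines()[0].strip() if tail else None

    return {
        "job_status": job.get("job_status"),
        "return_code": info.get("return_code"),
        "root_cause": root_cause,
    }
```

For the response above, this would surface PRIVACY_BUDGET_ERROR as the return code and the TransactionEngineException line as the root cause, without the surrounding stack trace.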

The reports and domain.avro files are as follows: avro report, output_domain.avro

BTW, where can I see the detailed logs for each job in the Google Cloud console? I can't find them anywhere. Thanks a lot!

yanghuang1028 commented 4 months ago

The job can be processed now, thanks a lot!