googleapis / google-cloud-java

Google Cloud Client Library for Java
https://cloud.google.com/java/docs/reference
Apache License 2.0

Authentication error after upgrading to 0.23.1 #2453

Closed lbergelson closed 5 years ago

lbergelson commented 6 years ago

We've started seeing an authentication error in our project after we upgraded to 0.23.1, the issue also seems to be present in 0.24.0. Reverting to 0.22.0 solves the issue.

We start seeing the following 404 error when running a spark application that uses NIO to access gcs files:

code:      0
message:   Error code 404 trying to get security access token from Compute Engine metadata for the default service account. This may be because the virtual machine instance does not have permission scopes specified.
reason:    null
location:  null
retryable: false
com.google.cloud.storage.StorageException: Error code 404 trying to get security access token from Compute Engine metadata for the default service account. This may be because the virtual machine instance does not have permission scopes specified.
    at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:189)
    at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:339)
    at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:197)
    at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:194)
    at shaded.cloud_nio.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:91)
    at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:54)
    at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:194)
    at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.checkAccess(CloudStorageFileSystemProvider.java:614)
    at java.nio.file.Files.exists(Files.java:2385)
    at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:346)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.<init>(ReadsDataSource.java:206)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.<init>(ReadsDataSource.java:162)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.<init>(ReadsDataSource.java:118)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.<init>(ReadsDataSource.java:87)
    at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:182)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReads(GATKSparkTool.java:390)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:370)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:360)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
    at org.broadinstitute.hellbender.Main.main(Main.java:233)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Error code 404 trying to get security access token from Compute Engine metadata for the default service account. This may be because the virtual machine instance does not have permission scopes specified.
    at shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials.refreshAccessToken(ComputeEngineCredentials.java:137)
    at shaded.cloud_nio.com.google.auth.oauth2.OAuth2Credentials.refresh(OAuth2Credentials.java:160)
    at shaded.cloud_nio.com.google.auth.oauth2.OAuth2Credentials.getRequestMetadata(OAuth2Credentials.java:146)
    at shaded.cloud_nio.com.google.auth.http.HttpCredentialsAdapter.initialize(HttpCredentialsAdapter.java:96)
    at com.google.cloud.http.HttpTransportOptions$1.initialize(HttpTransportOptions.java:157)
    at shaded.cloud_nio.com.google.api.client.http.HttpRequestFactory.buildRequest(HttpRequestFactory.java:93)
    at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.buildHttpRequest(AbstractGoogleClientRequest.java:300)
    at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
    at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
    at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
    at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:337)
    ... 32 more
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [cb87810a-0133-42b3-a954-363b62adce39] entered state [ERROR] while waiting for [DONE].

Looking at the dependency updates in this project, it seems like one of the auth libraries was updated to version 0.8.0. Could that be causing the issue?

Is there some new configuration setting we should be using in our gcloud project? Any help would be appreciated.
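For reference while debugging this class of 404: the token the library fails to fetch comes from the GCE metadata server. A minimal sketch of the endpoint involved (path per the public metadata-server docs; the class name here is hypothetical); curl'ing the same URL with the Metadata-Flavor header from an affected VM is a quick way to check whether the default service account and its scopes are visible:

```java
import java.net.URL;

public class MetadataTokenEndpoint {
    public static void main(String[] args) throws Exception {
        // The endpoint ComputeEngineCredentials queries for an access token
        // on GCE. Requests must carry the header "Metadata-Flavor: Google";
        // a 404 from this path usually means the VM has no default service
        // account or was created without the required scopes.
        String tokenUrl = "http://metadata.google.internal/computeMetadata/v1/"
                + "instance/service-accounts/default/token";
        URL url = new URL(tokenUrl);
        System.out.println(url.getHost());
        System.out.println(url.getPath());
    }
}
```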

yihanzhen commented 6 years ago

@droazen The Dataproc team has been working on a fix for what you saw in the previous comment since I last heard from them. I'll follow up with them and get back to you. I really appreciate your patience.

yihanzhen commented 6 years ago

This is from the dataproc team:

The initialization action has been updated to force a Dataproc agent restart after installing the custom connector. I've just verified that job submission is operating correctly after upgrading the connector using the new version of the init action.

lbergelson commented 6 years ago

@hzyi-google

I tried creating a cluster with the initialization action:

gcloud dataproc clusters create louisb-test-cluster --initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh --metadata 'gcs-connector-version=1.8.1' --metadata 'bigquery-connector-version=0.12.1' --num-workers 4 --num-masters 1 --image-version 1.2

The error that @droazen saw is gone, but the 404 error has returned.

com.google.cloud.storage.StorageException: Error code 404 trying to get security access token from Compute Engine metadata for the default service account. This may be because the virtual machine instance does not have permission scopes specified.
    at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:220)
    at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:415)
    at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:198)
    at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:195)
    at shaded.cloud_nio.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:89)
    at com.google.cloud.RetryHelper.run(RetryHelper.java:74)
    at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:51)
    at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:195)
    at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.checkAccess(CloudStorageFileSystemProvider.java:614)
    at java.nio.file.Files.exists(Files.java:2385)
    at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:429)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.<init>(ReadsDataSource.java:206)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.<init>(ReadsDataSource.java:162)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.<init>(ReadsDataSource.java:118)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.<init>(ReadsDataSource.java:87)
    at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:214)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReads(GATKSparkTool.java:469)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:449)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:439)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Error code 404 trying to get security access token from Compute Engine metadata for the default service account. This may be because the virtual machine instance does not have permission scopes specified.
    at shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials.refreshAccessToken(ComputeEngineCredentials.java:152)
    at shaded.cloud_nio.com.google.auth.oauth2.OAuth2Credentials.refresh(OAuth2Credentials.java:175)
    at shaded.cloud_nio.com.google.auth.oauth2.OAuth2Credentials.getRequestMetadata(OAuth2Credentials.java:161)
    at shaded.cloud_nio.com.google.auth.http.HttpCredentialsAdapter.initialize(HttpCredentialsAdapter.java:96)
    at com.google.cloud.http.HttpTransportOptions$1.initialize(HttpTransportOptions.java:161)
    at com.google.cloud.http.CensusHttpModule$CensusHttpRequestInitializer.initialize(CensusHttpModule.java:120)
    at shaded.cloud_nio.com.google.api.client.http.HttpRequestFactory.buildRequest(HttpRequestFactory.java:93)
    at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.buildHttpRequest(AbstractGoogleClientRequest.java:300)
    at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
    at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
    at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
    at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:412)
    ... 34 more
yihanzhen commented 6 years ago

Ack.

medb commented 6 years ago

@lbergelson Could you provide an up-to-date build config for your project and the output of its dependency tree?

Also, you could try creating a cluster with the latest GCS and BQ connector versions, 1.9.0 and 0.13.0 respectively.

medb commented 6 years ago

Looking into provided repro: https://github.com/jean-philippe-martin/nio-auth-repro

lbergelson commented 6 years ago

@medb @jean-philippe-martin Thank you both.

medb commented 6 years ago

This is not a Dataproc-specific issue; it is reproducible on a vanilla GCE VM too.

Presumably the issue is caused by a bug in the shading configuration of newer versions of the google-cloud-nio library.

The issue is not reproducible locally because local runs use client authentication rather than a service account.

Fix is here: https://github.com/jean-philippe-martin/nio-auth-repro/pull/1

lbergelson commented 6 years ago

I don't understand what the issue with the shaded jar is. We're unable to use the unshaded jar due to dependency conflicts.

Is there anyone who understands what the issue around the shading is? I discovered one bug in the shaded jar, but fixing it doesn't solve this issue. (https://github.com/GoogleCloudPlatform/google-cloud-java/issues/3540)

One thing of note is that if I ssh into the Dataproc master node and run the job there using spark-submit, instead of submitting it through Dataproc, I don't encounter the error.

lbergelson commented 6 years ago

@medb How do you reproduce on a vanilla GCE vm? We have not had any trouble running jobs on GCE with new versions of the library other than on dataproc. It would be useful if we could trigger it outside of dataproc to debug it more easily.

medb commented 6 years ago

I was using this repro app to run it locally, on a GCE VM, and on a Dataproc cluster.

Locally (on my workstation) it always works, but only because it does not use a service account for authentication; on a GCE VM and on Dataproc it does use a service account.
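The behavior described above matches Application Default Credentials resolution: off-GCP the library picks up an explicit key file or the gcloud user credential, while on a GCE VM it falls back to the metadata server. A simplified sketch of the documented resolution order (the real logic lives in GoogleCredentials.getApplicationDefault(); the helper below is a hypothetical model, not the library's code):

```java
import java.io.File;
import java.nio.file.Paths;

public class AdcResolutionSketch {
    // Simplified model of the documented ADC resolution order.
    static String resolve() {
        if (System.getenv("GOOGLE_APPLICATION_CREDENTIALS") != null) {
            return "service-account key file";   // explicit key file wins
        }
        File gcloudCred = Paths.get(System.getProperty("user.home"),
                ".config", "gcloud", "application_default_credentials.json").toFile();
        if (gcloudCred.exists()) {
            return "gcloud user credentials";    // what a workstation typically uses
        }
        // On a GCE VM or Dataproc node neither of the above exists, so the
        // library falls back to the metadata server: the only path that can
        // produce the 404 reported in this issue.
        return "compute-engine metadata server";
    }

    public static void main(String[] args) {
        System.out.println("ADC source: " + resolve());
    }
}
```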

jean-philippe-martin commented 6 years ago

@medb you're right, I was able to reproduce on a vanilla GCP VM. I started with an Ubuntu 16.04 image with "Allow full access to all Cloud APIs" enabled, then installed openJDK on it and ran the jar - it failed as before, complaining about potential misconfigured scopes.

I also checked the jar that doesn't use the shaded version of NIO, and this worked.

Just to check, I then typed "gcloud auth application-default login", and ran it again. This time, it worked beautifully.

So it looks like there's something about the combination of default service account and shaded NIO jar that causes issues when running on Cloud.

jean-philippe-martin commented 6 years ago

I had an interesting result playing with the way the shading is done for the NIO jar.

The current NIO jar uses a <pattern>com</pattern> relocation in its maven-shade-plugin configuration. I changed it to shade exactly the same things, but via two more specific patterns: <pattern>com.google</pattern> and <pattern>com.fasterxml</pattern>.

With the resulting jar, my repro app's test passes, both in a GCE VM and when run via Dataproc.

This is very promising. The next step is to try this with GATK itself and see if that solves the problem.
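Sketched concretely, the relocation change described above might look like this in the maven-shade-plugin configuration (the exact pom layout is an assumption, and any excludes the real build needs are omitted; the shaded.cloud_nio prefix matches the package names in the stack traces):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- Previously a single broad relocation was used, roughly:
           pattern "com" with shadedPattern "shaded.cloud_nio.com".
           The narrower patterns below shade the same packages while
           leaving other names beginning with "com" alone. -->
      <relocation>
        <pattern>com.google</pattern>
        <shadedPattern>shaded.cloud_nio.com.google</shadedPattern>
      </relocation>
      <relocation>
        <pattern>com.fasterxml</pattern>
        <shadedPattern>shaded.cloud_nio.com.fasterxml</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```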

medb commented 6 years ago

Great catch, this should be it! Broad patterns like com, org, or io are usually too aggressive for shading.

droazen commented 6 years ago

@jean-philippe-martin That's great news! If you open a PR here, we'd be happy to try out the patch with GATK on our end.

jean-philippe-martin commented 6 years ago

@droazen @medb I put up some code at https://github.com/GoogleCloudPlatform/google-cloud-java/pull/3568 with this change; the repro works (the getBucket method passes).

droazen commented 6 years ago

@jean-philippe-martin I tested your branch using GATK, and it appears to completely resolve this issue!

jean-philippe-martin commented 6 years ago

@droazen what wonderful news! That's great!

jean-philippe-martin commented 6 years ago

I was able to identify the root cause: https://issues.apache.org/jira/browse/MSHADE-156
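As I understand it, MSHADE-156 concerns the shade plugin relocating string constants that happen to match a relocation pattern, not just class references; which exact constant was mangled in this case isn't spelled out in the thread. A toy model of the failure mode (the relocate helper below is hypothetical, and the real plugin's matching rules are more nuanced):

```java
public class RelocationStringExample {
    // Naive model of prefix relocation applied to a string constant.
    static String relocate(String literal, String pattern, String shadedPattern) {
        return literal.startsWith(pattern)
                ? shadedPattern + literal.substring(pattern.length())
                : literal;
    }

    public static void main(String[] args) {
        // A string literal that merely mentions a package name...
        String literal = "com.google.auth.oauth2.ComputeEngineCredentials";
        // ...is rewritten by the overly broad "com" pattern, so any code that
        // uses the literal (e.g. for reflection or configuration) misbehaves.
        System.out.println(relocate(literal, "com", "shaded.cloud_nio.com"));
    }
}
```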

droazen commented 6 years ago

I've confirmed that this issue is resolved with the latest release (0.59.0), so this can finally be closed! Thanks to everyone for their assistance over the past year!

lbergelson commented 5 years ago

This long nightmare is over, but I forgot to close the issue. Closing it. Thank you everyone.