tony-framework / TonY

TonY is a framework to natively run deep learning frameworks on Apache Hadoop.
https://tony-project.ai

Feature request: Add Google Cloud Bucket support #93

Closed gogasca closed 5 years ago

gogasca commented 5 years ago

When running Dataproc with Google Cloud, it would be ideal to keep files in a company GCS bucket, whether private or public.

Support for:

Since some GCS buckets are not public, it may be necessary to pass credentials (a JSON key file) as a separate parameter. Code sample here.

gcloud dataproc jobs submit hadoop --cluster tony-staging \
--class com.linkedin.tony.cli.ClusterSubmitter \
--jars gs://tony-staging/tony-cli-0.1.5-all.jar -- \
--python_venv=gs://tony-staging/env/tf19.zip \
--src_dir=gs://tony-staging/tony/mnist/src/ \
--executes=gs://tony-staging/tony/mnist/src/mnist_distributed.py \
--conf_file=gs://tony-staging/tony/conf/tony.xml \
--python_binary_path=tf19/bin/python3.5

Related to #74
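On the credentials point: with the Cloud Storage connector, a service-account key is usually supplied through Hadoop configuration rather than a job flag. A minimal core-site.xml sketch, assuming the connector's documented auth properties; the keyfile path is a placeholder:

```xml
<configuration>
  <!-- Authenticate to GCS with a service account key (path is a placeholder) -->
  <property>
    <name>google.cloud.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>google.cloud.auth.service.account.json.keyfile</name>
    <value>/path/to/service-account-key.json</value>
  </property>
</configuration>
```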

oliverhu commented 5 years ago

We added a flag in #59 that lets you pass a list of resources (which will be localized to the containers from HDFS). So for --jars, --src_dir, and --conf_file, you can just use:

--resources  hdfs://tony-staging/tony-cli-0.1.5-all.jar
--resources hdfs://tony-staging/tony/mnist/src
--resources hdfs://tony-staging/tony/conf/tony.xml

I have never tested whether the default gs:// scheme works well with YARN, though.

oliverhu commented 5 years ago

For gs:// support, it looks like it should work out of the box if you're running inside Google Cloud (I suppose Google configures its Hadoop clusters with the Cloud Storage connector)?

Ref: https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
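If the connector jar is on the classpath but gs:// paths still aren't recognized, the scheme typically has to be registered in core-site.xml. A sketch, assuming the connector's standard class names from its documentation:

```xml
<configuration>
  <!-- Register the gs:// scheme with the Cloud Storage connector classes -->
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  </property>
</configuration>
```

Dataproc clusters normally ship with this preconfigured, which would explain why --jars with a gs:// path already works.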

gogasca commented 5 years ago

Hi @oliverhu

When I run this command

gcloud dataproc jobs submit hadoop --cluster tony-staging \
--class com.linkedin.tony.cli.ClusterSubmitter \
--jars gs://tony-staging/tony-cli-0.1.5-all.jar -- \
--python_venv=/usr/local/src/jobs/TFJob/env/tf112.zip \
--src_dir=/usr/local/src/jobs/TFJob/src \
--executes=/usr/local/src/jobs/TFJob/src/mnist_distributed.py \
--task_params='--data_dir /tmp/data/ --working_dir /tmp/output' \
--conf_file=/usr/local/src/jobs/TFJob/tony.xml \
--python_binary_path=tf112/bin/python3.5

Dataproc can read GCS paths by default, but only for the --jars file.

18/11/20 17:51:19 INFO cli.ClusterSubmitter: Copying /tmp/b1f13f25d47f40e7a6fc4f6c53d95e7d/tony-cli-0.1.5-all.jar to: hdfs://tony-staging-m/user/root/.tony/718d7b18-e35b-49d1-9baa-6e3be9b7d5e6

Questions:

1. Can you clarify a bit how the --resources flag works? I'm not sure how to add or use it.
2. My understanding is that I can submit a PR to add support for reading GCS paths from the parameters used by the TonY client?

gcloud dataproc jobs submit hadoop --cluster tony-staging \
--class com.linkedin.tony.cli.ClusterSubmitter \
--jars gs://tony-staging/tony-cli-0.1.5-all.jar -- \
--python_venv=gs://tony-staging/env/tf19.zip \
--src_dir=gs://tony-staging/tony/mnist/src/ \
--executes=gs://tony-staging/tony/mnist/src/mnist_distributed.py \
--conf_file=gs://tony-staging/tony/conf/tony.xml \
--python_binary_path=tf19/bin/python3.5

Example output

Job [b1f13f25d47f40e7a6fc4f6c53d95e7d] submitted.
Waiting for job output...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/tmp/b1f13f25d47f40e7a6fc4f6c53d95e7d/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/11/20 17:51:18 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
18/11/20 17:51:18 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, b1f13f25d47f40e7a6fc4f6c53d95e7d-conf.xml, hdfs-default.xml, hdfs-site.xml, /etc/
hadoop/conf/core-site.xml, /etc/hadoop/conf/hdfs-site.xml
18/11/20 17:51:19 INFO cli.ClusterSubmitter: Copying /tmp/b1f13f25d47f40e7a6fc4f6c53d95e7d/tony-cli-0.1.5-all.jar to: hdfs://tony-staging-m/user/root/.tony/718d7b18-e35b-49d1-9baa-6e3be9b7d5e6
18/11/20 17:51:20 INFO tony.TonyClient: TonY heartbeat interval [1000]
18/11/20 17:51:20 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
oliverhu commented 5 years ago

@gogasca

  1. I'll update the wiki on how to use the --resources flag sometime today.
  2. Sure! Feel free to submit a PR and ask questions :-)
gogasca commented 5 years ago

We rebuilt the jar file successfully with the Google Cloud connector, but it looks like we may need to implement GCS support in TonY. Currently we are not able to read from a remote GCS path (gs://):

external: [
"assertj": "org.assertj:assertj-core:3.6.2",
"avro": "org.apache.avro:avro:1.8.2",
"awaitility": "org.awaitility:awaitility:2.0.0",
"commons_io": "commons-io:commons-io:2.4",
"guava": "com.google.guava:guava:16.0.1",
"guice": "com.google.inject:guice:4.1.0",
"jackson_databind": "com.fasterxml.jackson.core:jackson-databind:2.8.3",
"jackson_dataformat_yaml": "com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.9.6",
// Only needed by Hadoop test classes
"junit": "junit:junit:4.12",
"log4j": "log4j:log4j:1.2.17",
"metrics": "com.codahale.metrics:metrics-core:3.0.2",
"mockito": "org.mockito:mockito-core:2.23.0",
"objenesis": "org.objenesis:objenesis:2.6",
"playguice": "com.typesafe.play:play-guice$scalaVersion:$playVersion",
"playlogback": "com.typesafe.play:play-logback$scalaVersion:$playVersion",
"py4j": "net.sf.py4j:py4j:0.8.2.1",
"sshd": "org.apache.sshd:sshd-core:1.1.0",
"testng": "org.testng:testng:6.4",
"text": "org.apache.commons:commons-text:1.4",
"zip4j": "net.lingala.zip4j:zip4j:1.3.2",
"gcs-connector": "com.google.cloud.bigdataoss:hadoop3-1.9.10"
]

When we launch the job:

gcloud dataproc jobs submit hadoop --cluster tony-staging \
--class com.linkedin.tony.cli.ClusterSubmitter \
--jars gs://tony-staging/tony-cli-0.1.5-all.jar -- \
--python_venv=gs://tony-staging/tensorflow/tf19.zip \
--src_dir=/usr/local/src/jobs/TFJob/src \
--executes=gs://tony-staging/tensorflow/mnist_distributed.py \
--task_params='--data_dir /tmp/data/ --working_dir /tmp/output' \
--conf_file=gs://tony-staging/tensorflow/tony.xml \
--python_binary_path=tf19/bin/python3.5
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/tmp/ba4095888a214fd3a057fce9e6a67896/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/11/30 19:59:49 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
18/11/30 19:59:50 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, ba4095888a214fd3a05
7fce9e6a67896-conf.xml, hdfs-default.xml, hdfs-site.xml, /etc/hadoop/conf/core-site.xml, /etc/hadoop/conf/hdfs-site.xml
18/11/30 19:59:50 INFO cli.ClusterSubmitter: Copying /tmp/ba4095888a214fd3a057fce9e6a67896/tony-cli-0.1.5-all.jar to: hdfs://tony-staging-m/user/root/.tony/eeb04e31-4a8e-477e-b719-f4
a6d8875f3d
18/11/30 19:59:51 INFO tony.TonyClient: TonY heartbeat interval [1000]
18/11/30 19:59:51 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
18/11/30 19:59:51 INFO tony.TonyClient: Starting client..
18/11/30 19:59:51 INFO client.RMProxy: Connecting to ResourceManager at tony-staging-m/10.138.0.2:8032
18/11/30 19:59:51 INFO client.AHSProxy: Connecting to Application History server at tony-staging-m/10.138.0.2:10200
18/11/30 19:59:51 ERROR tony.TonyClient: Failed to run TonyClient
java.io.FileNotFoundException: gs:/tony-staging/tensorflow/tf19.zip (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at com.linkedin.tony.TonyClient.addFileToZip(TonyClient.java:508)
        at com.linkedin.tony.TonyClient.zipArchive(TonyClient.java:485)
        at com.linkedin.tony.TonyClient.run(TonyClient.java:170)
        at com.linkedin.tony.TonyClient.start(TonyClient.java:718)
        at com.linkedin.tony.cli.ClusterSubmitter.submit(ClusterSubmitter.java:72)
        at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:85)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
18/11/30 19:59:51 ERROR tony.TonyClient: Application failed to complete successfully
ERROR: (gcloud.dataproc.jobs.submit.hadoop) Job [ba4095888a214fd3a057fce9e6a67896] entered state [ERROR] while waiting for [DONE].
gogasca@cloudshell:~ (dpe-cloud-mle)$ gcloud dataproc jobs submit hadoop --cluster tony-staging \
> --class com.linkedin.tony.cli.ClusterSubmitter \
> --jars gs://tony-staging/tony-cli-0.1.5-all.jar -- \
> --python_venv=gs:\/\/tony-staging/tensorflow/tf19.zip \
oliverhu commented 5 years ago

@gogasca

Time to try the tony.worker.resources flag. Could you try this:

gcloud dataproc jobs submit hadoop --cluster tony-staging \
--class com.linkedin.tony.cli.ClusterSubmitter \
--jars gs://tony-staging/tony-cli-0.1.5-all.jar -- \
--src_dir=/usr/local/src/jobs/TFJob/src \
--task_params='--data_dir /tmp/data/ --working_dir /tmp/output' \
--conf_file PATH_TO_LOCAL_CONF_FILE \
--conf tony.worker.resources=gs://tony-staging/tensorflow/ \
--conf tony.ps.resources=gs://tony-staging/tensorflow/ \
--executes 'unzip tf19.zip && tf19/bin/python3.5 mnist_distributed.py'

I don't think it makes sense to store the conf in gs://, though.

gogasca commented 5 years ago

I gave public access to our bucket gs://tony-staging/ @frankyn

#gsutil ls gs://tony-staging/

gs://tony-staging/tony-cli-0.1.5-all.jar
gs://tony-staging/tony.xml
gs://tony-staging/google-cloud-dataproc-metainfo/
gs://tony-staging/pytorch/
gs://tony-staging/tensorflow/

But I got this error in one of the containers:

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "AMRM Callback Handler Thread" org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.IllegalArgumentException: Wrong FS: gs://tony-staging/tensorflow, expected: hdfs://tony-staging-m
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:368)
Caused by: java.lang.IllegalArgumentException: Wrong FS: gs://tony-staging/tensorflow, expected: hdfs://tony-staging-m
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:774)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:215)
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:983)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:116)
    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1050)
    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1047)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1057)
    at com.linkedin.tony.Utils.addResource(Utils.java:396)
    at com.linkedin.tony.TonyApplicationMaster$ContainerLauncher.run(TonyApplicationMaster.java:1034)
    at com.linkedin.tony.TonyApplicationMaster$RMCallbackHandler.onContainersAllocated(TonyApplicationMaster.java:979)

Complete error:

gcloud dataproc jobs submit hadoop --cluster tony-staging \
--class com.linkedin.tony.cli.ClusterSubmitter \
--jars gs://tony-staging/tony-cli-0.1.5-all.jar -- \
--src_dir=/usr/local/src/jobs/TFJob/src \
--task_params='--data_dir /tmp/data/ --working_dir /tmp/output' \
--conf_file=/usr/local/src/jobs/TFJob/tony.xml \
--conf tony.worker.resources=gs://tony-staging/tensorflow/ \
--conf tony.ps.resources=gs://tony-staging/tensorflow/ \
--executes 'unzip tf19.zip && tf19/bin/python3.5 mnist_distributed.py'
Job [4225e81233524ef6acf50e1ab86b84c2] submitted.
Waiting for job output...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/tmp/4225e81233524ef6acf50e1ab86b84c2/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/12/01 22:56:39 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
18/12/01 22:56:40 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, 4225e81233524ef6acf50e1ab86b84c2-conf.xml, hdfs-default.xml, hdfs-site.xml, /etc/
hadoop/conf/core-site.xml, /etc/hadoop/conf/hdfs-site.xml
18/12/01 22:56:40 INFO cli.ClusterSubmitter: Copying /tmp/4225e81233524ef6acf50e1ab86b84c2/tony-cli-0.1.5-all.jar to: hdfs://tony-staging-m/user/root/.tony/7cdb1869-0471-42cb-ad98-41c609884a81
18/12/01 22:56:41 INFO tony.TonyClient: TonY heartbeat interval [1000]
18/12/01 22:56:41 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
18/12/01 22:56:41 INFO tony.TonyClient: Starting client..
18/12/01 22:56:41 INFO client.RMProxy: Connecting to ResourceManager at tony-staging-m/10.138.0.2:8032
18/12/01 22:56:42 INFO client.AHSProxy: Connecting to Application History server at tony-staging-m/10.138.0.2:10200
18/12/01 22:56:42 INFO tony.TonyClient: Completed setting up Application Master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.TonyApplicationMaster --task_params '--data_dir /tmp/data/ --worki
ng_dir /tmp/output' --executes unzip tf19.zip && tf19/bin/python3.5 mnist_distributed.py --hdfs_classpath hdfs://tony-staging-m/user/root/.tony/7cdb1869-0471-42cb-ad98-41c609884a81 --container_env TONY_CONF_PATH=hdfs://tony-staging-m/user/root/
.tony/application_1542587994073_0053/tony-final.xml --container_env TONY_CONF_TIMESTAMP=1543705002906 --container_env TONY_CONF_LENGTH=181585 --container_env TONY_ZIP_PATH=hdfs://tony-staging-m/user/root/.tony/application_1542587994073_0053/ton
y.zip --container_env TONY_ZIP_LENGTH=7331 --container_env TONY_ZIP_TIMESTAMP=1543705002447 --container_env CLASSPATH={{CLASSPATH}}<CPS>./*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS
>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/* 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log
18/12/01 22:56:42 INFO tony.TonyClient: Submitting YARN application
18/12/01 22:56:42 INFO impl.YarnClientImpl: Submitted application application_1542587994073_0053
18/12/01 22:56:42 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://tony-staging-m:8088/proxy/application_1542587994073_0053/
18/12/01 22:56:42 INFO tony.TonyClient: ResourceManager web address for application: http://tony-staging-m:8088/cluster/app/application_1542587994073_0053
18/12/01 22:56:47 INFO tony.TonyClient: AM host: tony-staging-w-0.c.dpe-cloud-mle.internal
18/12/01 22:56:47 INFO tony.TonyClient: AM RPC port: 12886
18/12/01 22:56:47 INFO client.RMProxy: Connecting to ResourceManager at tony-staging-m/10.138.0.2:8032
18/12/01 22:56:47 INFO client.AHSProxy: Connecting to Application History server at tony-staging-m/10.138.0.2:10200
18/12/01 23:09:56 INFO retry.RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "tony-staging-m.c.dpe-cloud-mle.internal/10.138.0.2"; destination host is: "tony-staging-w-0.c.dpe-cloud-mle.internal":12886
; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException, while invoking TensorFlowClusterPBClientImpl.getTaskUrls over null. Retrying after sleeping for 30000ms.
18/12/01 23:10:27 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:10:28 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:10:29 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:10:30 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:10:31 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:10:32 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:10:33 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:10:34 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:10:35 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:10:36 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:10:36 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From tony-staging-m.c.dpe-cloud-mle.internal/10.138.0.2 to tony-staging-w-0.c.dpe-cloud-mle.internal:12886 failed on connection exception: java.net.ConnectExce
ption: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking TensorFlowClusterPBClientImpl.getTaskUrls over null. Retrying after sleeping for 30000ms.
18/12/01 23:11:07 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:08 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:09 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:10 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:11 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:12 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:13 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:14 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:15 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:16 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:16 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From tony-staging-m.c.dpe-cloud-mle.internal/10.138.0.2 to tony-staging-w-0.c.dpe-cloud-mle.internal:12886 failed on connection exception: java.net.ConnectExce
ption: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking TensorFlowClusterPBClientImpl.getTaskUrls over null. Retrying after sleeping for 30000ms.
18/12/01 23:11:47 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:48 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:49 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:50 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:51 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:52 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:53 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:54 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:55 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:56 INFO ipc.Client: Retrying connect to server: tony-staging-w-0.c.dpe-cloud-mle.internal/10.138.0.4:12886. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/01 23:11:56 ERROR tony.TonyClient: Failed to run TonyClient
java.net.ConnectException: Call From tony-staging-m.c.dpe-cloud-mle.internal/10.138.0.2 to tony-staging-w-0.c.dpe-cloud-mle.internal:12886 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  htt
p://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:754)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1500)
        at org.apache.hadoop.ipc.Client.call(Client.java:1442)
        at org.apache.hadoop.ipc.Client.call(Client.java:1352)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy16.getTaskUrls(Unknown Source)
        at com.linkedin.tony.rpc.impl.pb.client.TensorFlowClusterPBClientImpl.getTaskUrls(TensorFlowClusterPBClientImpl.java:75)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy17.getTaskUrls(Unknown Source)
        at com.linkedin.tony.rpc.impl.ApplicationRpcClient.getTaskUrls(ApplicationRpcClient.java:109)
        at com.linkedin.tony.TonyClient.monitorApplication(TonyClient.java:644)
        at com.linkedin.tony.TonyClient.run(TonyClient.java:208)
        at com.linkedin.tony.TonyClient.start(TonyClient.java:718)
        at com.linkedin.tony.cli.ClusterSubmitter.submit(ClusterSubmitter.java:72)
        at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:85)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:690)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:793)
        at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:411)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1557)
        at org.apache.hadoop.ipc.Client.call(Client.java:1388)
        ... 25 more
18/12/01 23:11:56 ERROR tony.TonyClient: Application failed to complete successfully
oliverhu commented 5 years ago

Some ideas from our side; the related error in the TonY code base:

Caused by: java.lang.IllegalArgumentException: Wrong FS: gs://tony-staging/tensorflow, expected: hdfs://tony-staging-m
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1057)
    at com.linkedin.tony.Utils.addResource(Utils.java:396)

Basically, the Hadoop FileSystem configuration is not right: the code still uses the HDFS implementation of FileSystem for the gs:// path.
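To illustrate the root cause (a sketch, not TonY's actual code): the "Wrong FS" error happens when a path is handed to a FileSystem obtained from the cluster default (HDFS here) instead of one resolved from the path's own scheme. In Hadoop the fix would be `new Path(resource).getFileSystem(conf)` rather than `FileSystem.get(conf)`. Since running real Hadoop code needs a cluster classpath, the self-contained example below only demonstrates the scheme-resolution idea with java.net.URI:

```java
import java.net.URI;

public class SchemeResolution {
    // The "Wrong FS" error comes from passing a gs:// path to a FileSystem
    // obtained via FileSystem.get(conf), which returns the *default* FS
    // (hdfs://tony-staging-m in this thread). The Hadoop-side fix would be
    // to resolve the FileSystem from the path's own scheme:
    //   FileSystem fs = new Path(resource).getFileSystem(conf);
    static String scheme(String path) {
        return URI.create(path).getScheme();
    }

    public static void main(String[] args) {
        // Each resource should be handled by the FS matching its scheme.
        System.out.println(scheme("gs://tony-staging/tensorflow"));    // gs
        System.out.println(scheme("hdfs://tony-staging-m/user/root")); // hdfs
    }
}
```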

gogasca commented 5 years ago

Thanks for taking a look, I will review it. It seems mapreduce.job.working.dir may help.

oliverhu commented 5 years ago

I think this can be closed now.

josezenteno1992github commented 4 years ago

Hi everyone, do you know if this is resolved? I'm trying to have a Dataproc Hadoop Sqoop job read a password from a GCS bucket (--username=$USERNAME --password-file=gs://test/passwordFile.txt), but I keep getting the error Wrong FS: gs://test/passwordFile.txt, expected: hdfs://trailer-m. Any advice is appreciated!

oliverhu commented 4 years ago

@gogasca idea?

kumgaurav commented 3 years ago

Try like this:

val hadoop_conf = spark.sparkContext.hadoopConfiguration
hadoop_conf.set("fs.default.name", bucketName);
val fs = new Path(bucketName).getFileSystem(hadoop_conf)